system

What is AI technical title?
AI technical title is built by Patsnap AI team. It summarizes the technical point description of the patent document.
The bilingual dialogue system addresses language learning challenges by using native and target languages, real-time translation, and interactive environments to enhance comprehension and motivation.

JP2026096665APending Publication Date: 2026-06-15SOFTBANK GROUP CORP

View PDF 1 Cites 0 Cited by

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-03
Publication Date: 2026-06-15

Application Information

Patent Timeline

03 Dec 2024

Application

15 Jun 2026

Publication

JP2026096665A

IPC: G09B19/06; G06F40/58; G06F3/16; G10L15/00; G10L25/60; G06Q50/10

AI Tagging

Application Domain

Natural language translation Data processing applications

Explore More Agents

Novelty Search
Search existing technologies and assess novelty
↗
FTO
Analyze whether a product may infringe others' patents
↗
Design FTO
Check prior-design risk for exterior design
↗
Drafting
Draft patent application text based on a technical solution
↗
Find Solutions with TRIZ
Generate feasible solution to solve your technical challenge
↗

Similar Technology Patents

Get free access to AI patent search and analysis

Check patentability, review prior art and ask IP Agent with full patent context.

AI Technical Summary

Technical Problem

Current foreign language learning systems pose high psychological and linguistic hurdles for beginners due to exchanges only in the target language, leading to reduced motivation and efficiency.

Method used

A bilingual dialogue system using both the user's native language and the target language, with real-time translation, grammatical and pronunciation feedback, and virtual/augmented reality environments for interactive learning.

Benefits of technology

Enhances language comprehension and acquisition by providing personalized learning plans and emotionally responsive interactions, improving user motivation and learning effectiveness.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure 2026096665000001_ABST

Patent Text Reader

Abstract

This system aims to reduce communication difficulties faced by beginners in foreign language learning, as well as the lack of understanding that arises from monolingual learning methods. [Solution] The specific processing unit 290 of the data processing device 12 in the system performs the following: setting processing to enable the selection of the native language and the language to be learned; processing to analyze the user's voice input and convert it into text data by a voice recognition means; generation processing to create an appropriate response based on the text data; translation and display processing to translate the generated response into the native language and the language to be learned and display it; analysis and feedback processing to analyze the grammar and pronunciation of the user's utterances and provide feedback; environment provision processing to provide a virtual reality or augmented reality environment and enable the user to interact; and recording and plan provision processing to record the user's learning progress and provide an individualized learning plan.

Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, the method including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In foreign language learning, it is to reduce communication difficulties faced by beginners and lack of understanding associated with a learning method using only a single language. In particular, many current systems are based on exchanges only in the target language to be learned, which poses high psychological and linguistic hurdles for users at the initial stage of learning. As a result, users are likely to lose motivation for language acquisition, leading to problems of reduced efficiency.

Means for Solving the Problems

[0005] This invention provides a bilingual dialogue system that uses both the user's native language and the target language. Specifically, it includes means for analyzing voice input based on the language set by the user and automatically generating responses using generative AI technology. At the same time, the responses are translated in real time into the native language and the target language and displayed as subtitles, thereby assisting language comprehension. Furthermore, it includes a function to analyze grammatical and pronunciation errors in the user's speech and provide feedback, thereby improving learning effectiveness through actual dialogue. By utilizing virtual reality and augmented reality environments to enable more realistic interactions and proposing individualized learning plans based on learning progress, the invention aims to support efficient language acquisition.

[0006] "Settings" refers to a function used by users to select their native language and the target language they wish to learn.

[0007] "Voice recognition means" refers to technology that analyzes a user's voice input and converts it into text data.

[0008] A "generation method" is a process for automatically creating an appropriate response based on analyzed text data.

[0009] "Translation and display means" refers to a function that translates the generated response into the user's native language and the target language, and displays it visually to the user.

[0010] "Analysis and feedback means" refers to a function that identifies grammatical and pronunciation errors in user utterances and informs the user of areas for improvement.

[0011] "Environment provision means" refers to technologies that provide users with a virtual reality or augmented reality environment to realize a learning experience that is closer to reality.

[0012] A "recording and plan provision system" is a system for tracking a user's learning progress and proposing a customized learning plan based on that data. [Brief explanation of the drawing]

[0013] [Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment. [Figure 2] This is a conceptual diagram showing an example of the essential functions of a data processing device and a smart device according to the first embodiment. [Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment. [Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment. [Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment. [Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment. [Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment. [Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment. [Figure 9] This shows an emotion map where multiple emotions are mapped. [Figure 10] This shows an emotion map where multiple emotions are mapped. [Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1. [Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1. [Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine. [Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which combines an emotion engine. [Modes for carrying out the invention]

[0014] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0015] First, the terms used in the following description will be explained.

[0016] In the following embodiments, the numbered processor (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0017] In the following embodiments, the numbered RAM (Random Access Memory) is a memory in which information is temporarily stored and is used as a work memory by the processor.

[0018] In the following embodiments, the numbered storage is one or more non-volatile storage devices that store various programs and various parameters, etc. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), or magnetic tapes, etc.

[0019] In the following embodiments, the signed communication interface (I / F) is an interface that includes a communication processor and an antenna, etc. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), or Bluetooth (registered trademark).

[0020] In the following embodiments, "A and / or B" is synonymous with "at least one of A and B." That is, "A and / or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and / or B" applies when expressing three or more things linked by "and / or."

[0021] [First Embodiment]

[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0030] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the identification processing unit 290.

[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0034] This invention relates to a bilingual AI teacher system capable of engaging in dialogue in both the user's native language and the target language. This system provides interactive conversations with the user based on the language settings selected by the user. Specific embodiments of the system are described below.

[0035] When a user first accesses the system, the terminal prompts the user to select their native language and the foreign language they wish to learn. This information is sent to the server, which then retrieves the appropriate language resources from its database based on the selections.

[0036] To perform voice input, the user speaks through the microphone. The device acquires the voice data and sends it to the server. The server uses advanced speech recognition technology to convert the speech into text and analyzes its content. Based on this analysis, a generative AI model creates an appropriate response to the user's utterance.

[0037] The generated response is translated in real time by the server. The translated text is available in both the user's native language and the target language, and is sent to the terminal. The terminal displays both languages as subtitles on the user's screen. For example, if the user says "Hello, how are you?", the system will display "Hello, how are you?" as subtitles.

[0038] The server also analyzes the user's language use and identifies grammatical and pronunciation errors. The identified problems, along with suggestions for improvement, are sent to the device as feedback. This allows users to receive excellent feedback on their speech and effectively advance their learning.

[0039] This system can also utilize virtual reality (VR) and augmented reality (AR) environments. When users utilize these environments, the server can provide virtual scenes such as restaurants and airports, allowing them to simulate English conversations within those settings.

[0040] The server continuously records the user's learning progress. Based on this, the server provides the user with a learning plan optimized for the next learning session. The system stores the user's learning history in a database and analyzes this history to create a personalized learning plan. For example, if the user has a weak understanding of certain phrases or words, the next session will be adjusted to focus on those areas.

[0041] This system configuration allows users to learn a foreign language more effectively while receiving support in their native language.

[0042] The following describes the processing flow.

[0043] Step 1:

[0044] The user logs into the system and selects their native language and the language they wish to learn. The terminal receives this selection information and sends it to the server.

[0045] Step 2:

[0046] The server retrieves appropriate language resources from the database based on the user's selected language settings. This allows the system to provide an interface tailored to the user's language environment.

[0047] Step 3:

[0048] The user uses a microphone to speak. The device records the user's voice and sends the audio data to the server.

[0049] Step 4:

[0050] The server converts the received audio data into text using advanced speech recognition technology. This process makes the content of the audio analyzable as text data.

[0051] Step 5:

[0052] The server analyzes the text data converted from the speech and generates an appropriate response using a generative AI model. This response is precisely constructed based on the user's questions and utterances.

[0053] Step 6:

[0054] The server translates the generated response in real time into the user's native language and the language they are learning. The translated text data is then sent to the terminal.

[0055] Step 7:

[0056] The device displays the received translation data as subtitles on the user's screen. This allows the user to see the response to their speech in both their native language and the language they are learning.

[0057] Step 8:

[0058] The server analyzes the user's speech and identifies grammatical and pronunciation errors. It then sends feedback on the identified errors and suggestions for correction to the terminal.

[0059] Step 9:

[0060] The device displays feedback from the server to the user, supporting the user's learning improvement. This process allows the user to improve their language skills in real time.

[0061] Step 10:

[0062] If the user is using a virtual reality (VR) or augmented reality (AR) device, the server provides interaction within a virtual environment. This offers the user the opportunity to learn the language practically in a simulated setting.

[0063] Step 11:

[0064] The server tracks the user's learning progress and stores it in a database. Based on this data, it generates a personalized learning plan for the next learning session and proposes it to the user via the terminal.

[0065] (Example 1)

[0066] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0067] In language learning, effective support is needed to enable smooth communication in both the native language and the target language. Furthermore, real-time feedback and personalized learning plans tailored to learning progress are required. In particular, the immediate correction of pronunciation and grammatical errors, and the development of an interactive and enriching learning environment are key challenges.

[0068] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0069] In this invention, the server includes a setting means that allows the user to select their native language and the language to be learned, a means that analyzes the user's voice input and converts it into text data using speech recognition technology, and a generation means that creates an appropriate response based on the text data. This allows the user to effectively learn a foreign language while receiving translated responses in real time and obtaining feedback on pronunciation and grammar.

[0070] "Native language" refers to the language that a user uses on a daily basis.

[0071] The "target language" is the language that the user is trying to learn.

[0072] "Setting means" refers to a device or program that provides an interface and functions for the user to select a language.

[0073] "Voice input" refers to spoken data that a user makes to a device.

[0074] "Speech recognition technology" is a technology for converting speech data into text data.

[0075] "Generation means" refers to a device or program for generating an appropriate response to user input.

[0076] "Translation and display means" refers to a device or program that translates a generated response into another language and displays it on a visual device.

[0077] "Analysis and feedback means" refers to a device or program that has the function of analyzing the content of a user's speech and providing methods for correcting or improving errors.

[0078] "Environment provision means" refers to a device or program that provides users with a virtual or augmented real world and enables interaction.

[0079] "Recording and planning means" refers to a device or program that has the function of recording the user's learning progress and creating and providing an individualized learning plan based on that progress.

[0080] The "function to display as subtitles" refers to a technology or program that displays responses generated in real time as text on the screen.

[0081] This system consists of electronic devices and software that provide dialogue in the user's chosen native language and target language. Processing is primarily handled through the cooperation of a server and a terminal. Details are provided below.

[0082] First, when a user accesses the system, the terminal displays a settings screen through the user interface, allowing the user to select their preferred language. The language information selected by the user is then sent from the terminal to the server. The hardware used at this stage consists of a display device and an input device.

[0083] Next, the server reads the necessary language resources from a database corresponding to the received language settings. This process uses a database management system and memory caching technology. The server then prepares to analyze the voice input from the user and uses speech recognition technology (e.g., a speech recognition API) to convert the voice data from the device into text data.

[0084] The voice data spoken by the user through the microphone is converted into a digital format on the device and sent to the server in real time. The server analyzes this voice data using advanced speech recognition technology and generates an appropriate response using a generative AI model. The AI model used is a general natural language processing model.

[0085] The generated response is translated on the server, prepared in both the user's native language and the target language, and sent to the terminal. The terminal displays this response as subtitles on a visual display device. For example, if the user says "Hello, how are you?", the terminal will display "Hello, how are you?". This allows the user to see bilingual responses in real time.

[0086] Furthermore, the server analyzes the user's speech to identify grammatical and pronunciation errors and sends feedback to the device. This feedback includes specific suggestions for improvement. The device displays this on its screen, encouraging the user to improve their speaking skills.

[0087] The system can also provide virtual reality and augmented reality environments, allowing users to have an interactive learning experience. The server generates simulation scenes, enabling conversation practice tailored to the user's learning level.

[0088] The user's learning progress is continuously recorded, and the server uses this data to generate a personalized learning plan, which is then provided in the next learning session. For example, if the user has a weak understanding of a particular phrase or grammatical structure, the server will recommend learning content that focuses on that area.

[0089] Example of a prompt:

[0090] "Translates user utterances in real time and generates responses."

[0091] "Recognizes user speech and provides pronunciation feedback."

[0092] Therefore, users can effectively learn foreign languages using this system.

[0093] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0094] Step 1:

[0095] When a user accesses the system, the terminal displays a screen for selecting their native language and the language they wish to learn. The user selects a language through the provided interface and confirms their selection. This user input is sent to the server as language setting information. The input is the user's language selection, and the output is the language setting information.

[0096] Step 2:

[0097] The server reads relevant language resources from the database based on the received language setting information. This database includes dictionary data, conversation templates, and other similar resources. The server caches this data in memory to enable fast data access. The input is language setting information, and the output is language resources.

[0098] Step 3:

[0099] The user inputs voice through the microphone. The terminal acquires this voice data, converts it to a digital format, and then sends it to the server. The input is the user's voice, and the output is digital voice data.

[0100] Step 4:

[0101] The server converts received digital audio data into text using speech recognition technology. It utilizes a speech recognition API to analyze the audio signal and replace it with corresponding text data. This analysis process involves extracting acoustic features and matching phonemes. The input is digital audio data, and the output is text data.

[0102] Step 5:

[0103] The server uses a generative AI model to generate responses based on the analyzed text data. It applies natural language processing algorithms to determine appropriate communication content based on the context. The input is text data, and the output is a generated response.

[0104] Step 6:

[0105] The server translates the generated response. Using a translation API, it converts the response to the user's native language and the target language, preparing support in both languages. The input is the generated response, and the output is the translated response.

[0106] Step 7:

[0107] The terminal receives the translated response and displays it on the screen as subtitles. This process involves real-time subtitle generation and display using a visual display device. The input is the translated response, and the output is the subtitle display.

[0108] Step 8:

[0109] The server analyzes the grammar and pronunciation of spoken content and provides feedback on areas for correction and improvement. It uses language analysis tools to identify pronunciation errors and calculate methods for improvement. Input is user speech data, and output is feedback information.

[0110] Step 9:

[0111] The server generates data to provide virtual reality and augmented reality environments, offering users an interactive experience. This environment data changes dynamically based on the user's learning. The input is the user's learning needs, and the output is the virtual environment data.

[0112] Step 10:

[0113] The server records the user's learning progress and creates a personalized learning plan. It analyzes past data and suggests appropriate learning materials and practice content for the next session. The input is learning history data, and the output is the learning plan.

[0114] (Application Example 1)

[0115] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0116] Currently, foreign language learning at home typically involves using textbooks and audio materials. However, effectively acquiring the practical skills needed in real-life conversational situations is difficult. In particular, there is a lack of interactive learning support using humanoid robotic devices, and there is a need to provide an environment where users can easily learn foreign languages practically at home.

[0117] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0118] In this invention, the server includes setting means that allows the user to select their native language and the language to be learned; means that analyze the user's voice input and convert it into text data using speech recognition means; translation and display means that translate the generated response into the native language and the language to be learned and display it; and means that control a humanoid automated device that supports voice dialogue in the home and provides simulated conversation scenes. This enables the user to engage in practical foreign language learning in the home.

[0119] "Setting means" refers to a function that allows the user to select their native language and the foreign language they wish to learn.

[0120] "Speech recognition means" refers to technology that analyzes speech data obtained from a user and converts it into text data.

[0121] The "generation means" refers to a function that generates an appropriate response based on text data obtained through speech recognition.

[0122] "Translation and display means" refers to technology that translates the generated response into the native language and a foreign language and presents it visually.

[0123] "Analysis and feedback means" refers to a function that evaluates the grammar and pronunciation of spoken content and provides suggestions for improvement.

[0124] "Environment provision means" refers to technologies that construct scenarios in which users can interact through virtual or augmented reality.

[0125] "Recording and plan provision means" refers to a function that saves the user's learning progress and suggests personalized learning content.

[0126] "Means for controlling humanoid automated devices" refers to the technology for operating automated devices that provide simulated conversation scenarios within the home.

[0127] To implement this invention, a humanoid automated device for home use and a server to control it are required. The server uses the following main hardware and software: a "speech recognition library" for speech recognition, and for generation AI, for example, "GPT-3 (registered trademark)" or "BERT". A "VR / AR engine" is used to construct virtual scenes.

[0128] First, the user selects the foreign language they wish to learn and their native language using the settings function of the humanoid robot. Voice input is sent to the server via a microphone and converted into text data by speech recognition technology. The server then uses this data to generate appropriate dialogue using a generative AI model and translates it in real time into both selected languages. This translation is returned to the user visually or audibly through the robot.

[0129] The server also analyzes the grammar and pronunciation of the user's speech and provides feedback. Using virtual or augmented reality, it constructs scenarios in which the user can interact with various situations, such as ordering at a cafe or checking in at an airport. Learning progress is recorded and used as data to personalize the content of the next learning session.

[0130] For example, if a user wants to practice ordering coffee at a cafe, the server will virtually provide that scenario and present prompts to simulate the conversation through an automated device. By using a prompt such as, "You play the role of a cafe employee and have the user order coffee. Please gently correct any pronunciation mistakes," the user can practice speaking.

[0131] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0132] Step 1:

[0133] The user selects their native language and the foreign language they wish to learn from their device. The input is the user's language selection, and the output is the selected language settings information. The server receives this information and retrieves the appropriate language resources from its database.

[0134] Step 2:

[0135] The user speaks into the microphone. The input is the user's voice data, and the output is the voice data sent to the server. The terminal acquires this voice data and sends it to the server.

[0136] Step 3:

[0137] The server uses a speech recognition library to convert the received audio data into text data. The input is audio data, and the output is the converted text data. Speech recognition transcribes the user's speech into text.

[0138] Step 4:

[0139] The server uses a generative AI model to analyze text data and generate appropriate responses. The input is the text data to be analyzed, and the output is the generated response. Based on this analysis, a response that corresponds to the dialogue is formed.

[0140] Step 5:

[0141] The generated response is translated into the user's native language and the target foreign language via a translation function. The input is the generated response, and the output is the response translated into both languages. The translated content is delivered to the device and displayed to the user.

[0142] Step 6:

[0143] To provide feedback on user speech, the server performs grammatical and pronunciation analysis. The input is the user's text data, and the output is feedback information. The analysis identifies pronunciation and grammatical errors, and suggests ways to improve them.

[0144] Step 7:

[0145] The server utilizes a virtual or augmented reality environment to provide the user with a specific scene. The input is a request for scene selection, and the output is the generated virtual environment. The terminal then provides this scene to the user to facilitate interaction.

[0146] Step 8:

[0147] The user's learning progress is recorded, and the next learning plan is personalized. The input is the recorded learning data, and the output is an optimized learning plan. The server analyzes the data and creates a plan that will be useful for the next learning session.

[0148] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the identification processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform identification processing using the user's emotions.

[0149] This invention combines an emotion recognition function with a bilingual AI system that engages in dialogue using the user's native language and the target foreign language. The aim of this system is to improve the user's language learning experience.

[0150] When a user first accesses the system, the terminal prompts them to select their native language and the foreign language they wish to learn. This selection information is sent to the server, which retrieves appropriate language resources from its database. This is to provide a language interface optimized for the user.

[0151] When a user performs voice input, their speech is recorded using a microphone. The device sends this voice data to a server, which converts the speech into text data using speech recognition technology. Based on this text data, a generative AI model is used to generate an appropriate response that corresponds to the user's speech.

[0152] The generated responses are translated in real time into the user's native language and the target language, and sent from the server to the terminal. The terminal displays this on the user's screen in both languages, allowing the user to see the conversation in both languages. For example, if the user asks "How are you doing today?", the system will display "How are you?".

[0153] A key element of this invention is the emotion engine, which analyzes the user's speech content along with their voice tone and facial expression data to recognize the user's current emotional state. This is done through camera and voice analysis technology. The recognized emotion is sent to the server and reflected in the tone and content of the generated response. For example, if the user speaks in a discouraged voice, the server will generate a response that is more encouraging and kind.

[0154] Furthermore, this emotional data is recorded along with the user's learning history. Based on this history, the server analyzes how the learning content emotionally affected the user. The results of this data analysis help design individual learning plans, for example, suggesting learning materials and simulations that enhance enjoyment and excitement for users whose modifier level is low.

[0155] Virtual reality and augmented reality environments allow users to practice dialogue in various scenarios. For example, when a user practices ordering in a virtual restaurant simulation, an emotion engine can provide emotionally responsive interactions.

[0156] As a result, this invention goes beyond mere language acquisition, providing learning support tailored to the user's emotions, thereby guaranteeing a more effective and fulfilling learning experience for the user.

[0157] The following describes the processing flow.

[0158] Step 1:

[0159] The user accesses the system and selects their native language and the language they wish to learn. The terminal sends this selection information to the server.

[0160] Step 2:

[0161] The server retrieves the corresponding language resources from the database based on the selected language setting. Based on these resources, it prepares a language interface suitable for the user.

[0162] Step 3:

[0163] The user initiates conversation mode and speaks through the microphone. The device records the audio data and sends it to the server.

[0164] Step 4:

[0165] The server converts the received audio data into text data using speech recognition technology. This text is then analyzed to understand the content of the speech.

[0166] Step 5:

[0167] The server generates an appropriate response using a generative AI model based on the analyzed text data. This response is created in the target language.

[0168] Step 6:

[0169] The generated response is translated by the server into the user's native language and the target language. The translated content is sent to the device and displayed as subtitles on the screen.

[0170] Step 7:

[0171] Simultaneously, the device captures the user's voice tone and facial expressions using its camera and microphone, and sends them to the server.

[0172] Step 8:

[0173] The server's emotion engine analyzes the transmitted voice tone and facial expression data to recognize the user's emotional state.

[0174] Step 9:

[0175] Based on the user's perceived emotions, the server adjusts the tone and content of its responses. For example, if the user is feeling down, it might include encouraging messages.

[0176] Step 10:

[0177] The device displays the adjusted response to the user with subtitles, enabling emotionally sensitive interaction.

[0178] Step 11:

[0179] The server records both user sentiment data and learning content. Based on this, it creates a personalized learning plan that adjusts the content for the next learning session, thereby improving the user's learning efficiency.

[0180] Step 12:

[0181] When a user uses a VR / AR device, the server creates a virtual reality environment and provides a simulation that allows the user to interact within it.

[0182] (Example 2)

[0183] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0184] Conventional language learning systems struggle to adapt to individual user emotional states, potentially leading to decreased learning efficiency. Furthermore, they cannot adjust learning content according to emotional states, making it difficult to maintain user motivation. Additionally, real-time response generation presents a challenge in its inability to respond flexibly to user emotions.

[0185] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0186] In this invention, the server includes emotion recognition means for analyzing the tone of voice and facial expressions in the user's speech to recognize emotions; response adjustment means for adjusting the response generated based on emotion recognition; and learning plan provision means for personalizing the content of the next learning session based on the user's learning history and emotional state. This provides an interactive learning experience that responds to the user's emotional state, enabling effective learning while maintaining high user motivation.

[0187] "Settings" refers to a function that provides an interface for users to select their native language and a foreign language.

[0188] "Speech recognition means" refers to a technology for converting a user's voice input into text data, and is the process of recognizing speech and converting it into text.

[0189] "Generation means" includes algorithms and processes for creating appropriate responses based on text data.

[0190] "Translation and display means" refers to a method for translating the generated response into the user's native language and a foreign language and providing it to the user visually.

[0191] The "analysis and feedback mechanism" is a system that analyzes the grammar and pronunciation of the user's speech and provides feedback based on the results.

[0192] "Environment provision means" refers to a function that provides a virtual reality or augmented reality setting, enabling users to interact within that setting.

[0193] "Recording and plan provision means" refers to a system element that records the user's learning progress and proposes an individualized learning plan based on that data.

[0194] "Emotion recognition means" refers to technology that analyzes the user's voice tone and facial expression data to determine their current emotional state.

[0195] "Response adjustment means" refers to the process of appropriately adjusting the tone and content of a response based on the results of emotion recognition.

[0196] This invention is a bilingual AI system for improving the user's language learning experience, enabling interactive learning that takes into account the user's emotional state. Specific embodiments are described below.

[0197] The server runs on hardware with a high-performance processor and ample storage, and uses software such as generative AI models (e.g., GPT-3) and speech recognition technologies (e.g., common speech recognition APIs). This allows it to instantly convert user voice data into text data and generate appropriate responses.

[0198] The terminal functions as the user interface and is a device equipped with a touchscreen, microphone, and camera. The terminal receives input from the user and communicates data bidirectionally with the server. Specifically, the user selects their native language and the language to be learned on the terminal's screen, which activates the settings mechanism and sends the selection information to the server.

[0199] When a user makes a voice input into the terminal, the terminal records the voice and sends the voice data to the server. The server analyzes this data using speech recognition means and converts it into text data. Next, it uses generation means to generate an appropriate response based on this text data. Subsequently, the generated response is translated into both the user's native language and a foreign language through translation and display means and displayed on the user's terminal.

[0200] For emotion recognition, the device collects the user's voice tone and facial expressions using a camera and voice analysis technology, and transmits them to a server. The server uses emotion recognition means to determine the user's emotional state from this data and reflects the result in the response adjustment means. For example, if the user is disappointed, the server will generate a response in a gentle tone.

[0201] As a concrete example of this system, if a user types "How are you doing today?" into the terminal, the system will display a translated response such as "How are you?" via the server. Furthermore, by using a prompt such as "Generate a gentle and encouraging response in Japanese and English when the user is feeling down," the system can generate a contextually appropriate response that matches the user's emotions.

[0202] Thus, the system of the present invention can support the user's learning experience and provide emotionally responsive interactions through interactive communication based on voice input and generated responses.

[0203] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0204] Step 1:

[0205] The device prompts the user to select a language.

[0206] The user selects their native language and the foreign language they wish to learn from a list displayed on the device screen. The selected language information is entered, and the device sends this information to the server. Based on the received language information, the server searches its database for and retrieves appropriate language resources. This process allows the user to utilize an optimized language interface.

[0207] Step 2:

[0208] The user inputs by speaking.

[0209] The user speaks into the device's microphone. The device records this voice as a digital signal through the microphone. This digital signal is sent from the device to the server. The server receives this voice signal as input and converts it into text data using speech recognition. This conversion results in the user's voice being output as text data.

[0210] Step 3:

[0211] The server generates a response using a generated AI model.

[0212] The server takes the text data obtained through speech recognition as input and passes it to the generative AI model as a prompt. The generative AI model generates an appropriate response based on the input text, while understanding the context. This process results in a natural language response that corresponds to the user's input.

[0213] Step 4:

[0214] The server translates the generated response and sends it to the terminal.

[0215] The server performs a translation process to translate the generated response into the user's native language and the target language. Specifically, it uses a translation API to convert the response into two languages. This translated response data is output from the server to the terminal, allowing the user to view the response in both languages on the terminal's display.

[0216] Step 5:

[0217] The device collects user sentiment data and sends it to the server.

[0218] While the user speaks, the device continuously records the user's facial expressions and voice tone through its camera and microphone, collecting emotional data. This data is sent from the device to a server. The server uses emotion recognition to estimate the user's current emotional state from the acquired data. Based on this, this emotional information is taken into consideration when generating subsequent responses.

[0219] Step 6:

[0220] The server adjusts its response based on emotions.

[0221] Based on the output of the emotion recognition system, the server readjusts the tone and content of the generated response. For example, if the server detects that the user is depressed, it selects a response that includes encouragement and kindness. The adjusted response data is sent to the terminal as the final output, which the user can then receive.

[0222] (Application Example 2)

[0223] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0224] In today's learning environment, many people struggle to understand the nuances of different cultures and languages when trying to acquire a language. Furthermore, traditional language learning systems often provide uniform responses without considering the user's feelings, which can undermine learners' motivation. Therefore, a system is needed that takes into account the feelings of individual users and provides appropriate responses.

[0225] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0226] In this invention, the server includes a setting means that enables the selection of the native language and the language to be learned; a means that analyzes the user's voice input and converts it into text data by speech recognition; and an emotion recognition and adjustment means that recognizes the user's emotions and adjusts the content of the response based on those emotions. This makes it possible to provide a personalized learning experience that responds to the user's emotions and improve the quality of learning.

[0227] "Setting means" refers to a function that provides an interface for users to select their native language and the foreign language they wish to learn.

[0228] "Voice recognition means" refers to technology that converts a user's voice input into text data.

[0229] "Generation method" refers to the process of creating an appropriate response based on text data.

[0230] "Translation and display means" refers to a system that translates the generated response into the user's native language and the target language, and presents it visually to the user.

[0231] "Analysis and feedback means" refers to a system that analyzes the user's speech and provides feedback on grammar and pronunciation.

[0232] "Environment provision means" refers to technologies that provide users with a virtual reality or augmented reality environment, enabling them to interact with it.

[0233] "Recording and plan provision means" refers to a system for recording a user's learning progress and designing an individualized learning plan.

[0234] "Emotion recognition and adjustment means" refers to a function that recognizes the user's emotions and adjusts the content and tone of responses based on those emotions.

[0235] To implement this invention, it is first necessary for data to be transmitted and received between the server and the terminal. When a user accesses the system from the terminal, the terminal selects its native language and the foreign language to be learned through a configuration means. This selection information is sent to the server, which retrieves appropriate language resources from a database. This database is intended to provide a language interface optimized for the user.

[0236] Next, the user's voice input is recorded by the device's microphone. Speech recognition technology is used to convert this into text data. Based on this text data, the server uses a generative AI model to generate an appropriate response according to the user's utterance. This generated response is translated in real time into the user's native language and the target language by translation and display means and displayed on the device.

[0237] Furthermore, emotion recognition and adjustment mechanisms analyze the user's tone of voice and facial expressions as emotions via the camera. This emotion data is sent to a server and reflected in the content and tone of the generated response. For example, when the user has a cheerful expression, the server provides more engaging content.

[0238] Furthermore, the recording and plan provisioning mechanisms integrate learning history and sentiment data to design a personalized learning plan for the next learning session. In doing so, they take into account how the user has reacted in the past and suggest material that is engaging or enhances learning efficiency.

[0239] For example, if a user is using this system during a museum tour, the terminal will detect their interests and emotions and provide relevant additional information in real time. An example of a prompt message would be, "Please tell us about the most interesting episode from a past museum tour. Use that to generate guide information that might interest visitors."

[0240] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0241] Step 1:

[0242] The terminal provides an interface for the user to set their native language and the language to be learned. The language information selected by the user is sent to the server as input data. The server receives this information, performs data calculations to retrieve appropriate language resources from the database, and provides the output to the terminal.

[0243] Step 2:

[0244] The user provides voice input through the device's microphone. The device sends this voice data to the server. The server uses speech recognition to convert the voice data into text data. This process involves data processing called speech waveform analysis, and the output is in text format.

[0245] Step 3:

[0246] The server uses a generative AI model to take text data converted from speech as input and generate an appropriate response. The generative AI model performs natural language processing, analyzes the user's intent, and outputs the appropriate response text.

[0247] Step 4:

[0248] The generated response is translated by the server into the user's native language and the target language. The translated data is output and returned to the terminal. The terminal displays this on the screen, providing it to the user visually.

[0249] Step 5:

[0250] The device uses its camera and microphone to analyze the user's facial expressions and voice tone based on emotion recognition and adjustment mechanisms. The input for this analysis is the user's voice and video data, and the output is the user's emotional state. This information is sent to a server, influencing the tone and content of the generated response.

[0251] Step 6:

[0252] The server integrates the user's past voice data, learning history, and emotional information through recording and plan provisioning mechanisms to personalize the next learning plan. This process involves data calculations based on past data to generate the learning plan. The generated plan is then sent to the terminal for the next learning session.

[0253] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0254] Data generation model 58 is a so-called generative AI (Artificial Intelligence). An example of data generation model 58 is ChatGPT (registered trademark) (Internet search).<URL: https: / / openai.com / blog / chatgpt> ), Gemini (registered trademark) (Internet search) <url: https: gemini.google.com ?hl="ja">Examples of generative AI include the following. The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0255] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0256] [Second Embodiment]

[0257] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0258] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0259] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0260] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0261] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0262] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0263] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0264] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0265] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0266] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the identification processing unit 290.

[0267] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0268] Next, the identification processing performed by the identification processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0269] This invention relates to a bilingual AI teacher system capable of engaging in dialogue in both the user's native language and the target language. This system provides interactive conversations with the user based on the language settings selected by the user. Specific embodiments of the system are described below.

[0270] When a user first accesses the system, the terminal prompts the user to select their native language and the foreign language they wish to learn. This information is sent to the server, which then retrieves the appropriate language resources from its database based on the selections.

[0271] To perform voice input, the user speaks through the microphone. The device acquires the voice data and sends it to the server. The server uses advanced speech recognition technology to convert the speech into text and analyzes its content. Based on this analysis, a generative AI model creates an appropriate response to the user's utterance.

[0272] The generated response is translated in real time by the server. The translated text is available in both the user's native language and the target language, and is sent to the terminal. The terminal displays both languages as subtitles on the user's screen. For example, if the user says "Hello, how are you?", the system will display "Hello, how are you?" as subtitles.

[0273] The server also analyzes the user's language use and identifies grammatical and pronunciation errors. The identified problems, along with suggestions for improvement, are sent to the device as feedback. This allows users to receive excellent feedback on their speech and effectively advance their learning.

[0274] This system can also utilize virtual reality (VR) and augmented reality (AR) environments. When users utilize these environments, the server can provide virtual scenes such as restaurants and airports, allowing them to simulate English conversations within those settings.

[0275] The server continuously records the user's learning progress. Based on this, the server provides the user with a learning plan optimized for the next learning session. The system stores the user's learning history in a database and analyzes this history to create a personalized learning plan. For example, if the user has a weak understanding of certain phrases or words, the next session will be adjusted to focus on those areas.

[0276] This system configuration allows users to learn a foreign language more effectively while receiving support in their native language.

[0277] The following describes the process flow.

[0278] Step 1:

[0279] The user logs in to the system and selects their native language and the language they want to learn. The terminal receives this selection information and sends it to the server.

[0280] Step 2:

[0281] The server retrieves appropriate language resources from the database based on the language settings selected by the user. This enables the system to provide an interface tailored to the user's language environment.

[0282] Step 3:

[0283] The user uses the microphone to speak. The terminal records the user's voice and sends the voice data to the server.

[0284] Step 4:

[0285] The server converts the received voice data into text using advanced speech recognition technology. This process makes the content of the voice analyzable as text data.

[0286] Step 5:

[0287] The server analyzes the text data converted from the voice and generates an appropriate response using a generative AI model. This response is accurately constructed based on the user's question or utterance content.

[0288] Step 6:

[0289] The server translates the generated response in real-time into the user's native language and the target language for learning. The translated text data is sent to the terminal.

[0290] Step 7:

[0291] The device displays the received translation data as subtitles on the user's screen. This allows the user to see the response to their speech in both their native language and the language they are learning.

[0292] Step 8:

[0293] The server analyzes the user's speech and identifies grammatical and pronunciation errors. It then sends feedback on the identified errors and suggestions for correction to the terminal.

[0294] Step 9:

[0295] The device displays feedback from the server to the user, supporting the user's learning improvement. This process allows the user to improve their language skills in real time.

[0296] Step 10:

[0297] If the user is using a virtual reality (VR) or augmented reality (AR) device, the server provides interaction within a virtual environment. This offers the user the opportunity to learn the language practically in a simulated setting.

[0298] Step 11:

[0299] The server tracks the user's learning progress and stores it in a database. Based on this data, it generates a personalized learning plan for the next learning session and proposes it to the user via the terminal.

[0300] (Example 1)

[0301] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0302] In language learning, effective support is required to enable smooth communication in both the native language and the target language to be acquired. Furthermore, real-time feedback and the provision of individualized learning plans according to the learning progress are demanded. In particular, it is an issue to immediately correct pronunciation and grammar errors and to establish an interactive and rich learning environment.

[0303] The specific processing by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following respective means.

[0304] In this invention, the server includes setting means that enables the selection of the native language and the target language to be acquired, means for analyzing the user's voice input and converting it into text data by voice recognition technology, and generating means for creating an appropriate response generated based on the text data. Thereby, the user can effectively learn a foreign language while receiving a real-time translated response and obtaining feedback on pronunciation and grammar.

[0305] The "native language" is the language that serves as the basis for daily use by the user.

[0306] The "target language to be acquired" is the language that the user intends to newly learn.

[0307] The "setting means" is a device or program that provides an interface and functions for the user to select a language.

[0308] The "voice input" is the speech data that the user performs on the device.

[0309] The "voice recognition technology" is a technology for converting voice data into text data.

[0310] The "generating means" is a device or program for generating an appropriate response to the user's input.

[0311] "Translation and display means" refers to a device or program that translates a generated response into another language and displays it on a visual device.

[0312] "Analysis and feedback means" refers to a device or program that has the function of analyzing the content of a user's speech and providing methods for correcting or improving errors.

[0313] "Environment provision means" refers to a device or program that provides users with a virtual or augmented real world and enables interaction.

[0314] "Recording and planning means" refers to a device or program that has the function of recording the user's learning progress and creating and providing an individualized learning plan based on that progress.

[0315] The "function to display as subtitles" refers to a technology or program that displays responses generated in real time as text on the screen.

[0316] This system consists of electronic devices and software that provide dialogue in the user's chosen native language and target language. Processing is primarily handled through the cooperation of a server and a terminal. Details are provided below.

[0317] First, when a user accesses the system, the terminal displays a settings screen through the user interface, allowing the user to select their preferred language. The language information selected by the user is then sent from the terminal to the server. The hardware used at this stage consists of a display device and an input device.

[0318] Next, the server reads the necessary language resources from a database corresponding to the received language settings. This process uses a database management system and memory caching technology. The server then prepares to analyze the voice input from the user and uses speech recognition technology (e.g., a speech recognition API) to convert the voice data from the device into text data.

[0319] The voice data spoken by the user through the microphone is converted into a digital format on the device and sent to the server in real time. The server analyzes this voice data using advanced speech recognition technology and generates an appropriate response using a generative AI model. The AI model used is a general natural language processing model.

[0320] The generated response is translated on the server, prepared in both the user's native language and the target language, and sent to the terminal. The terminal displays this response as subtitles on a visual display device. For example, if the user says "Hello, how are you?", the terminal will display "Hello, how are you?". This allows the user to see bilingual responses in real time.

[0321] Furthermore, the server analyzes the user's speech to identify grammatical and pronunciation errors and sends feedback to the device. This feedback includes specific suggestions for improvement. The device displays this on its screen, encouraging the user to improve their speaking skills.

[0322] The system can also provide virtual reality and augmented reality environments, allowing users to have an interactive learning experience. The server generates simulation scenes, enabling conversation practice tailored to the user's learning level.

[0323] The user's learning progress is continuously recorded, and the server uses this data to generate a personalized learning plan, which is then provided in the next learning session. For example, if the user has a weak understanding of a particular phrase or grammatical structure, the server will recommend learning content that focuses on that area.

[0324] Example of a prompt:

[0325] "Translates user utterances in real time and generates responses."

[0326] "Recognizes user speech and provides pronunciation feedback."

[0327] Therefore, users can effectively learn foreign languages using this system.

[0328] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0329] Step 1:

[0330] When a user accesses the system, the terminal displays a screen for selecting their native language and the language they wish to learn. The user selects a language through the provided interface and confirms their selection. This user input is sent to the server as language setting information. The input is the user's language selection, and the output is the language setting information.

[0331] Step 2:

[0332] The server reads relevant language resources from the database based on the received language setting information. This database includes dictionary data, conversation templates, and other similar resources. The server caches this data in memory to enable fast data access. The input is language setting information, and the output is language resources.

[0333] Step 3:

[0334] The user inputs voice through the microphone. The terminal acquires this voice data, converts it to a digital format, and then sends it to the server. The input is the user's voice, and the output is digital voice data.

[0335] Step 4:

[0336] The server converts received digital audio data into text using speech recognition technology. It utilizes a speech recognition API to analyze the audio signal and replace it with corresponding text data. This analysis process involves extracting acoustic features and matching phonemes. The input is digital audio data, and the output is text data.

[0337] Step 5:

[0338] The server uses a generative AI model to generate responses based on the analyzed text data. It applies natural language processing algorithms to determine appropriate communication content based on the context. The input is text data, and the output is a generated response.

[0339] Step 6:

[0340] The server translates the generated response. Using a translation API, it converts the response to the user's native language and the target language, preparing support in both languages. The input is the generated response, and the output is the translated response.

[0341] Step 7:

[0342] The terminal receives the translated response and displays it on the screen as subtitles. This process involves real-time subtitle generation and display using a visual display device. The input is the translated response, and the output is the subtitle display.

[0343] Step 8:

[0344] The server analyzes the grammar and pronunciation of spoken content and provides feedback on areas for correction and improvement. It uses language analysis tools to identify pronunciation errors and calculate methods for improvement. Input is user speech data, and output is feedback information.

[0345] Step 9:

[0346] The server generates data to provide virtual reality and augmented reality environments, offering users an interactive experience. This environment data changes dynamically based on the user's learning. The input is the user's learning needs, and the output is the virtual environment data.

[0347] Step 10:

[0348] The server records the user's learning progress and creates a personalized learning plan. It analyzes past data and suggests appropriate learning materials and practice content for the next session. The input is learning history data, and the output is the learning plan.

[0349] (Application Example 1)

[0350] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0351] Currently, foreign language learning at home typically involves using textbooks and audio materials. However, effectively acquiring the practical skills needed in real-life conversational situations is difficult. In particular, there is a lack of interactive learning support using humanoid robotic devices, and there is a need to provide an environment where users can easily learn foreign languages practically at home.

[0352] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0353] In this invention, the server includes setting means that allows the user to select their native language and the language to be learned; means that analyze the user's voice input and convert it into text data using speech recognition means; translation and display means that translate the generated response into the native language and the language to be learned and display it; and means that control a humanoid automated device that supports voice dialogue in the home and provides simulated conversation scenes. This enables the user to engage in practical foreign language learning in the home.

[0354] "Setting means" refers to a function that allows the user to select their native language and the foreign language they wish to learn.

[0355] "Speech recognition means" refers to technology that analyzes speech data obtained from a user and converts it into text data.

[0356] The "generation means" refers to a function that generates an appropriate response based on text data obtained through speech recognition.

[0357] "Translation and display means" refers to technology that translates the generated response into the native language and a foreign language and presents it visually.

[0358] "Analysis and feedback means" refers to a function that evaluates the grammar and pronunciation of spoken content and provides suggestions for improvement.

[0359] "Environment provision means" refers to technologies that construct scenarios in which users can interact through virtual or augmented reality.

[0360] "Recording and plan provision means" refers to a function that saves the user's learning progress and suggests personalized learning content.

[0361] "Means for controlling humanoid automated devices" refers to the technology for operating automated devices that provide simulated conversation scenarios within the home.

[0362] To implement this invention, a humanoid automation device for home use and a server to control it are required. The server uses the following main hardware and software: a "speech recognition library" for speech recognition, and a generation AI such as "GPT-3" or "BERT". A "VR / AR engine" is used to construct virtual scenes.

[0363] First, the user selects the foreign language they wish to learn and their native language using the settings function of the humanoid robot. Voice input is sent to the server via a microphone and converted into text data by speech recognition technology. The server then uses this data to generate appropriate dialogue using a generative AI model and translates it in real time into both selected languages. This translation is returned to the user visually or audibly through the robot.

[0364] The server also analyzes the grammar and pronunciation of the user's speech and provides feedback. Using virtual or augmented reality, it constructs scenarios in which the user can interact with various situations, such as ordering at a cafe or checking in at an airport. Learning progress is recorded and used as data to personalize the content of the next learning session.

[0365] For example, if a user wants to practice ordering coffee at a cafe, the server will virtually provide that scenario and present prompts to simulate the conversation through an automated device. By using a prompt such as, "You play the role of a cafe employee and have the user order coffee. Please gently correct any pronunciation mistakes," the user can practice speaking.

[0366] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0367] Step 1:

[0368] The user selects their native language and the foreign language they wish to learn from their device. The input is the user's language selection, and the output is the selected language settings information. The server receives this information and retrieves the appropriate language resources from its database.

[0369] Step 2:

[0370] The user speaks into the microphone. The input is the user's voice data, and the output is the voice data sent to the server. The terminal acquires this voice data and sends it to the server.

[0371] Step 3:

[0372] The server uses a speech recognition library to convert the received audio data into text data. The input is audio data, and the output is the converted text data. Speech recognition transcribes the user's speech into text.

[0373] Step 4:

[0374] The server uses a generative AI model to analyze text data and generate appropriate responses. The input is the text data to be analyzed, and the output is the generated response. Based on this analysis, a response that corresponds to the dialogue is formed.

[0375] Step 5:

[0376] The generated response is translated into the user's native language and the target foreign language via a translation function. The input is the generated response, and the output is the response translated into both languages. The translated content is delivered to the device and displayed to the user.

[0377] Step 6:

[0378] To provide feedback on user speech, the server performs grammatical and pronunciation analysis. The input is the user's text data, and the output is feedback information. The analysis identifies pronunciation and grammatical errors, and suggests ways to improve them.

[0379] Step 7:

[0380] The server utilizes a virtual or augmented reality environment to provide the user with a specific scene. The input is a request for scene selection, and the output is the generated virtual environment. The terminal then provides this scene to the user to facilitate interaction.

[0381] Step 8:

[0382] The user's learning progress is recorded, and the next learning plan is personalized. The input is the recorded learning data, and the output is an optimized learning plan. The server analyzes the data and creates a plan that will be useful for the next learning session.

[0383] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the identification processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform identification processing using the user's emotions.

[0384] This invention combines an emotion recognition function with a bilingual AI system that engages in dialogue using the user's native language and the target foreign language. The aim of this system is to improve the user's language learning experience.

[0385] When a user first accesses the system, the terminal prompts them to select their native language and the foreign language they wish to learn. This selection information is sent to the server, which retrieves appropriate language resources from its database. This is to provide a language interface optimized for the user.

[0386] When a user performs voice input, their speech is recorded using a microphone. The device sends this voice data to a server, which converts the speech into text data using speech recognition technology. Based on this text data, a generative AI model is used to generate an appropriate response that corresponds to the user's speech.

[0387] The generated responses are translated in real time into the user's native language and the target language, and sent from the server to the terminal. The terminal displays this on the user's screen in both languages, allowing the user to see the conversation in both languages. For example, if the user asks "How are you doing today?", the system will display "How are you?".

[0388] A key element of this invention is the emotion engine, which analyzes the user's speech content along with their voice tone and facial expression data to recognize the user's current emotional state. This is done through camera and voice analysis technology. The recognized emotion is sent to the server and reflected in the tone and content of the generated response. For example, if the user speaks in a discouraged voice, the server will generate a response that is more encouraging and kind.

[0389] Furthermore, this emotional data is recorded along with the user's learning history. Based on this history, the server analyzes how the learning content emotionally affected the user. The results of this data analysis help design individual learning plans, for example, suggesting learning materials and simulations that enhance enjoyment and excitement for users whose modifier level is low.

[0390] Virtual reality and augmented reality environments allow users to practice dialogue in various scenarios. For example, when a user practices ordering in a virtual restaurant simulation, an emotion engine can provide emotionally responsive interactions.

[0391] As a result, this invention goes beyond mere language acquisition, providing learning support tailored to the user's emotions, thereby guaranteeing a more effective and fulfilling learning experience for the user.

[0392] The following describes the processing flow.

[0393] Step 1:

[0394] The user accesses the system and selects their native language and the language they wish to learn. The terminal sends this selection information to the server.

[0395] Step 2:

[0396] The server retrieves the corresponding language resources from the database based on the selected language setting. Based on these resources, it prepares a language interface suitable for the user.

[0397] Step 3:

[0398] The user initiates conversation mode and speaks through the microphone. The device records the audio data and sends it to the server.

[0399] Step 4:

[0400] The server converts the received audio data into text data using speech recognition technology. This text is then analyzed to understand the content of the speech.

[0401] Step 5:

[0402] The server generates an appropriate response using a generative AI model based on the analyzed text data. This response is created in the target language.

[0403] Step 6:

[0404] The generated response is translated by the server into the user's native language and the target language. The translated content is sent to the device and displayed as subtitles on the screen.

[0405] Step 7:

[0406] Simultaneously, the device captures the user's voice tone and facial expressions using its camera and microphone, and sends them to the server.

[0407] Step 8:

[0408] The server's emotion engine analyzes the transmitted voice tone and facial expression data to recognize the user's emotional state.

[0409] Step 9:

[0410] Based on the user's perceived emotions, the server adjusts the tone and content of its responses. For example, if the user is feeling down, it might include encouraging messages.

[0411] Step 10:

[0412] The device displays the adjusted response to the user with subtitles, enabling emotionally sensitive interaction.

[0413] Step 11:

[0414] The server records both user sentiment data and learning content. Based on this, it creates a personalized learning plan that adjusts the content for the next learning session, thereby improving the user's learning efficiency.

[0415] Step 12:

[0416] When a user uses a VR / AR device, the server creates a virtual reality environment and provides a simulation that allows the user to interact within it.

[0417] (Example 2)

[0418] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0419] Conventional language learning systems struggle to adapt to individual user emotional states, potentially leading to decreased learning efficiency. Furthermore, they cannot adjust learning content according to emotional states, making it difficult to maintain user motivation. Additionally, real-time response generation presents a challenge in its inability to respond flexibly to user emotions.

[0420] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0421] In this invention, the server includes emotion recognition means for analyzing the tone of voice and facial expressions in the user's speech to recognize emotions; response adjustment means for adjusting the response generated based on emotion recognition; and learning plan provision means for personalizing the content of the next learning session based on the user's learning history and emotional state. This provides an interactive learning experience that responds to the user's emotional state, enabling effective learning while maintaining high user motivation.

[0422] "Settings" refers to a function that provides an interface for users to select their native language and a foreign language.

[0423] "Speech recognition means" refers to a technology for converting a user's voice input into text data, and is the process of recognizing speech and converting it into text.

[0424] "Generation means" includes algorithms and processes for creating appropriate responses based on text data.

[0425] "Translation and display means" refers to a method for translating the generated response into the user's native language and a foreign language and providing it to the user visually.

[0426] The "analysis and feedback mechanism" is a system that analyzes the grammar and pronunciation of the user's speech and provides feedback based on the results.

[0427] "Environment provision means" refers to a function that provides a virtual reality or augmented reality setting, enabling users to interact within that setting.

[0428] "Recording and plan provision means" refers to a system element that records the user's learning progress and proposes an individualized learning plan based on that data.

[0429] "Emotion recognition means" refers to technology that analyzes the user's voice tone and facial expression data to determine their current emotional state.

[0430] "Response adjustment means" refers to the process of appropriately adjusting the tone and content of a response based on the results of emotion recognition.

[0431] This invention is a bilingual AI system for improving the user's language learning experience, enabling interactive learning that takes into account the user's emotional state. Specific embodiments are described below.

[0432] The server runs on hardware with a high-performance processor and ample storage, and uses software such as generative AI models (e.g., GPT-3) and speech recognition technologies (e.g., common speech recognition APIs). This allows it to instantly convert user voice data into text data and generate appropriate responses.

[0433] The terminal functions as the user interface and is a device equipped with a touchscreen, microphone, and camera. The terminal receives input from the user and communicates data bidirectionally with the server. Specifically, the user selects their native language and the language to be learned on the terminal's screen, which activates the settings mechanism and sends the selection information to the server.

[0434] When a user makes a voice input into the terminal, the terminal records the voice and sends the voice data to the server. The server analyzes this data using speech recognition means and converts it into text data. Next, it uses generation means to generate an appropriate response based on this text data. Subsequently, the generated response is translated into both the user's native language and a foreign language through translation and display means and displayed on the user's terminal.

[0435] For emotion recognition, the device collects the user's voice tone and facial expressions using a camera and voice analysis technology, and transmits them to a server. The server uses emotion recognition means to determine the user's emotional state from this data and reflects the result in the response adjustment means. For example, if the user is disappointed, the server will generate a response in a gentle tone.

[0436] As a concrete example of this system, if a user types "How are you doing today?" into the terminal, the system will display a translated response such as "How are you?" via the server. Furthermore, by using a prompt such as "Generate a gentle and encouraging response in Japanese and English when the user is feeling down," the system can generate a contextually appropriate response that matches the user's emotions.

[0437] Thus, the system of the present invention can support the user's learning experience and provide emotionally responsive interactions through interactive communication based on voice input and generated responses.

[0438] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0439] Step 1:

[0440] The device prompts the user to select a language.

[0441] The user selects their native language and the foreign language they wish to learn from a list displayed on the device screen. The selected language information is entered, and the device sends this information to the server. Based on the received language information, the server searches its database for and retrieves appropriate language resources. This process allows the user to utilize an optimized language interface.

[0442] Step 2:

[0443] The user inputs by speaking.

[0444] The user speaks into the device's microphone. The device records this voice as a digital signal through the microphone. This digital signal is sent from the device to the server. The server receives this voice signal as input and converts it into text data using speech recognition. This conversion results in the user's voice being output as text data.

[0445] Step 3:

[0446] The server generates a response using a generated AI model.

[0447] The server takes the text data obtained through speech recognition as input and passes it to the generative AI model as a prompt. The generative AI model generates an appropriate response based on the input text, while understanding the context. This process results in a natural language response that corresponds to the user's input.

[0448] Step 4:

[0449] The server translates the generated response and sends it to the terminal.

[0450] The server performs a translation process to translate the generated response into the user's native language and the target language. Specifically, it uses a translation API to convert the response into two languages. This translated response data is output from the server to the terminal, allowing the user to view the response in both languages on the terminal's display.

[0451] Step 5:

[0452] The device collects user sentiment data and sends it to the server.

[0453] While the user speaks, the device continuously records the user's facial expressions and voice tone through its camera and microphone, collecting emotional data. This data is sent from the device to a server. The server uses emotion recognition to estimate the user's current emotional state from the acquired data. Based on this, this emotional information is taken into consideration when generating subsequent responses.

[0454] Step 6:

[0455] The server adjusts its response based on emotions.

[0456] Based on the output of the emotion recognition system, the server readjusts the tone and content of the generated response. For example, if the server detects that the user is depressed, it selects a response that includes encouragement and kindness. The adjusted response data is sent to the terminal as the final output, which the user can then receive.

[0457] (Application Example 2)

[0458] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0459] In today's learning environment, many people struggle to understand the nuances of different cultures and languages when trying to acquire a language. Furthermore, traditional language learning systems often provide uniform responses without considering the user's feelings, which can undermine learners' motivation. Therefore, a system is needed that takes into account the feelings of individual users and provides appropriate responses.

[0460] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0461] In this invention, the server includes a setting means that enables the selection of the native language and the language to be learned; a means that analyzes the user's voice input and converts it into text data by speech recognition; and an emotion recognition and adjustment means that recognizes the user's emotions and adjusts the content of the response based on those emotions. This makes it possible to provide a personalized learning experience that responds to the user's emotions and improve the quality of learning.

[0462] "Setting means" refers to a function that provides an interface for users to select their native language and the foreign language they wish to learn.

[0463] "Voice recognition means" refers to technology that converts a user's voice input into text data.

[0464] "Generation method" refers to the process of creating an appropriate response based on text data.

[0465] "Translation and display means" refers to a system that translates the generated response into the user's native language and the target language, and presents it visually to the user.

[0466] "Analysis and feedback means" refers to a system that analyzes the user's speech and provides feedback on grammar and pronunciation.

[0467] "Environment provision means" refers to technologies that provide users with a virtual reality or augmented reality environment, enabling them to interact with it.

[0468] "Recording and plan provision means" refers to a system for recording a user's learning progress and designing an individualized learning plan.

[0469] "Emotion recognition and adjustment means" refers to a function that recognizes the user's emotions and adjusts the content and tone of responses based on those emotions.

[0470] To implement this invention, it is first necessary for data to be transmitted and received between the server and the terminal. When a user accesses the system from the terminal, the terminal selects its native language and the foreign language to be learned through a configuration means. This selection information is sent to the server, which retrieves appropriate language resources from a database. This database is intended to provide a language interface optimized for the user.

[0471] Next, the user's voice input is recorded by the device's microphone. Speech recognition technology is used to convert this into text data. Based on this text data, the server uses a generative AI model to generate an appropriate response according to the user's utterance. This generated response is translated in real time into the user's native language and the target language by translation and display means and displayed on the device.

[0472] Furthermore, emotion recognition and adjustment mechanisms analyze the user's tone of voice and facial expressions as emotions via the camera. This emotion data is sent to a server and reflected in the content and tone of the generated response. For example, when the user has a cheerful expression, the server provides more engaging content.

[0473] Furthermore, the recording and plan provisioning mechanisms integrate learning history and sentiment data to design a personalized learning plan for the next learning session. In doing so, they take into account how the user has reacted in the past and suggest material that is engaging or enhances learning efficiency.

[0474] For example, if a user is using this system during a museum tour, the terminal will detect their interests and emotions and provide relevant additional information in real time. An example of a prompt message would be, "Please tell us about the most interesting episode from a past museum tour. Use that to generate guide information that might interest visitors."

[0475] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0476] Step 1:

[0477] The terminal provides an interface for the user to set their native language and the language to be learned. The language information selected by the user is sent to the server as input data. The server receives this information, performs data calculations to retrieve appropriate language resources from the database, and provides the output to the terminal.

[0478] Step 2:

[0479] The user provides voice input through the device's microphone. The device sends this voice data to the server. The server uses speech recognition to convert the voice data into text data. This process involves data processing called speech waveform analysis, and the output is in text format.

[0480] Step 3:

[0481] The server uses a generative AI model to take text data converted from speech as input and generate an appropriate response. The generative AI model performs natural language processing, analyzes the user's intent, and outputs the appropriate response text.

[0482] Step 4:

[0483] The generated response is translated by the server into the user's native language and the target language. The translated data is output and returned to the terminal. The terminal displays this on the screen, providing it to the user visually.

[0484] Step 5:

[0485] The device uses its camera and microphone to analyze the user's facial expressions and voice tone based on emotion recognition and adjustment mechanisms. The input for this analysis is the user's voice and video data, and the output is the user's emotional state. This information is sent to a server, influencing the tone and content of the generated response.

[0486] Step 6:

[0487] The server integrates the user's past voice data, learning history, and emotional information through recording and plan provisioning mechanisms to personalize the next learning plan. This process involves data calculations based on past data to generate the learning plan. The generated plan is then sent to the terminal for the next learning session.

[0488] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.

[0489] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). One example of data generation model 58 is ChatGPT (Internet search<URL: https: / / openai.com / blog / chatgpt> ), Gemini (Internet search) <url: https: gemini.google.com ?hl="ja">Examples of generative AI include the following. The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0490] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0491] [Third Embodiment]

[0492] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0493] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0494] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0495] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0496] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0497] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0498] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0499] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0500] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0501] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the identification processing unit 290.

[0502] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0503] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0504] This invention relates to a bilingual AI teacher system capable of engaging in dialogue in both the user's native language and the target language. This system provides interactive conversations with the user based on the language settings selected by the user. Specific embodiments of the system are described below.

[0505] When a user first accesses the system, the terminal prompts the user to select their native language and the foreign language they wish to learn. This information is sent to the server, which then retrieves the appropriate language resources from its database based on the selections.

[0506] To perform voice input, the user speaks through the microphone. The device acquires the voice data and sends it to the server. The server uses advanced speech recognition technology to convert the speech into text and analyzes its content. Based on this analysis, a generative AI model creates an appropriate response to the user's utterance.

[0507] The generated response is translated in real time by the server. The translated text is available in both the user's native language and the target language, and is sent to the terminal. The terminal displays both languages as subtitles on the user's screen. For example, if the user says "Hello, how are you?", the system will display "Hello, how are you?" as subtitles.

[0508] The server also analyzes the user's language use and identifies grammatical and pronunciation errors. The identified problems, along with suggestions for improvement, are sent to the device as feedback. This allows users to receive excellent feedback on their speech and effectively advance their learning.

[0509] This system can also utilize virtual reality (VR) and augmented reality (AR) environments. When users utilize these environments, the server can provide virtual scenes such as restaurants and airports, allowing them to simulate English conversations within those settings.

[0510] The server continuously records the user's learning progress. Based on this, the server provides the user with a learning plan optimized for the next learning session. The system stores the user's learning history in a database and analyzes this history to create a personalized learning plan. For example, if the user has a weak understanding of certain phrases or words, the next session will be adjusted to focus on those areas.

[0511] This system configuration allows users to learn a foreign language more effectively while receiving support in their native language.

[0512] The following describes the processing flow.

[0513] Step 1:

[0514] The user logs into the system and selects their native language and the language they wish to learn. The terminal receives this selection information and sends it to the server.

[0515] Step 2:

[0516] The server retrieves appropriate language resources from the database based on the user's selected language settings. This allows the system to provide an interface tailored to the user's language environment.

[0517] Step 3:

[0518] The user uses a microphone to speak. The device records the user's voice and sends the audio data to the server.

[0519] Step 4:

[0520] The server converts the received audio data into text using advanced speech recognition technology. This process makes the content of the audio analyzable as text data.

[0521] Step 5:

[0522] The server analyzes the text data converted from the speech and generates an appropriate response using a generative AI model. This response is precisely constructed based on the user's questions and utterances.

[0523] Step 6:

[0524] The server translates the generated response in real time into the user's native language and the language they are learning. The translated text data is then sent to the terminal.

[0525] Step 7:

[0526] The device displays the received translation data as subtitles on the user's screen. This allows the user to see the response to their speech in both their native language and the language they are learning.

[0527] Step 8:

[0528] The server analyzes the user's speech and identifies grammatical and pronunciation errors. It then sends feedback on the identified errors and suggestions for correction to the terminal.

[0529] Step 9:

[0530] The device displays feedback from the server to the user, supporting the user's learning improvement. This process allows the user to improve their language skills in real time.

[0531] Step 10:

[0532] If the user is using a virtual reality (VR) or augmented reality (AR) device, the server provides interaction within a virtual environment. This offers the user the opportunity to learn the language practically in a simulated setting.

[0533] Step 11:

[0534] The server tracks the user's learning progress and stores it in a database. Based on this data, it generates a personalized learning plan for the next learning session and proposes it to the user via the terminal.

[0535] (Example 1)

[0536] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0537] In language learning, effective support is needed to enable smooth communication in both the native language and the target language. Furthermore, real-time feedback and personalized learning plans tailored to learning progress are required. In particular, the immediate correction of pronunciation and grammatical errors, and the development of an interactive and enriching learning environment are key challenges.

[0538] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0539] In this invention, the server includes a setting means that allows the user to select their native language and the language to be learned, a means that analyzes the user's voice input and converts it into text data using speech recognition technology, and a generation means that creates an appropriate response based on the text data. This allows the user to effectively learn a foreign language while receiving translated responses in real time and obtaining feedback on pronunciation and grammar.

[0540] "Native language" refers to the language that a user uses on a daily basis.

[0541] The "target language" is the language that the user is trying to learn.

[0542] "Setting means" refers to a device or program that provides an interface and functions for the user to select a language.

[0543] "Voice input" refers to spoken data that a user makes to a device.

[0544] "Speech recognition technology" is a technology for converting speech data into text data.

[0545] "Generation means" refers to a device or program for generating an appropriate response to user input.

[0546] "Translation and display means" refers to a device or program that translates a generated response into another language and displays it on a visual device.

[0547] "Analysis and feedback means" refers to a device or program that has the function of analyzing the content of a user's speech and providing methods for correcting or improving errors.

[0548] "Environment provision means" refers to a device or program that provides users with a virtual or augmented real world and enables interaction.

[0549] "Recording and planning means" refers to a device or program that has the function of recording the user's learning progress and creating and providing an individualized learning plan based on that progress.

[0550] The "function to display as subtitles" refers to a technology or program that displays responses generated in real time as text on the screen.

[0551] This system consists of electronic devices and software that provide dialogue in the user's chosen native language and target language. Processing is primarily handled through the cooperation of a server and a terminal. Details are provided below.

[0552] First, when a user accesses the system, the terminal displays a settings screen through the user interface, allowing the user to select their preferred language. The language information selected by the user is then sent from the terminal to the server. The hardware used at this stage consists of a display device and an input device.

[0553] Next, the server reads the necessary language resources from a database corresponding to the received language settings. This process uses a database management system and memory caching technology. The server then prepares to analyze the voice input from the user and uses speech recognition technology (e.g., a speech recognition API) to convert the voice data from the device into text data.

[0554] The voice data spoken by the user through the microphone is converted into a digital format on the device and sent to the server in real time. The server analyzes this voice data using advanced speech recognition technology and generates an appropriate response using a generative AI model. The AI model used is a general natural language processing model.

[0555] The generated response is translated on the server, prepared in both the user's native language and the target language, and sent to the terminal. The terminal displays this response as subtitles on a visual display device. For example, if the user says "Hello, how are you?", the terminal will display "Hello, how are you?". This allows the user to see bilingual responses in real time.

[0556] Furthermore, the server analyzes the user's speech to identify grammatical and pronunciation errors and sends feedback to the device. This feedback includes specific suggestions for improvement. The device displays this on its screen, encouraging the user to improve their speaking skills.

[0557] The system can also provide virtual reality and augmented reality environments, allowing users to have an interactive learning experience. The server generates simulation scenes, enabling conversation practice tailored to the user's learning level.

[0558] The user's learning progress is continuously recorded, and the server uses this data to generate a personalized learning plan, which is then provided in the next learning session. For example, if the user has a weak understanding of a particular phrase or grammatical structure, the server will recommend learning content that focuses on that area.

[0559] Example of a prompt:

[0560] "Translates user utterances in real time and generates responses."

[0561] "Recognizes user speech and provides pronunciation feedback."

[0562] Therefore, users can effectively learn foreign languages using this system.

[0563] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0564] Step 1:

[0565] When a user accesses the system, the terminal displays a screen for selecting their native language and the language they wish to learn. The user selects a language through the provided interface and confirms their selection. This user input is sent to the server as language setting information. The input is the user's language selection, and the output is the language setting information.

[0566] Step 2:

[0567] The server reads relevant language resources from the database based on the received language setting information. This database includes dictionary data, conversation templates, and other similar resources. The server caches this data in memory to enable fast data access. The input is language setting information, and the output is language resources.

[0568] Step 3:

[0569] The user inputs voice through the microphone. The terminal acquires this voice data, converts it to a digital format, and then sends it to the server. The input is the user's voice, and the output is digital voice data.

[0570] Step 4:

[0571] The server converts received digital audio data into text using speech recognition technology. It utilizes a speech recognition API to analyze the audio signal and replace it with corresponding text data. This analysis process involves extracting acoustic features and matching phonemes. The input is digital audio data, and the output is text data.

[0572] Step 5:

[0573] The server uses a generative AI model to generate responses based on the analyzed text data. It applies natural language processing algorithms to determine appropriate communication content based on the context. The input is text data, and the output is a generated response.

[0574] Step 6:

[0575] The server translates the generated response. Using a translation API, it converts the response to the user's native language and the target language, preparing support in both languages. The input is the generated response, and the output is the translated response.

[0576] Step 7:

[0577] The terminal receives the translated response and displays it on the screen as subtitles. This process involves real-time subtitle generation and display using a visual display device. The input is the translated response, and the output is the subtitle display.

[0578] Step 8:

[0579] The server analyzes the grammar and pronunciation of spoken content and provides feedback on areas for correction and improvement. It uses language analysis tools to identify pronunciation errors and calculate methods for improvement. Input is user speech data, and output is feedback information.

[0580] Step 9:

[0581] The server generates data to provide virtual reality and augmented reality environments, offering users an interactive experience. This environment data changes dynamically based on the user's learning. The input is the user's learning needs, and the output is the virtual environment data.

[0582] Step 10:

[0583] The server records the user's learning progress and creates a personalized learning plan. It analyzes past data and suggests appropriate learning materials and practice content for the next session. The input is learning history data, and the output is the learning plan.

[0584] (Application Example 1)

[0585] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0586] Currently, foreign language learning at home typically involves using textbooks and audio materials. However, effectively acquiring the practical skills needed in real-life conversational situations is difficult. In particular, there is a lack of interactive learning support using humanoid robotic devices, and there is a need to provide an environment where users can easily learn foreign languages practically at home.

[0587] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0588] In this invention, the server includes setting means that allows the user to select their native language and the language to be learned; means that analyze the user's voice input and convert it into text data using speech recognition means; translation and display means that translate the generated response into the native language and the language to be learned and display it; and means that control a humanoid automated device that supports voice dialogue in the home and provides simulated conversation scenes. This enables the user to engage in practical foreign language learning in the home.

[0589] "Setting means" refers to a function that allows the user to select their native language and the foreign language they wish to learn.

[0590] "Speech recognition means" refers to technology that analyzes speech data obtained from a user and converts it into text data.

[0591] The "generation means" refers to a function that generates an appropriate response based on text data obtained through speech recognition.

[0592] "Translation and display means" refers to technology that translates the generated response into the native language and a foreign language and presents it visually.

[0593] "Analysis and feedback means" refers to a function that evaluates the grammar and pronunciation of spoken content and provides suggestions for improvement.

[0594] "Environment provision means" refers to technologies that construct scenarios in which users can interact through virtual or augmented reality.

[0595] "Recording and plan provision means" refers to a function that saves the user's learning progress and suggests personalized learning content.

[0596] "Means for controlling humanoid automated devices" refers to the technology for operating automated devices that provide simulated conversation scenarios within the home.

[0597] To implement this invention, a humanoid automation device for home use and a server to control it are required. The server uses the following main hardware and software: a "speech recognition library" for speech recognition, and a generation AI such as "GPT-3" or "BERT". A "VR / AR engine" is used to construct virtual scenes.

[0598] First, the user selects the foreign language they wish to learn and their native language using the settings function of the humanoid robot. Voice input is sent to the server via a microphone and converted into text data by speech recognition technology. The server then uses this data to generate appropriate dialogue using a generative AI model and translates it in real time into both selected languages. This translation is returned to the user visually or audibly through the robot.

[0599] The server also analyzes the grammar and pronunciation of the user's speech and provides feedback. Using virtual or augmented reality, it constructs scenarios in which the user can interact with various situations, such as ordering at a cafe or checking in at an airport. Learning progress is recorded and used as data to personalize the content of the next learning session.

[0600] For example, if a user wants to practice ordering coffee at a cafe, the server will virtually provide that scenario and present prompts to simulate the conversation through an automated device. By using a prompt such as, "You play the role of a cafe employee and have the user order coffee. Please gently correct any pronunciation mistakes," the user can practice speaking.

[0601] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0602] Step 1:

[0603] The user selects their native language and the foreign language they wish to learn from their device. The input is the user's language selection, and the output is the selected language settings information. The server receives this information and retrieves the appropriate language resources from its database.

[0604] Step 2:

[0605] The user speaks into the microphone. The input is the user's voice data, and the output is the voice data sent to the server. The terminal acquires this voice data and sends it to the server.

[0606] Step 3:

[0607] The server uses a speech recognition library to convert the received audio data into text data. The input is audio data, and the output is the converted text data. Speech recognition transcribes the user's speech into text.

[0608] Step 4:

[0609] The server uses a generative AI model to analyze text data and generate appropriate responses. The input is the text data to be analyzed, and the output is the generated response. Based on this analysis, a response that corresponds to the dialogue is formed.

[0610] Step 5:

[0611] The generated response is translated into the user's native language and the target foreign language via a translation function. The input is the generated response, and the output is the response translated into both languages. The translated content is delivered to the device and displayed to the user.

[0612] Step 6:

[0613] To provide feedback on user speech, the server performs grammatical and pronunciation analysis. The input is the user's text data, and the output is feedback information. The analysis identifies pronunciation and grammatical errors, and suggests ways to improve them.

[0614] Step 7:

[0615] The server utilizes a virtual or augmented reality environment to provide the user with a specific scene. The input is a request for scene selection, and the output is the generated virtual environment. The terminal then provides this scene to the user to facilitate interaction.

[0616] Step 8:

[0617] The user's learning progress is recorded, and the next learning plan is personalized. The input is the recorded learning data, and the output is an optimized learning plan. The server analyzes the data and creates a plan that will be useful for the next learning session.

[0618] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the identification processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform identification processing using the user's emotions.

[0619] This invention combines an emotion recognition function with a bilingual AI system that engages in dialogue using the user's native language and the target foreign language. The aim of this system is to improve the user's language learning experience.

[0620] When a user first accesses the system, the terminal prompts them to select their native language and the foreign language they wish to learn. This selection information is sent to the server, which retrieves appropriate language resources from its database. This is to provide a language interface optimized for the user.

[0621] When a user performs voice input, their speech is recorded using a microphone. The device sends this voice data to a server, which converts the speech into text data using speech recognition technology. Based on this text data, a generative AI model is used to generate an appropriate response that corresponds to the user's speech.

[0622] The generated responses are translated in real time into the user's native language and the target language, and sent from the server to the terminal. The terminal displays this on the user's screen in both languages, allowing the user to see the conversation in both languages. For example, if the user asks "How are you doing today?", the system will display "How are you?".

[0623] A key element of this invention is the emotion engine, which analyzes the user's speech content along with their voice tone and facial expression data to recognize the user's current emotional state. This is done through camera and voice analysis technology. The recognized emotion is sent to the server and reflected in the tone and content of the generated response. For example, if the user speaks in a discouraged voice, the server will generate a response that is more encouraging and kind.

[0624] Furthermore, this emotional data is recorded along with the user's learning history. Based on this history, the server analyzes how the learning content emotionally affected the user. The results of this data analysis help design individual learning plans, for example, suggesting learning materials and simulations that enhance enjoyment and excitement for users whose modifier level is low.

[0625] Virtual reality and augmented reality environments allow users to practice dialogue in various scenarios. For example, when a user practices ordering in a virtual restaurant simulation, an emotion engine can provide emotionally responsive interactions.

[0626] As a result, this invention goes beyond mere language acquisition, providing learning support tailored to the user's emotions, thereby guaranteeing a more effective and fulfilling learning experience for the user.

[0627] The following describes the processing flow.

[0628] Step 1:

[0629] The user accesses the system and selects their native language and the language they wish to learn. The terminal sends this selection information to the server.

[0630] Step 2:

[0631] The server retrieves the corresponding language resources from the database based on the selected language setting. Based on these resources, it prepares a language interface suitable for the user.

[0632] Step 3:

[0633] The user initiates conversation mode and speaks through the microphone. The device records the audio data and sends it to the server.

[0634] Step 4:

[0635] The server converts the received audio data into text data using speech recognition technology. This text is then analyzed to understand the content of the speech.

[0636] Step 5:

[0637] The server generates an appropriate response using a generative AI model based on the analyzed text data. This response is created in the target language.

[0638] Step 6:

[0639] The generated response is translated by the server into the user's native language and the target language. The translated content is sent to the device and displayed as subtitles on the screen.

[0640] Step 7:

[0641] Simultaneously, the device captures the user's voice tone and facial expressions using its camera and microphone, and sends them to the server.

[0642] Step 8:

[0643] The server's emotion engine analyzes the transmitted voice tone and facial expression data to recognize the user's emotional state.

[0644] Step 9:

[0645] Based on the user's perceived emotions, the server adjusts the tone and content of its responses. For example, if the user is feeling down, it might include encouraging messages.

[0646] Step 10:

[0647] The device displays the adjusted response to the user with subtitles, enabling emotionally sensitive interaction.

[0648] Step 11:

[0649] The server records both user sentiment data and learning content. Based on this, it creates a personalized learning plan that adjusts the content for the next learning session, thereby improving the user's learning efficiency.

[0650] Step 12:

[0651] When a user uses a VR / AR device, the server creates a virtual reality environment and provides a simulation that allows the user to interact within it.

[0652] (Example 2)

[0653] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0654] Conventional language learning systems struggle to adapt to individual user emotional states, potentially leading to decreased learning efficiency. Furthermore, they cannot adjust learning content according to emotional states, making it difficult to maintain user motivation. Additionally, real-time response generation presents a challenge in its inability to respond flexibly to user emotions.

[0655] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0656] In this invention, the server includes emotion recognition means for analyzing the tone of voice and facial expressions in the user's speech to recognize emotions; response adjustment means for adjusting the response generated based on emotion recognition; and learning plan provision means for personalizing the content of the next learning session based on the user's learning history and emotional state. This provides an interactive learning experience that responds to the user's emotional state, enabling effective learning while maintaining high user motivation.

[0657] "Settings" refers to a function that provides an interface for users to select their native language and a foreign language.

[0658] "Speech recognition means" refers to a technology for converting a user's voice input into text data, and is the process of recognizing speech and converting it into text.

[0659] "Generation means" includes algorithms and processes for creating appropriate responses based on text data.

[0660] "Translation and display means" refers to a method for translating the generated response into the user's native language and a foreign language and providing it to the user visually.

[0661] The "analysis and feedback mechanism" is a system that analyzes the grammar and pronunciation of the user's speech and provides feedback based on the results.

[0662] "Environment provision means" refers to a function that provides a virtual reality or augmented reality setting, enabling users to interact within that setting.

[0663] "Recording and plan provision means" refers to a system element that records the user's learning progress and proposes an individualized learning plan based on that data.

[0664] "Emotion recognition means" refers to technology that analyzes the user's voice tone and facial expression data to determine their current emotional state.

[0665] "Response adjustment means" refers to the process of appropriately adjusting the tone and content of a response based on the results of emotion recognition.

[0666] This invention is a bilingual AI system for improving the user's language learning experience, enabling interactive learning that takes into account the user's emotional state. Specific embodiments are described below.

[0667] The server runs on hardware with a high-performance processor and ample storage, and uses software such as generative AI models (e.g., GPT-3) and speech recognition technologies (e.g., common speech recognition APIs). This allows it to instantly convert user voice data into text data and generate appropriate responses.

[0668] The terminal functions as the user interface and is a device equipped with a touchscreen, microphone, and camera. The terminal receives input from the user and communicates data bidirectionally with the server. Specifically, the user selects their native language and the language to be learned on the terminal's screen, which activates the settings mechanism and sends the selection information to the server.

[0669] When a user makes a voice input into the terminal, the terminal records the voice and sends the voice data to the server. The server analyzes this data using speech recognition means and converts it into text data. Next, it uses generation means to generate an appropriate response based on this text data. Subsequently, the generated response is translated into both the user's native language and a foreign language through translation and display means and displayed on the user's terminal.

[0670] For emotion recognition, the device collects the user's voice tone and facial expressions using a camera and voice analysis technology, and transmits them to a server. The server uses emotion recognition means to determine the user's emotional state from this data and reflects the result in the response adjustment means. For example, if the user is disappointed, the server will generate a response in a gentle tone.

[0671] As a concrete example of this system, if a user types "How are you doing today?" into the terminal, the system will display a translated response such as "How are you?" via the server. Furthermore, by using a prompt such as "Generate a gentle and encouraging response in Japanese and English when the user is feeling down," the system can generate a contextually appropriate response that matches the user's emotions.

[0672] Thus, the system of the present invention can support the user's learning experience and provide emotionally responsive interactions through interactive communication based on voice input and generated responses.

[0673] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0674] Step 1:

[0675] The device prompts the user to select a language.

[0676] The user selects their native language and the foreign language they wish to learn from a list displayed on the device screen. The selected language information is entered, and the device sends this information to the server. Based on the received language information, the server searches its database for and retrieves appropriate language resources. This process allows the user to utilize an optimized language interface.

[0677] Step 2:

[0678] The user inputs by speaking.

[0679] The user speaks into the device's microphone. The device records this voice as a digital signal through the microphone. This digital signal is sent from the device to the server. The server receives this voice signal as input and converts it into text data using speech recognition. This conversion results in the user's voice being output as text data.

[0680] Step 3:

[0681] The server generates a response using a generated AI model.

[0682] The server takes the text data obtained through speech recognition as input and passes it to the generative AI model as a prompt. The generative AI model generates an appropriate response based on the input text, while understanding the context. This process results in a natural language response that corresponds to the user's input.

[0683] Step 4:

[0684] The server translates the generated response and sends it to the terminal.

[0685] The server performs a translation process to translate the generated response into the user's native language and the target language. Specifically, it uses a translation API to convert the response into two languages. This translated response data is output from the server to the terminal, allowing the user to view the response in both languages on the terminal's display.

[0686] Step 5:

[0687] The device collects user sentiment data and sends it to the server.

[0688] While the user speaks, the device continuously records the user's facial expressions and voice tone through its camera and microphone, collecting emotional data. This data is sent from the device to a server. The server uses emotion recognition to estimate the user's current emotional state from the acquired data. Based on this, this emotional information is taken into consideration when generating subsequent responses.

[0689] Step 6:

[0690] The server adjusts its response based on emotions.

[0691] Based on the output of the emotion recognition system, the server readjusts the tone and content of the generated response. For example, if the server detects that the user is depressed, it selects a response that includes encouragement and kindness. The adjusted response data is sent to the terminal as the final output, which the user can then receive.

[0692] (Application Example 2)

[0693] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0694] In today's learning environment, many people struggle to understand the nuances of different cultures and languages when trying to acquire a language. Furthermore, traditional language learning systems often provide uniform responses without considering the user's feelings, which can undermine learners' motivation. Therefore, a system is needed that takes into account the feelings of individual users and provides appropriate responses.

[0695] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0696] In this invention, the server includes a setting means that enables the selection of the native language and the language to be learned; a means that analyzes the user's voice input and converts it into text data by speech recognition; and an emotion recognition and adjustment means that recognizes the user's emotions and adjusts the content of the response based on those emotions. This makes it possible to provide a personalized learning experience that responds to the user's emotions and improve the quality of learning.

[0697] "Setting means" refers to a function that provides an interface for users to select their native language and the foreign language they wish to learn.

[0698] "Voice recognition means" refers to technology that converts a user's voice input into text data.

[0699] "Generation method" refers to the process of creating an appropriate response based on text data.

[0700] "Translation and display means" refers to a system that translates the generated response into the user's native language and the target language, and presents it visually to the user.

[0701] "Analysis and feedback means" refers to a system that analyzes the user's speech and provides feedback on grammar and pronunciation.

[0702] "Environment provision means" refers to technologies that provide users with a virtual reality or augmented reality environment, enabling them to interact with it.

[0703] "Recording and plan provision means" refers to a system for recording a user's learning progress and designing an individualized learning plan.

[0704] "Emotion recognition and adjustment means" refers to a function that recognizes the user's emotions and adjusts the content and tone of responses based on those emotions.

[0705] To implement this invention, it is first necessary for data to be transmitted and received between the server and the terminal. When a user accesses the system from the terminal, the terminal selects its native language and the foreign language to be learned through a configuration means. This selection information is sent to the server, which retrieves appropriate language resources from a database. This database is intended to provide a language interface optimized for the user.

[0706] Next, the user's voice input is recorded by the device's microphone. Speech recognition technology is used to convert this into text data. Based on this text data, the server uses a generative AI model to generate an appropriate response according to the user's utterance. This generated response is translated in real time into the user's native language and the target language by translation and display means and displayed on the device.

[0707] Furthermore, emotion recognition and adjustment mechanisms analyze the user's tone of voice and facial expressions as emotions via the camera. This emotion data is sent to a server and reflected in the content and tone of the generated response. For example, when the user has a cheerful expression, the server provides more engaging content.

[0708] Furthermore, the recording and plan provisioning mechanisms integrate learning history and sentiment data to design a personalized learning plan for the next learning session. In doing so, they take into account how the user has reacted in the past and suggest material that is engaging or enhances learning efficiency.

[0709] For example, if a user is using this system during a museum tour, the terminal will detect their interests and emotions and provide relevant additional information in real time. An example of a prompt message would be, "Please tell us about the most interesting episode from a past museum tour. Use that to generate guide information that might interest visitors."

[0710] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0711] Step 1:

[0712] The terminal provides an interface for the user to set their native language and the language to be learned. The language information selected by the user is sent to the server as input data. The server receives this information, performs data calculations to retrieve appropriate language resources from the database, and provides the output to the terminal.

[0713] Step 2:

[0714] The user provides voice input through the device's microphone. The device sends this voice data to the server. The server uses speech recognition to convert the voice data into text data. This process involves data processing called speech waveform analysis, and the output is in text format.

[0715] Step 3:

[0716] The server uses a generative AI model to take text data converted from speech as input and generate an appropriate response. The generative AI model performs natural language processing, analyzes the user's intent, and outputs the appropriate response text.

[0717] Step 4:

[0718] The generated response is translated by the server into the user's native language and the target language. The translated data is output and returned to the terminal. The terminal displays this on the screen, providing it to the user visually.

[0719] Step 5:

[0720] The device uses its camera and microphone to analyze the user's facial expressions and voice tone based on emotion recognition and adjustment mechanisms. The input for this analysis is the user's voice and video data, and the output is the user's emotional state. This information is sent to a server, influencing the tone and content of the generated response.

[0721] Step 6:

[0722] The server integrates the user's past voice data, learning history, and emotional information through recording and plan provisioning mechanisms to personalize the next learning plan. This process involves data calculations based on past data to generate the learning plan. The generated plan is then sent to the terminal for the next learning session.

[0723] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.

[0724] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). One example of data generation model 58 is ChatGPT (Internet search<URL: https: / / openai.com / blog / chatgpt> ), Gemini (Internet search) <url: https: gemini.google.com ?hl="ja">Examples of generative AI include the following. The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0725] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0726] [Fourth Embodiment]

[0727] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0728] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0729] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0730] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0731] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0732] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0733] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0734] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0735] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0736] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0737] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the identification processing unit 290.

[0738] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0739] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0740] This invention relates to a bilingual AI teacher system capable of engaging in dialogue in both the user's native language and the target language. This system provides interactive conversations with the user based on the language settings selected by the user. Specific embodiments of the system are described below.

[0741] When a user first accesses the system, the terminal prompts the user to select their native language and the foreign language they wish to learn. This information is sent to the server, which then retrieves the appropriate language resources from its database based on the selections.

[0742] To perform voice input, the user speaks through the microphone. The device acquires the voice data and sends it to the server. The server uses advanced speech recognition technology to convert the speech into text and analyzes its content. Based on this analysis, a generative AI model creates an appropriate response to the user's utterance.

[0743] The generated response is translated in real time by the server. The translated text is available in both the user's native language and the target language, and is sent to the terminal. The terminal displays both languages as subtitles on the user's screen. For example, if the user says "Hello, how are you?", the system will display "Hello, how are you?" as subtitles.

[0744] The server also analyzes the user's language use and identifies grammatical and pronunciation errors. The identified problems, along with suggestions for improvement, are sent to the device as feedback. This allows users to receive excellent feedback on their speech and effectively advance their learning.

[0745] This system can also utilize virtual reality (VR) and augmented reality (AR) environments. When users utilize these environments, the server can provide virtual scenes such as restaurants and airports, allowing them to simulate English conversations within those settings.

[0746] The server continuously records the user's learning progress. Based on this, the server provides the user with a learning plan optimized for the next learning session. The system stores the user's learning history in a database and analyzes this history to create a personalized learning plan. For example, if the user has a weak understanding of certain phrases or words, the next session will be adjusted to focus on those areas.

[0747] This system configuration allows users to learn a foreign language more effectively while receiving support in their native language.

[0748] The following describes the processing flow.

[0749] Step 1:

[0750] The user logs into the system and selects their native language and the language they wish to learn. The terminal receives this selection information and sends it to the server.

[0751] Step 2:

[0752] The server retrieves appropriate language resources from the database based on the user's selected language settings. This allows the system to provide an interface tailored to the user's language environment.

[0753] Step 3:

[0754] The user uses a microphone to speak. The device records the user's voice and sends the audio data to the server.

[0755] Step 4:

[0756] The server converts the received audio data into text using advanced speech recognition technology. This process makes the content of the audio analyzable as text data.

[0757] Step 5:

[0758] The server analyzes the text data converted from the speech and generates an appropriate response using a generative AI model. This response is precisely constructed based on the user's questions and utterances.

[0759] Step 6:

[0760] The server translates the generated response in real time into the user's native language and the language they are learning. The translated text data is then sent to the terminal.

[0761] Step 7:

[0762] The device displays the received translation data as subtitles on the user's screen. This allows the user to see the response to their speech in both their native language and the language they are learning.

[0763] Step 8:

[0764] The server analyzes the user's speech and identifies grammatical and pronunciation errors. It then sends feedback on the identified errors and suggestions for correction to the terminal.

[0765] Step 9:

[0766] The device displays feedback from the server to the user, supporting the user's learning improvement. This process allows the user to improve their language skills in real time.

[0767] Step 10:

[0768] If the user is using a virtual reality (VR) or augmented reality (AR) device, the server provides interaction within a virtual environment. This offers the user the opportunity to learn the language practically in a simulated setting.

[0769] Step 11:

[0770] The server tracks the user's learning progress and stores it in a database. Based on this data, it generates a personalized learning plan for the next learning session and proposes it to the user via the terminal.

[0771] (Example 1)

[0772] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0773] In language learning, effective support is needed to enable smooth communication in both the native language and the target language. Furthermore, real-time feedback and personalized learning plans tailored to learning progress are required. In particular, the immediate correction of pronunciation and grammatical errors, and the development of an interactive and enriching learning environment are key challenges.

[0774] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0775] In this invention, the server includes a setting means that allows the user to select their native language and the language to be learned, a means that analyzes the user's voice input and converts it into text data using speech recognition technology, and a generation means that creates an appropriate response based on the text data. This allows the user to effectively learn a foreign language while receiving translated responses in real time and obtaining feedback on pronunciation and grammar.

[0776] "Native language" refers to the language that a user uses on a daily basis.

[0777] The "target language" is the language that the user is trying to learn.

[0778] "Setting means" refers to a device or program that provides an interface and functions for the user to select a language.

[0779] "Voice input" refers to spoken data that a user makes to a device.

[0780] "Speech recognition technology" is a technology for converting speech data into text data.

[0781] "Generation means" refers to a device or program for generating an appropriate response to user input.

[0782] "Translation and display means" refers to a device or program that translates a generated response into another language and displays it on a visual device.

[0783] "Analysis and feedback means" refers to a device or program that has the function of analyzing the content of a user's speech and providing methods for correcting or improving errors.

[0784] "Environment provision means" refers to a device or program that provides users with a virtual or augmented real world and enables interaction.

[0785] "Recording and planning means" refers to a device or program that has the function of recording the user's learning progress and creating and providing an individualized learning plan based on that progress.

[0786] The "function to display as subtitles" refers to a technology or program that displays responses generated in real time as text on the screen.

[0787] This system consists of electronic devices and software that provide dialogue in the user's chosen native language and target language. Processing is primarily handled through the cooperation of a server and a terminal. Details are provided below.

[0788] First, when a user accesses the system, the terminal displays a settings screen through the user interface, allowing the user to select their preferred language. The language information selected by the user is then sent from the terminal to the server. The hardware used at this stage consists of a display device and an input device.

[0789] Next, the server reads the necessary language resources from a database corresponding to the received language settings. This process uses a database management system and memory caching technology. The server then prepares to analyze the voice input from the user and uses speech recognition technology (e.g., a speech recognition API) to convert the voice data from the device into text data.

[0790] The voice data spoken by the user through the microphone is converted into a digital format on the device and sent to the server in real time. The server analyzes this voice data using advanced speech recognition technology and generates an appropriate response using a generative AI model. The AI model used is a general natural language processing model.

[0791] The generated response is translated on the server, prepared in both the user's native language and the target language, and sent to the terminal. The terminal displays this response as subtitles on a visual display device. For example, if the user says "Hello, how are you?", the terminal will display "Hello, how are you?". This allows the user to see bilingual responses in real time.

[0792] Furthermore, the server analyzes the user's speech to identify grammatical and pronunciation errors and sends feedback to the device. This feedback includes specific suggestions for improvement. The device displays this on its screen, encouraging the user to improve their speaking skills.

[0793] The system can also provide virtual reality and augmented reality environments, allowing users to have an interactive learning experience. The server generates simulation scenes, enabling conversation practice tailored to the user's learning level.

[0794] The user's learning progress is continuously recorded, and the server uses this data to generate a personalized learning plan, which is then provided in the next learning session. For example, if the user has a weak understanding of a particular phrase or grammatical structure, the server will recommend learning content that focuses on that area.

[0795] Example of a prompt:

[0796] "Translates user utterances in real time and generates responses."

[0797] "Recognizes user speech and provides pronunciation feedback."

[0798] Therefore, users can effectively learn foreign languages using this system.

[0799] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0800] Step 1:

[0801] When a user accesses the system, the terminal displays a screen for selecting their native language and the language they wish to learn. The user selects a language through the provided interface and confirms their selection. This user input is sent to the server as language setting information. The input is the user's language selection, and the output is the language setting information.

[0802] Step 2:

[0803] The server reads relevant language resources from the database based on the received language setting information. This database includes dictionary data, conversation templates, and other similar resources. The server caches this data in memory to enable fast data access. The input is language setting information, and the output is language resources.

[0804] Step 3:

[0805] The user inputs voice through the microphone. The terminal acquires this voice data, converts it to a digital format, and then sends it to the server. The input is the user's voice, and the output is digital voice data.

[0806] Step 4:

[0807] The server converts received digital audio data into text using speech recognition technology. It utilizes a speech recognition API to analyze the audio signal and replace it with corresponding text data. This analysis process involves extracting acoustic features and matching phonemes. The input is digital audio data, and the output is text data.

[0808] Step 5:

[0809] The server uses a generative AI model to generate responses based on the analyzed text data. It applies natural language processing algorithms to determine appropriate communication content based on the context. The input is text data, and the output is a generated response.

[0810] Step 6:

[0811] The server translates the generated response. Using a translation API, it converts the response to the user's native language and the target language, preparing support in both languages. The input is the generated response, and the output is the translated response.

[0812] Step 7:

[0813] The terminal receives the translated response and displays it on the screen as subtitles. This process involves real-time subtitle generation and display using a visual display device. The input is the translated response, and the output is the subtitle display.

[0814] Step 8:

[0815] The server analyzes the grammar and pronunciation of spoken content and provides feedback on areas for correction and improvement. It uses language analysis tools to identify pronunciation errors and calculate methods for improvement. Input is user speech data, and output is feedback information.

[0816] Step 9:

[0817] The server generates data to provide virtual reality and augmented reality environments, offering users an interactive experience. This environment data changes dynamically based on the user's learning. The input is the user's learning needs, and the output is the virtual environment data.

[0818] Step 10:

[0819] The server records the user's learning progress and creates a personalized learning plan. It analyzes past data and suggests appropriate learning materials and practice content for the next session. The input is learning history data, and the output is the learning plan.

[0820] (Application Example 1)

[0821] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0822] Currently, foreign language learning at home typically involves using textbooks and audio materials. However, effectively acquiring the practical skills needed in real-life conversational situations is difficult. In particular, there is a lack of interactive learning support using humanoid robotic devices, and there is a need to provide an environment where users can easily learn foreign languages practically at home.

[0823] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0824] In this invention, the server includes setting means that allows the user to select their native language and the language to be learned; means that analyze the user's voice input and convert it into text data using speech recognition means; translation and display means that translate the generated response into the native language and the language to be learned and display it; and means that control a humanoid automated device that supports voice dialogue in the home and provides simulated conversation scenes. This enables the user to engage in practical foreign language learning in the home.

[0825] "Setting means" refers to a function that allows the user to select their native language and the foreign language they wish to learn.

[0826] "Speech recognition means" refers to technology that analyzes speech data obtained from a user and converts it into text data.

[0827] The "generation means" refers to a function that generates an appropriate response based on text data obtained through speech recognition.

[0828] "Translation and display means" refers to technology that translates the generated response into the native language and a foreign language and presents it visually.

[0829] "Analysis and feedback means" refers to a function that evaluates the grammar and pronunciation of spoken content and provides suggestions for improvement.

[0830] "Environment provision means" refers to technologies that construct scenarios in which users can interact through virtual or augmented reality.

[0831] "Recording and plan provision means" refers to a function that saves the user's learning progress and suggests personalized learning content.

[0832] "Means for controlling humanoid automated devices" refers to the technology for operating automated devices that provide simulated conversation scenarios within the home.

[0833] To implement this invention, a humanoid automation device for home use and a server to control it are required. The server uses the following main hardware and software: a "speech recognition library" for speech recognition, and a generation AI such as "GPT-3" or "BERT". A "VR / AR engine" is used to construct virtual scenes.

[0834] First, the user selects the foreign language they wish to learn and their native language using the settings function of the humanoid robot. Voice input is sent to the server via a microphone and converted into text data by speech recognition technology. The server then uses this data to generate appropriate dialogue using a generative AI model and translates it in real time into both selected languages. This translation is returned to the user visually or audibly through the robot.

[0835] The server also analyzes the grammar and pronunciation of the user's speech and provides feedback. Using virtual or augmented reality, it constructs scenarios in which the user can interact with various situations, such as ordering at a cafe or checking in at an airport. Learning progress is recorded and used as data to personalize the content of the next learning session.

[0836] For example, if a user wants to practice ordering coffee at a cafe, the server will virtually provide that scenario and present prompts to simulate the conversation through an automated device. By using a prompt such as, "You play the role of a cafe employee and have the user order coffee. Please gently correct any pronunciation mistakes," the user can practice speaking.

[0837] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0838] Step 1:

[0839] The user selects their native language and the foreign language they wish to learn from their device. The input is the user's language selection, and the output is the selected language settings information. The server receives this information and retrieves the appropriate language resources from its database.

[0840] Step 2:

[0841] The user speaks into the microphone. The input is the user's voice data, and the output is the voice data sent to the server. The terminal acquires this voice data and sends it to the server.

[0842] Step 3:

[0843] The server uses a speech recognition library to convert the received audio data into text data. The input is audio data, and the output is the converted text data. Speech recognition transcribes the user's speech into text.

[0844] Step 4:

[0845] The server uses a generative AI model to analyze text data and generate appropriate responses. The input is the text data to be analyzed, and the output is the generated response. Based on this analysis, a response that corresponds to the dialogue is formed.

[0846] Step 5:

[0847] The generated response is translated into the user's native language and the target foreign language via a translation function. The input is the generated response, and the output is the response translated into both languages. The translated content is delivered to the device and displayed to the user.

[0848] Step 6:

[0849] To provide feedback on user speech, the server performs grammatical and pronunciation analysis. The input is the user's text data, and the output is feedback information. The analysis identifies pronunciation and grammatical errors, and suggests ways to improve them.

[0850] Step 7:

[0851] The server utilizes a virtual or augmented reality environment to provide the user with a specific scene. The input is a request for scene selection, and the output is the generated virtual environment. The terminal then provides this scene to the user to facilitate interaction.

[0852] Step 8:

[0853] The user's learning progress is recorded, and the next learning plan is personalized. The input is the recorded learning data, and the output is an optimized learning plan. The server analyzes the data and creates a plan that will be useful for the next learning session.

[0854] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the identification processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform identification processing using the user's emotions.

[0855] This invention combines an emotion recognition function with a bilingual AI system that engages in dialogue using the user's native language and the target foreign language. The aim of this system is to improve the user's language learning experience.

[0856] When a user first accesses the system, the terminal prompts them to select their native language and the foreign language they wish to learn. This selection information is sent to the server, which retrieves appropriate language resources from its database. This is to provide a language interface optimized for the user.

[0857] When a user performs voice input, their speech is recorded using a microphone. The device sends this voice data to a server, which converts the speech into text data using speech recognition technology. Based on this text data, a generative AI model is used to generate an appropriate response that corresponds to the user's speech.

[0858] The generated responses are translated in real time into the user's native language and the target language, and sent from the server to the terminal. The terminal displays this on the user's screen in both languages, allowing the user to see the conversation in both languages. For example, if the user asks "How are you doing today?", the system will display "How are you?".

[0859] A key element of this invention is the emotion engine, which analyzes the user's speech content along with their voice tone and facial expression data to recognize the user's current emotional state. This is done through camera and voice analysis technology. The recognized emotion is sent to the server and reflected in the tone and content of the generated response. For example, if the user speaks in a discouraged voice, the server will generate a response that is more encouraging and kind.

[0860] Furthermore, this emotional data is recorded along with the user's learning history. Based on this history, the server analyzes how the learning content emotionally affected the user. The results of this data analysis help design individual learning plans, for example, suggesting learning materials and simulations that enhance enjoyment and excitement for users whose modifier level is low.

[0861] Virtual reality and augmented reality environments allow users to practice dialogue in various scenarios. For example, when a user practices ordering in a virtual restaurant simulation, an emotion engine can provide emotionally responsive interactions.

[0862] As a result, this invention goes beyond mere language acquisition, providing learning support tailored to the user's emotions, thereby guaranteeing a more effective and fulfilling learning experience for the user.

[0863] The following describes the processing flow.

[0864] Step 1:

[0865] The user accesses the system and selects their native language and the language they wish to learn. The terminal sends this selection information to the server.

[0866] Step 2:

[0867] The server retrieves the corresponding language resources from the database based on the selected language setting. Based on these resources, it prepares a language interface suitable for the user.

[0868] Step 3:

[0869] The user initiates conversation mode and speaks through the microphone. The device records the audio data and sends it to the server.

[0870] Step 4:

[0871] The server converts the received audio data into text data using speech recognition technology. This text is then analyzed to understand the content of the speech.

[0872] Step 5:

[0873] The server generates an appropriate response using a generative AI model based on the analyzed text data. This response is created in the target language.

[0874] Step 6:

[0875] The generated response is translated by the server into the user's native language and the target language. The translated content is sent to the device and displayed as subtitles on the screen.

[0876] Step 7:

[0877] Simultaneously, the device captures the user's voice tone and facial expressions using its camera and microphone, and sends them to the server.

[0878] Step 8:

[0879] The server's emotion engine analyzes the transmitted voice tone and facial expression data to recognize the user's emotional state.

[0880] Step 9:

[0881] Based on the user's perceived emotions, the server adjusts the tone and content of its responses. For example, if the user is feeling down, it might include encouraging messages.

[0882] Step 10:

[0883] The device displays the adjusted response to the user with subtitles, enabling emotionally sensitive interaction.

[0884] Step 11:

[0885] The server records both user sentiment data and learning content. Based on this, it creates a personalized learning plan that adjusts the content for the next learning session, thereby improving the user's learning efficiency.

[0886] Step 12:

[0887] When a user uses a VR / AR device, the server creates a virtual reality environment and provides a simulation that allows the user to interact within it.

[0888] (Example 2)

[0889] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0890] Conventional language learning systems struggle to adapt to individual user emotional states, potentially leading to decreased learning efficiency. Furthermore, they cannot adjust learning content according to emotional states, making it difficult to maintain user motivation. Additionally, real-time response generation presents a challenge in its inability to respond flexibly to user emotions.

[0891] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0892] In this invention, the server includes emotion recognition means for analyzing the tone of voice and facial expressions in the user's speech to recognize emotions; response adjustment means for adjusting the response generated based on emotion recognition; and learning plan provision means for personalizing the content of the next learning session based on the user's learning history and emotional state. This provides an interactive learning experience that responds to the user's emotional state, enabling effective learning while maintaining high user motivation.

[0893] "Settings" refers to a function that provides an interface for users to select their native language and a foreign language.

[0894] "Speech recognition means" refers to a technology for converting a user's voice input into text data, and is the process of recognizing speech and converting it into text.

[0895] "Generation means" includes algorithms and processes for creating appropriate responses based on text data.

[0896] "Translation and display means" refers to a method for translating the generated response into the user's native language and a foreign language and providing it to the user visually.

[0897] The "analysis and feedback mechanism" is a system that analyzes the grammar and pronunciation of the user's speech and provides feedback based on the results.

[0898] "Environment provision means" refers to a function that provides a virtual reality or augmented reality setting, enabling users to interact within that setting.

[0899] "Recording and plan provision means" refers to a system element that records the user's learning progress and proposes an individualized learning plan based on that data.

[0900] "Emotion recognition means" refers to technology that analyzes the user's voice tone and facial expression data to determine their current emotional state.

[0901] "Response adjustment means" refers to the process of appropriately adjusting the tone and content of a response based on the results of emotion recognition.

[0902] This invention is a bilingual AI system for improving the user's language learning experience, enabling interactive learning that takes into account the user's emotional state. Specific embodiments are described below.

[0903] The server runs on hardware with a high-performance processor and ample storage, and uses software such as generative AI models (e.g., GPT-3) and speech recognition technologies (e.g., common speech recognition APIs). This allows it to instantly convert user voice data into text data and generate appropriate responses.

[0904] The terminal functions as the user interface and is a device equipped with a touchscreen, microphone, and camera. The terminal receives input from the user and communicates data bidirectionally with the server. Specifically, the user selects their native language and the language to be learned on the terminal's screen, which activates the settings mechanism and sends the selection information to the server.

[0905] When a user makes a voice input into the terminal, the terminal records the voice and sends the voice data to the server. The server analyzes this data using speech recognition means and converts it into text data. Next, it uses generation means to generate an appropriate response based on this text data. Subsequently, the generated response is translated into both the user's native language and a foreign language through translation and display means and displayed on the user's terminal.

[0906] For emotion recognition, the device collects the user's voice tone and facial expressions using a camera and voice analysis technology, and transmits them to a server. The server uses emotion recognition means to determine the user's emotional state from this data and reflects the result in the response adjustment means. For example, if the user is disappointed, the server will generate a response in a gentle tone.

[0907] As a concrete example of this system, if a user types "How are you doing today?" into the terminal, the system will display a translated response such as "How are you?" via the server. Furthermore, by using a prompt such as "Generate a gentle and encouraging response in Japanese and English when the user is feeling down," the system can generate a contextually appropriate response that matches the user's emotions.

[0908] Thus, the system of the present invention can support the user's learning experience and provide emotionally responsive interactions through interactive communication based on voice input and generated responses.

[0909] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0910] Step 1:

[0911] The device prompts the user to select a language.

[0912] The user selects their native language and the foreign language they wish to learn from a list displayed on the device screen. The selected language information is entered, and the device sends this information to the server. Based on the received language information, the server searches its database for and retrieves appropriate language resources. This process allows the user to utilize an optimized language interface.

[0913] Step 2:

[0914] The user inputs by speaking.

[0915] The user speaks into the device's microphone. The device records this voice as a digital signal through the microphone. This digital signal is sent from the device to the server. The server receives this voice signal as input and converts it into text data using speech recognition. This conversion results in the user's voice being output as text data.

[0916] Step 3:

[0917] The server generates a response using a generated AI model.

[0918] The server takes the text data obtained through speech recognition as input and passes it to the generative AI model as a prompt. The generative AI model generates an appropriate response based on the input text, while understanding the context. This process results in a natural language response that corresponds to the user's input.

[0919] Step 4:

[0920] The server translates the generated response and sends it to the terminal.

[0921] The server performs a translation process to translate the generated response into the user's native language and the target language. Specifically, it uses a translation API to convert the response into two languages. This translated response data is output from the server to the terminal, allowing the user to view the response in both languages on the terminal's display.

[0922] Step 5:

[0923] The device collects user sentiment data and sends it to the server.

[0924] While the user speaks, the device continuously records the user's facial expressions and voice tone through its camera and microphone, collecting emotional data. This data is sent from the device to a server. The server uses emotion recognition to estimate the user's current emotional state from the acquired data. Based on this, this emotional information is taken into consideration when generating subsequent responses.

[0925] Step 6:

[0926] The server adjusts its response based on emotions.

[0927] Based on the output of the emotion recognition system, the server readjusts the tone and content of the generated response. For example, if the server detects that the user is depressed, it selects a response that includes encouragement and kindness. The adjusted response data is sent to the terminal as the final output, which the user can then receive.

[0928] (Application Example 2)

[0929] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0930] In today's learning environment, many people struggle to understand the nuances of different cultures and languages when trying to acquire a language. Furthermore, traditional language learning systems often provide uniform responses without considering the user's feelings, which can undermine learners' motivation. Therefore, a system is needed that takes into account the feelings of individual users and provides appropriate responses.

[0931] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0932] In this invention, the server includes a setting means that enables the selection of the native language and the language to be learned; a means that analyzes the user's voice input and converts it into text data by speech recognition; and an emotion recognition and adjustment means that recognizes the user's emotions and adjusts the content of the response based on those emotions. This makes it possible to provide a personalized learning experience that responds to the user's emotions and improve the quality of learning.

[0933] "Setting means" refers to a function that provides an interface for users to select their native language and the foreign language they wish to learn.

[0934] "Voice recognition means" refers to technology that converts a user's voice input into text data.

[0935] "Generation method" refers to the process of creating an appropriate response based on text data.

[0936] "Translation and display means" refers to a system that translates the generated response into the user's native language and the target language, and presents it visually to the user.

[0937] "Analysis and feedback means" refers to a system that analyzes the user's speech and provides feedback on grammar and pronunciation.

[0938] "Environment provision means" refers to technologies that provide users with a virtual reality or augmented reality environment, enabling them to interact with it.

[0939] "Recording and plan provision means" refers to a system for recording a user's learning progress and designing an individualized learning plan.

[0940] "Emotion recognition and adjustment means" refers to a function that recognizes the user's emotions and adjusts the content and tone of responses based on those emotions.

[0941] To implement this invention, it is first necessary for data to be transmitted and received between the server and the terminal. When a user accesses the system from the terminal, the terminal selects its native language and the foreign language to be learned through a configuration means. This selection information is sent to the server, which retrieves appropriate language resources from a database. This database is intended to provide a language interface optimized for the user.

[0942] Next, the user's voice input is recorded by the device's microphone. Speech recognition technology is used to convert this into text data. Based on this text data, the server uses a generative AI model to generate an appropriate response according to the user's utterance. This generated response is translated in real time into the user's native language and the target language by translation and display means and displayed on the device.

[0943] Furthermore, emotion recognition and adjustment mechanisms analyze the user's tone of voice and facial expressions as emotions via the camera. This emotion data is sent to a server and reflected in the content and tone of the generated response. For example, when the user has a cheerful expression, the server provides more engaging content.

[0944] Furthermore, the recording and plan provisioning mechanisms integrate learning history and sentiment data to design a personalized learning plan for the next learning session. In doing so, they take into account how the user has reacted in the past and suggest material that is engaging or enhances learning efficiency.

[0945] For example, if a user is using this system during a museum tour, the terminal will detect their interests and emotions and provide relevant additional information in real time. An example of a prompt message would be, "Please tell us about the most interesting episode from a past museum tour. Use that to generate guide information that might interest visitors."

[0946] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0947] Step 1:

[0948] The terminal provides an interface for the user to set their native language and the language to be learned. The language information selected by the user is sent to the server as input data. The server receives this information, performs data calculations to retrieve appropriate language resources from the database, and provides the output to the terminal.

[0949] Step 2:

[0950] The user provides voice input through the device's microphone. The device sends this voice data to the server. The server uses speech recognition to convert the voice data into text data. This process involves data processing called speech waveform analysis, and the output is in text format.

[0951] Step 3:

[0952] The server uses a generative AI model to take text data converted from speech as input and generate an appropriate response. The generative AI model performs natural language processing, analyzes the user's intent, and outputs the appropriate response text.

[0953] Step 4:

[0954] The generated response is translated by the server into the user's native language and the target language. The translated data is output and returned to the terminal. The terminal displays this on the screen, providing it to the user visually.

[0955] Step 5:

[0956] The device uses its camera and microphone to analyze the user's facial expressions and voice tone based on emotion recognition and adjustment mechanisms. The input for this analysis is the user's voice and video data, and the output is the user's emotional state. This information is sent to a server, influencing the tone and content of the generated response.

[0957] Step 6:

[0958] The server integrates the user's past voice data, learning history, and emotional information through recording and plan provisioning mechanisms to personalize the next learning plan. This process involves data calculations based on past data to generate the learning plan. The generated plan is then sent to the terminal for the next learning session.

[0959] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.

[0960] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). One example of data generation model 58 is ChatGPT (Internet search<URL: https: / / openai.com / blog / chatgpt> ), Gemini (Internet search) <url: https: gemini.google.com ?hl="ja">Examples of generative AI include the following. The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0961] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0962] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping. Specifically, the emotion identification model 59 may determine the user's emotion according to a specific mapping, which is an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the identification processing unit 290 may perform identification processing using the robot's emotion.

[0963] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0964] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0965] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further you go from the outside of the Emotion Map 400, the more visible (expressed in actions) your emotions become.

[0966] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https: / / ci.nii.ac.jp / naid / 500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.

[0967] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0968] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.

[0969] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0970] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0971] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-temporary storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-temporary storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0972] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0973] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0974] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0975] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0976] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0977] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0978] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0979] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted as being incorporated by reference.

[0980] The following is further disclosed regarding the embodiments described above.

[0981] (Claim 1)

[0982] A setting means that allows selection of the native language and the language to be learned,

[0983] A means for analyzing the user's voice input and converting it into text data using a speech recognition means,

[0984] A generation means for creating an appropriate response based on text data,

[0985] Translation and display means for translating and displaying the generated response into the native language and the target language,

[0986] An analysis and feedback means that analyzes the grammar and pronunciation of the user's utterances and provides feedback,

[0987] A means of providing an environment that offers a virtual reality or augmented reality environment and allows users to interact with it,

[0988] A means for recording and providing personalized learning plans, which records the user's learning progress and provides individualized learning plans.

[0989] A system that includes this.

[0990] (Claim 2)

[0991] The system according to claim 1, further comprising a function to display responses generated in real time as subtitles.

[0992] (Claim 3)

[0993] The system according to claim 1, further comprising a function that provides a learning plan that personalizes the content of the next learning session based on the user's learning history.

[0994] "Example 1"

[0995] (Claim 1)

[0996] A setting means that allows selection of the native language and the language to be learned,

[0997] A means for analyzing user voice input and converting it into text data using speech recognition technology,

[0998] A generation means for creating an appropriate response based on text data,

[0999] Translation and display means for translating the generated response into the native language and the target language and displaying it on a visual display device,

[1000] An analysis and feedback means that analyzes the syntax and speech of the user's utterances and provides feedback,

[1001] A means of providing an environment that provides a virtual or extended environment and allows users to interact with it,

[1002] A means for recording and providing a user's learning progress and creating an individualized learning plan,

[1003] A system that includes this.

[1004] (Claim 2)

[1005] The system according to claim 1, further comprising a function to display responses generated in real time as subtitles.

[1006] (Claim 3)

[1007] The system according to claim 1, further comprising a function that provides a learning plan that personalizes the content of the next learning session based on the user's learning history.

[1008] "Application Example 1"

[1009] (Claim 1)

[1010] A setting means that allows selection of the native language and the language to be learned,

[1011] A means for analyzing the user's voice input and converting it into text data using a speech recognition means,

[1012] A generation means for creating an appropriate response based on text data,

[1013] Translation and display means for translating and displaying the generated response into the native language and the target language,

[1014] An analysis and feedback means that analyzes the grammar and pronunciation of the user's utterances and provides feedback,

[1015] A means of providing an environment that offers a virtual reality or augmented reality environment and allows users to interact with it,

[1016] A means for recording and providing personalized learning plans, which records the user's learning progress and provides individualized learning plans.

[1017] A means for controlling a humanoid automated device that supports voice interaction within the home and provides simulated conversation scenes,

[1018] A system that includes this.

[1019] (Claim 2)

[1020] The system according to claim 1, further comprising a function to display responses generated in real time as subtitles.

[1021] (Claim 3)

[1022] The system according to claim 1, further comprising a function that provides a learning plan that personalizes the content of the next learning session based on the user's learning history.

[1023] "Example 2 of combining an emotion engine"

[1024] (Claim 1)

[1025] A setting means that allows selection of the native language and the language to be learned,

[1026] A means for analyzing the user's voice input and converting it into text data using a speech recognition means,

[1027] A generation means for creating an appropriate response based on text data,

[1028] Translation and display means for translating and displaying the generated response into the native language and the target language,

[1029] An analysis and feedback means that analyzes the grammar and pronunciation of the user's utterances and provides feedback,

[1030] A means of providing an environment that offers a virtual reality or augmented reality environment and allows users to interact with it,

[1031] A means for recording and providing personalized learning plans, which records the user's learning progress and provides individualized learning plans.

[1032] An emotion recognition method that analyzes the tone of voice and facial expressions in the user's speech to recognize emotions,

[1033] A response adjustment means that adjusts the response generated based on emotion recognition,

[1034] A system that includes this.

[1035] (Claim 2)

[1036] The system according to claim 1, further comprising a function to display responses generated in real time as subtitles.

[1037] (Claim 3)

[1038] The system according to claim 1, further comprising a function for providing a learning plan that personalizes the content of the next learning session based on the user's learning history and emotional state.

[1039] "Application example 2 when combining with an emotional engine"

[1040] (Claim 1)

[1041] A setting means that allows selection of the native language and the language to be learned,

[1042] A means for analyzing the user's voice input and converting it into text data using a speech recognition means,

[1043] A generation means for creating an appropriate response based on text data,

[1044] Translation and display means for translating and displaying the generated response into the native language and the target language,

[1045] An analysis and feedback means that analyzes the grammar and pronunciation of the user's utterances and provides feedback,

[1046] A means of providing an environment that offers a virtual reality or augmented reality environment and allows users to interact with it,

[1047] A means for recording and providing personalized learning plans, which records the user's learning progress and provides individualized learning plans.

[1048] An emotion recognition and adjustment means for recognizing the user's emotions and adjusting the content of the response based on those emotions,

[1049] A system that includes this.

[1050] (Claim 2)

[1051] The system according to claim 1, further comprising a function to display responses generated in real time as subtitles.

[1052] (Claim 3)

[1053] The system according to claim 1, which incorporates user emotion data into a learning plan provision function that personalizes the content of the next learning session based on the user's learning history. [Explanation of Symbols]

[1054] 10, 210, 310, 410 Data Processing Systems 12 Data Processing Devices 14 Smart Devices 214 Smart Glasses 314 Headset-type terminal 414 Robots< / url:> < / url:> < / url:> < / url:>

Claims

1. A setting means that allows selection of the native language and the language to be learned, A means for analyzing the user's voice input and converting it into text data using a speech recognition means, A generation means for creating an appropriate response based on text data, Translation and display means for translating and displaying the generated response into the native language and the target language, An analysis and feedback means that analyzes the grammar and pronunciation of the user's utterances and provides feedback, A means of providing an environment that offers a virtual reality or augmented reality environment and allows users to interact with it, A means for recording and providing personalized learning plans, which records the user's learning progress and provides individualized learning plans. A system that includes this.

2. The system according to claim 1, further comprising a function to display responses generated in real time as subtitles.

3. The system according to claim 1, further comprising a function for providing a learning plan that personalizes the content of the next learning session based on the user's learning history.