system
The system addresses the challenge of visually impaired gamers by using image recognition, voice guides, and haptic feedback to facilitate gameplay interaction, enhancing the experience through real-time responsiveness and intuitive control.
Patent Information
- Authority / Receiving Office: JP · JP
- Patent Type: Applications
- Current Assignee / Owner: SOFTBANK GROUP CORP
- Filing Date: 2024-12-13
- Publication Date: 2026-06-25
AI Technical Summary
Many games are designed to rely on visual information, making it difficult for visually impaired individuals to play smoothly, and there is a need for a game environment that can be fully enjoyed without relying on vision, particularly in complex operations and dynamically changing screen situations.
The system uses image recognition to analyze visual information in real time, generates voice guides, analyzes voice input, and provides automatic operation and haptic feedback, so that visually impaired users can interact with games without relying on visual information.
The system enables visually impaired users to understand and control games through audio and haptic feedback, improving their gameplay experience by responding to rapidly changing game situations and reducing reliance on visual information.
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] Conventionally, when a user with visual impairment plays a game, there has been a problem that many games are designed to rely on visual information and it is difficult to play smoothly. For this reason, there has been a demand to provide a game environment that can be fully enjoyed by visually impaired people. In particular, in order to universally cope with complex operations of games and dynamically changing screen situations, an effective support technology using senses other than vision is required.
Means for Solving the Problems
[0005] To solve the aforementioned problems, the present invention provides image recognition means for analyzing visual information in real time via a user interface. It also provides voice guide generation means for generating voice guides based on the analyzed visual information, and voice input analysis means for receiving and analyzing voice instructions from the user and converting them into in-game actions. Furthermore, by providing automatic operation means for executing in-game actions based on voice instructions, it enables visually impaired users to operate and enjoy the game without relying on visual information. In addition, this system includes voice playback means for presenting the generated voice guides to the user, and further includes haptic feedback means for providing tactile feedback to support intuitive operation.
[0006] A "user interface" is a means of enabling users to interact directly with a system, and is an important interface for visually impaired individuals to perform input and output.
[0007] "Image recognition means" refers to technology that analyzes visual information and uses that information to identify situations and elements in a game (for example, characters and obstacles).
[0008] A "voice guide generation means" is a system that uses analyzed data to provide voice guidance on the actions a user should take.
[0009] A "voice input analysis means" is a means of receiving voice commands uttered by a user and converting them into commands that the system can understand.
[0010] "Automatic operation means" refers to a device or program that automatically performs specific actions within the game based on commands obtained by voice input analysis means.
[0011] "Voice playback means" refers to output devices such as speakers or headsets that present the generated voice guide to the user in an easily understandable format.
[0012] "Haptic feedback means" refers to devices and technologies that provide feedback through physical vibrations, pressure, etc., in response to user actions or game situations.
Brief Description of the Drawings
[0013]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.
Embodiments for Carrying Out the Invention
[0014] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.
[0015] First, the terms used in the following description will be explained.
[0016] In the following embodiments, a processor with a reference number (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.
[0017] In the following embodiments, a RAM (Random Access Memory) with a reference number is a memory in which information is temporarily stored and is used as a work memory by the processor.
[0018] In the following embodiments, a storage with a reference number is one or more non-volatile storage devices that store various programs and various parameters, etc. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), or magnetic tapes, and the like.
[0019] In the following embodiments, a communication interface (I/F) with a reference number is an interface that includes a communication processor, an antenna, and the like. The communication interface handles communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), or Bluetooth (registered trademark).
[0020] In the following embodiments, "A and / or B" is synonymous with "at least one of A and B." That is, "A and / or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and / or B" applies when expressing three or more things linked by "and / or."
[0021] [First Embodiment]
[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0030] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with the specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0034] This invention is a system that enables visually impaired individuals to enjoy games in the same way as sighted individuals, and it consists of multiple functional modules. The system operates through cooperation between the server and the terminal.
[0035] First, the server is equipped with image recognition capabilities to analyze real-time game screen data. These image recognition capabilities identify important elements within the game, such as characters, obstacles, and enemies, and digitize the position and status of each element. For example, in an action game, the server identifies the movement of enemies around the player character and determines which direction the user should focus their attention.
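One way the "direction the user should focus their attention" could be digitized is as a clock position, the phrasing that a later guide example in this description uses ("There is an enemy at 3 o'clock"). The following is a minimal sketch under stated assumptions, not the method of the disclosure: the function names are hypothetical, positions are assumed to be screen coordinates with y growing downward, and 12 o'clock is assumed to mean "up" on the screen.

```python
import math

def clock_direction(player_xy, target_xy) -> int:
    """Return the clock hour (1-12) of the target as seen from the player.

    Assumption: screen coordinates (y grows downward), 12 o'clock = "up" on screen.
    """
    dx = target_xy[0] - player_xy[0]
    dy = target_xy[1] - player_xy[1]
    # Angle measured clockwise from "up", normalized to [0, 360).
    angle = math.degrees(math.atan2(dx, -dy)) % 360.0
    hour = round(angle / 30.0) % 12   # each clock hour spans 30 degrees
    return 12 if hour == 0 else hour

def clock_guide(kind: str, player_xy, target_xy) -> str:
    """Build a short guide sentence such as "There is an enemy at 3 o'clock."."""
    article = "an" if kind[0].lower() in "aeiou" else "a"
    hour = clock_direction(player_xy, target_xy)
    return f"There is {article} {kind} at {hour} o'clock."
```

A clock position compresses a two-dimensional offset into one short spoken token, which may suit fast-moving scenes better than a longer description.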
[0036] Next, the server uses the voice guide generation mechanism to generate voice guides appropriate to each situation based on the acquired data. These voice guides contain important information for the user and serve as a guide for action. For example, when an enemy approaches, specific instructions such as "Move to the left" are generated using speech synthesis.
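The step from recognized elements to a guide sentence can be illustrated with a simple rule. This is a sketch only: the `Element` record, the `alert_radius` threshold, and the exact wording of the sentences are assumptions made for illustration, and the disclosure does not specify how the guide text is selected.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional

@dataclass
class Element:
    kind: str   # e.g. "player", "enemy", "obstacle" (hypothetical labels)
    x: float    # horizontal screen position, left edge = 0
    y: float    # vertical screen position, top edge = 0

def guide_for(player: Element, others: List[Element],
              alert_radius: float = 200.0) -> Optional[str]:
    """Return a guide sentence for the nearest enemy within alert_radius, else None."""
    enemies = [e for e in others if e.kind == "enemy"]
    if not enemies:
        return None
    nearest = min(enemies, key=lambda e: hypot(e.x - player.x, e.y - player.y))
    if hypot(nearest.x - player.x, nearest.y - player.y) > alert_radius:
        return None
    # Suggest moving away from the side the enemy is on.
    if nearest.x >= player.x:
        return "An enemy is approaching from the right. Move to the left."
    return "An enemy is approaching from the left. Move to the right."
```

The returned text would then be handed to whatever speech synthesis engine the server uses.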
[0037] This audio guide is transmitted to the terminal and provided to the user through the voice playback means. The user listens to the audio guide using a headset or similar device. Additionally, the haptic feedback means is activated as needed, providing intuitive information to the user through vibrations and other means. For example, when a specific action is required, the terminal uses vibration to signal the timing.
[0038] Furthermore, users input game commands by voice using voice input analysis via a terminal. When a user speaks instructions such as "attack" or "move right," the terminal recognizes them and sends the analysis results to the server. Based on this information, the server uses automated control mechanisms to immediately execute the corresponding action in the game.
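Converting recognized speech into an in-game action can be as simple as a phrase lookup once the audio has been transcribed. The sketch below assumes that a speech recognizer has already produced text; the `COMMANDS` table and the action identifiers are hypothetical, since a real game would define its own.

```python
from typing import Optional

# Hypothetical phrase-to-action table; action names are placeholders.
COMMANDS = {
    "move right": "MOVE_RIGHT",
    "move left": "MOVE_LEFT",
    "attack": "ATTACK",
    "jump": "JUMP",
}

def parse_command(recognized_text: str) -> Optional[str]:
    """Map recognized speech text to an action identifier, or None if nothing matches."""
    # Lowercase and collapse repeated whitespace before matching.
    text = " ".join(recognized_text.lower().split())
    # Check longer phrases first so a multi-word phrase wins over a shorter one.
    for phrase in sorted(COMMANDS, key=len, reverse=True):
        if phrase in text:
            return COMMANDS[phrase]
    return None
```

Returning `None` for unrecognized input lets the caller decide whether to ignore the utterance or ask the user to repeat it.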
[0039] This allows visually impaired users to understand the gameplay through audio and haptic feedback, without relying on visual information, and to have a direct and interactive experience. This system is particularly useful in situations where real-time responsiveness is required, such as in action games, as it can respond to rapidly changing game situations, thereby improving the user's gameplay experience.
[0040] The following describes the processing flow.
[0041] Step 1:
[0042] The server receives real-time screen data from the game. It captures the game's rendering frames and begins analysis using image recognition. Here, it identifies objects such as enemy characters, player characters, obstacles, and items, and extracts their positional information and movement patterns.
[0043] Step 2:
[0044] The server constructs an audio guide based on the analysis results. The audio guide generation system creates appropriate instructions according to the game situation and formats them as text data. For example, if an enemy is approaching from the right, it will create a guide that says, "An enemy is coming from the right. Please dodge." This text is then prepared to be converted into audio.
[0045] Step 3:
[0046] The server generates the audio guide and sends it to the terminal as a digital audio file. Here, text-to-speech technology is used to convert the text into audio data. The terminal receives this audio data and prepares to output the audio to the user.
[0047] Step 4:
[0048] The terminal presents the received audio guide to the user using the voice playback means. The user listens to the audio guide through a headset or speaker to understand the game situation. Simultaneously, the haptic feedback means generates vibrations to convey urgent information or prompt specific actions.
[0049] Step 5:
[0050] The user issues voice commands through the terminal. If the user commands a specific action, such as "jump" or "attack," the voice is recorded by the terminal.
[0051] Step 6:
[0052] The terminal analyzes the user's voice commands using voice input analysis technology. The analysis results are sent to the server as digital commands. This converts the voice content into specific game instructions.
[0053] Step 7:
[0054] Based on the received commands, the server sends instructions to the game engine via automated control mechanisms to execute corresponding actions. This allows the in-game characters to immediately perform actions such as "jump" or "attack."
[0055] Step 8:
[0056] The server and terminal work together, repeating these steps in real time, enabling users to continuously and intuitively control and enjoy the game.
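The eight steps above can be read as one cycle that is repeated continuously. The following sketch wires the stages together as injected callables so that each stage can be replaced independently. All parameter names are hypothetical, the `"urgent"` key in the scene dictionary is an assumption, and the split between server and terminal is omitted for brevity.

```python
from typing import Callable, Optional

def run_cycle(
    capture_frame: Callable[[], object],
    recognize: Callable[[object], dict],
    make_guide: Callable[[dict], Optional[str]],
    speak: Callable[[str], None],
    vibrate: Callable[[], None],
    listen: Callable[[], Optional[str]],
    execute: Callable[[str], None],
) -> None:
    """Run one pass of Steps 1 to 7; Step 8 corresponds to calling this repeatedly."""
    frame = capture_frame()        # Step 1: acquire screen data
    scene = recognize(frame)       # Step 1: image recognition
    guide = make_guide(scene)      # Step 2: build the guide text
    if guide is not None:
        speak(guide)               # Steps 3-4: synthesize and play the guide
        if scene.get("urgent"):    # assumed flag set by the recognizer
            vibrate()              # Step 4: haptic cue for urgent situations
    command = listen()             # Steps 5-6: capture and analyze a voice command
    if command is not None:
        execute(command)           # Step 7: perform the in-game action
```

Passing the stages in as functions also makes the cycle easy to exercise with stubs before any real recognizer or game engine is attached.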
[0057] (Example 1)
[0058] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0059] The challenge is to provide an environment where visually impaired individuals can enjoy games that require complex real-time interaction without relying on visual information. It is necessary to establish effective means for users who have difficulty directly obtaining visual information to grasp the game situation and take appropriate actions.
[0060] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0061] In this invention, the server includes data analysis means, information conversion means, and input analysis means. This converts visual information into audio and haptic feedback, enabling the user to play the game in real time.
[0062] "Data analysis means" refers to a device or software that has the function of acquiring visual information in real time, analyzing that information, and converting it into meaningful data.
[0063] An "information conversion means" is a device or software that generates information to be conveyed to the user based on analyzed data and provides it primarily in the form of sound or haptic feedback.
[0064] "Input analysis means" refers to a device or software that receives voice instructions from a user, analyzes their content, and converts them into a format suitable for control within the system.
[0065] "Control means" refers to a device or software that has the function of performing a specified action based on instructions from the user.
[0066] "Playback means" refers to a device or software that actually outputs the generated audio guide as audio.
[0067] A "feedback mechanism" is a device or software that has the function of providing feedback to the user through touch, conveying intuitive information to the user through vibrations or other means.
[0068] This invention is a system that enables visually impaired individuals to enjoy games without relying on their vision. The server and terminal work together to provide a comfortable interactive environment for the user.
[0069] The server acquires real-time screen data from the game and analyzes it using data analysis tools. Specifically, it uses image recognition software (e.g., a general-purpose library) to identify elements in the game such as characters, obstacles, and enemies. The obtained information is converted into audio guidance by an information conversion tool, and speech synthesis software such as Google® Text-to-Speech is used to generate instructions for the user. For example, if a character is in danger, it might create an audio guide saying, "Move to the left to avoid danger."
[0070] This audio guide is transmitted to the terminal and provided to the user through the terminal's playback means. The user listens to it through an audio device such as a headset. The terminal also has a feedback mechanism that vibrates according to instructions from the server to communicate in-game timing and actions to the user. The terminal vibrates when a specific enemy approaches, allowing the user to intuitively understand which direction to focus their attention.
[0071] Users can give instructions to the terminal using voice commands. These voice commands are analyzed by the input analysis means and converted into text data. This text is then sent to the server, where the required in-game action is executed through the automatic control function. For example, if a user says "jump," the server immediately reflects that command in the game.
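How vibration could convey direction is not detailed in this description. One possibility, sketched here purely as an assumption, is a terminal with separate left and right vibration motors whose intensity grows as the enemy gets closer. The function name, the two-motor layout, and the `max_distance` threshold are all hypothetical.

```python
from typing import Tuple

def vibration_levels(dx: float, distance: float,
                     max_distance: float = 300.0) -> Tuple[float, float]:
    """Return (left, right) motor intensities in [0, 1] for an enemy at horizontal offset dx.

    dx < 0 means the enemy is on the user's left. Intensity rises linearly as the
    enemy approaches and is zero at or beyond max_distance.
    """
    if distance >= max_distance:
        return (0.0, 0.0)
    strength = 1.0 - max(distance, 0.0) / max_distance
    if dx < 0:
        return (strength, 0.0)
    if dx > 0:
        return (0.0, strength)
    return (strength, strength)   # no horizontal offset: drive both motors
```

On hardware with a single motor, the same idea could be approximated with distinct pulse patterns for left and right.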
[0072] This system allows users to enjoy games in real time. The generative AI model generates prompts tailored to user needs, improving operational efficiency and the user experience. An example of a prompt might be, "Designing a real-time audio-guided game system for the visually impaired."
[0073] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0074] Step 1:
[0075] The server acquires screen data from the game in real time. The input is screen information received through the game's API, and the output is image data that can be analyzed. The server receives this image data for use in the next step and stores it in a processing queue.
[0076] Step 2:
[0077] The server uses data analysis tools to analyze the acquired image data. The input is the image data acquired in step 1, and the output is elemental information such as characters, obstacles, and enemies in the game. An image recognition library (e.g., a general-purpose library) is used to identify each element and extract its position and movement as numerical data.
[0078] Step 3:
[0079] Based on the analysis results, the server generates an audio guide using an information conversion mechanism. The input is the elemental information obtained in step 2, and the output is the audio guide text provided to the user. The server uses speech synthesis software such as Google Text-to-Speech to generate specific instructions in audio format, such as "You are approaching an obstacle. Move to the left."
[0080] Step 4:
[0081] The terminal receives audio guidance transmitted from the server and provides it to the user via a playback mechanism. The input is audio guidance data from the server, and the output is the audio transmitted to the user. Here, the terminal plays the audio through the headset and, if necessary, uses its vibration function to convey information as haptic feedback.
[0082] Step 5:
[0083] The user inputs commands into the terminal by voice. The input is the user's voice instructions, and the output is command data in text format. The terminal analyzes this voice using the input analysis means and prepares to send the result to the server. For example, a command such as "jump" is extracted from the voice.
[0084] Step 6:
[0085] The server receives the voice analysis results and executes in-game actions using its automated control function. The input is the text command data from step 5, and the output is the execution of the action in the game. The server processes this in real time and sends instructions to make the avatar move according to the user's commands.
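Step 1 of this flow stores incoming image data in a processing queue. The description does not say what happens when analysis falls behind the frame rate; one common choice for real-time work, shown here as an assumption, is a bounded queue that discards the oldest frame when full. The class name and default size are hypothetical.

```python
from collections import deque

class FrameQueue:
    """Bounded FIFO that discards the oldest frame when full.

    Assumption: stale frames are worthless for real-time guidance, so dropping
    them is preferable to letting the analysis lag behind the game.
    """

    def __init__(self, max_frames: int = 3) -> None:
        self._frames = deque(maxlen=max_frames)

    def put(self, frame) -> None:
        # A deque with maxlen drops the item at the opposite end when full.
        self._frames.append(frame)

    def get(self):
        """Return the oldest frame still queued, or None if the queue is empty."""
        return self._frames.popleft() if self._frames else None
```

For example, with `max_frames=2`, putting three frames in a row leaves only the second and third available to the analysis step.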
[0086] (Application Example 1)
[0087] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0088] There is a need to provide means of support that enable visually impaired people to move more safely and efficiently in their daily lives. In particular, systems that can respond quickly to changes in the environment and obstacles are necessary.
[0089] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0090] In this invention, the server includes an image analysis means for analyzing environmental information in real time via a user interface, a voice instruction generation means for generating voice instructions based on the analyzed visual information, and a voice analysis means for receiving, analyzing, and converting voice instructions from the user into action commands. This makes it possible for visually impaired people to understand their surroundings and select appropriate actions.
[0091] A "user interface" is a means for a user to interact with a system, and is a physical or logical component for inputting and outputting information.
[0092] "Environmental information" refers to information that influences user behavior, such as the surrounding environment, the location of obstacles, and the movement path.
[0093] "Real-time analysis" refers to a processing method that immediately handles the ongoing situation and makes the results immediately available.
[0094] "Image analysis means" refers to technical means for analyzing visual data obtained from cameras and sensors and extracting important information.
[0095] "Voice instruction generation means" refers to a technical means for generating voice messages that instruct the user to take action based on analyzed data.
[0096] "Voice analysis means" refers to a technical means for analyzing voice commands received from a user and understanding their content.
[0097] "Automatic control means" refers to technical means that automatically manage and control the operation of a system or device based on user input and environmental information.
[0098] The system necessary to implement this invention primarily involves the user, server, and terminal each playing a specific role. The server uses a 360-degree camera and a LiDAR sensor to capture information about the surrounding environment in real time. This allows it to acquire visual information about the environment in which the user is located, and analyze this data using OpenCV. Based on the information obtained from the analysis, the voice instruction generation means generates voice instructions using a TTS (Text-to-Speech) engine.
[0099] The terminal is a device for presenting these generated voice instructions to the user. The terminal is equipped with sound playback means, which are used to clearly convey the generated voice instructions to the user. Therefore, the user can receive the information immediately. In addition, the terminal has haptic feedback means, which can provide tactile information to the user using a vibration motor or similar device.
[0100] Users can input voice commands through a terminal. A voice analysis system then analyzes the user's instructions and converts them into appropriate action commands. These commands are sent to a server, and an automated control system executes the appropriate actions based on that information. This enables visually impaired individuals to understand their environment and move quickly and safely.
[0101] A concrete example is a scenario where a visually impaired person needs to walk safely in a city. When the server scans the surroundings and detects that the sidewalk is under construction, the voice instruction generation system generates specific instructions such as "Take the narrow path to the right," helping the user to safely detour. In this way, information can be provided immediately, prompting action.
[0102] Examples of prompts include, "Please have the robot guide you to the optimal detour route in real time based on visual information," and "Please have the robot perform the necessary actions based on the user's voice commands." This allows for maximum utilization of the system's capabilities.
[0103] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0104] Step 1:
[0105] The server continuously captures information about the surrounding environment using a 360-degree camera and LiDAR sensors. This input data is sent to the server as visual data. The server uses OpenCV to analyze this data and identify important elements such as pedestrians, obstacles, and direction of movement. The analysis yields data about the position and movement of each element in the environment.
[0106] Step 2:
[0107] The server generates voice instructions using a TTS engine based on the analyzed visual information. For example, if it detects an obstacle ahead, it will generate a voice instruction such as "Move to the right." The input is the server's analyzed data, and the output is voice data presented to the user.
[0108] Step 3:
[0109] The terminal receives audio data transmitted from the server and uses an audio playback device to present voice instructions to the user. In this case, the terminal's role is to convey the voice instructions to the user in real time. The output is the voice instructions that the user actually hears.
[0110] Step 4:
[0111] The terminal simultaneously utilizes haptic feedback mechanisms, such as vibration motors, to provide tactile information to the user. This allows the terminal to warn the user of situations requiring particular attention. The input is a control command received from the server, and the output is the vibration action performed by the terminal.
[0112] Step 5:
[0113] The user makes decisions and chooses actions based on the voice instructions, and issues voice commands to the terminal as needed. These voice commands become the terminal's input.
[0114] Step 6:
[0115] The terminal receives voice input from the user and interprets its content using voice analysis tools. The analysis results are sent to the server. The input is the user's voice command, and the output is communication data to the server.
[0116] Step 7:
[0117] The server updates the system's operation and instructions using automated control mechanisms based on the analysis results. This adjusts the overall system behavior according to user requests. The input is voice analysis data from the terminal, and the output is the updated control commands.
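The decision in Steps 1 and 2 of this flow, from range data to an instruction such as "Move to the right," can be sketched with a simple rule over LiDAR-style readings. This is an illustration under assumptions only: the `(bearing, distance)` input format, the thresholds, and the function name are hypothetical, and real sensor data would first need filtering and fusion with the camera analysis.

```python
from typing import Iterable, Optional, Tuple

def detour_instruction(
    readings: Iterable[Tuple[float, float]],
    stop_distance_m: float = 2.0,
    front_half_angle_deg: float = 20.0,
) -> Optional[str]:
    """Suggest a side to move toward when something blocks the path ahead.

    readings: (bearing_deg, distance_m) pairs; bearing 0 is straight ahead and
    positive bearings are to the right. Returns None when the path is clear.
    """
    readings = list(readings)
    blocked = any(
        abs(b) <= front_half_angle_deg and d < stop_distance_m for b, d in readings
    )
    if not blocked:
        return None
    # Nearest return on each side outside the frontal cone; more clearance wins.
    left = min((d for b, d in readings if b < -front_half_angle_deg), default=float("inf"))
    right = min((d for b, d in readings if b > front_half_angle_deg), default=float("inf"))
    return "Move to the right." if right >= left else "Move to the left."
```

The returned sentence would be passed to the TTS engine, and the same decision could drive the vibration command sent to the terminal in Step 4.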
[0118] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform specific processing that uses the estimated emotions.
[0119] This invention provides a system that enables visually impaired users to enjoy games in an immersive way, and in particular, by combining it with an emotion engine, it provides adaptive interaction that takes the user's emotions into consideration.
[0120] The system operates primarily through a server and a terminal. First, the server acquires real-time screen data from the game, analyzes it using image recognition, and identifies important elements within the game. Based on this analysis, an audio guide generation system creates situation-appropriate audio guides. The generated guides are converted from text to audio and sent from the server to the terminal.
[0121] The terminal plays the received audio guide to the user through the audio playback means, so the user can receive information via a headset or speaker. The terminal also uses the haptic feedback means to give feedback tailored to specific game situations. For example, if an enemy is approaching, the terminal vibrates to warn the user.
[0122] Furthermore, the system incorporates an emotion engine, allowing the terminal to recognize emotions in real time from the user's facial expressions and tone of voice. The recognized emotion data is sent to the server, which adjusts the content of the voice guide and the settings of the haptic feedback based on that information. For example, if the user is feeling stressed, the tone of the voice guide can be made calmer, or the feedback can be made gentler.
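As a non-limiting illustration, the emotion-based adjustment described above may be sketched as follows. The class `FeedbackSettings`, the function `adjust_for_emotion`, the emotion label "stressed", and the numeric factors are all hypothetical and are not specified in the present disclosure.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FeedbackSettings:
    """Hypothetical settings the server could send to the terminal."""
    speech_rate: float       # 1.0 = normal speaking rate
    voice_style: str         # label handed to a speech synthesizer
    haptic_intensity: float  # 0.0 (off) to 1.0 (strongest)


def adjust_for_emotion(settings: FeedbackSettings, emotion: str) -> FeedbackSettings:
    """Return settings adapted to the estimated emotion.

    Only the "stressed" label is handled here; the factors 0.9 and 0.5
    are illustrative assumptions, not values from the disclosure.
    """
    if emotion == "stressed":
        return replace(
            settings,
            speech_rate=min(settings.speech_rate, 0.9),  # slightly slower speech
            voice_style="calm",                          # calmer tone
            haptic_intensity=round(settings.haptic_intensity * 0.5, 2),  # gentler vibration
        )
    return settings  # other emotions leave the settings unchanged


default = FeedbackSettings(speech_rate=1.0, voice_style="neutral", haptic_intensity=0.8)
calmer = adjust_for_emotion(default, "stressed")
```

In this sketch the settings object is immutable, so the server can keep the default values and derive a new adjusted copy each time the estimated emotion changes.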
[0123] Users can give voice commands for in-game actions, and a voice input analysis system analyzes these commands and sends them to the server. The server then uses an automated system to execute the in-game actions based on this information. This enables interaction that takes user emotions into account, resulting in comfortable and intuitive gameplay. The system iterates through these processes in real time, responding flexibly to user emotions and game progress.
[0124] The following describes the processing flow.
[0125] Step 1:
[0126] The server captures real-time screen data from the game and begins analysis through image recognition. This process identifies characters, obstacles, background information, etc., within the game, and based on this, acquires location information and situational data.
[0127] Step 2:
[0128] The server activates the audio guide generation system based on the analyzed data and generates the necessary audio guides as text. These guides include practical instructions and feedback for the player. For example, instructions such as "There is an enemy at 3 o'clock" might be created.
[0129] Step 3:
[0130] The server converts the generated audio guide into audio data and sends it to the terminal as a digital audio file. The file is formatted for playback on the terminal.
[0131] Step 4:
[0132] The terminal provides the user with received audio data using an audio playback device. The user receives this data through a headset and can obtain verbal instructions. Simultaneously, haptic feedback devices provide additional information through vibrations as needed.
[0133] Step 5:
[0134] The terminal activates the emotion engine and analyzes the user's facial expressions and tone of voice to recognize the user's emotional state in real time. This information is used to further improve the user's gameplay experience.
[0135] Step 6:
[0136] The recognized emotion data is sent to a server, which then adjusts the tone and content of the voice guide based on that information. It also changes the intensity and pattern of haptic feedback as needed, customizing the experience to match the user's emotions.
[0137] Step 7:
[0138] The user inputs voice commands into the terminal. For example, instructions such as "jump" or "attack" are spoken aloud. The terminal receives the voice and converts it into text using a voice input analysis device.
[0139] Step 8:
[0140] The terminal sends the analyzed voice commands to the server, and upon receiving this information, the server uses automated control mechanisms to execute the corresponding actions within the game. The server's instructions enable the character's specific movements to be realized in real time.
[0141] Step 9:
[0142] By repeating these steps in sequence, users can intuitively control and enjoy the game while responding to emotional changes. An adaptive system based on user feedback ensures a comfortable gameplay experience.
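As a non-limiting illustration of Step 2 above, a guide such as "There is an enemy at 3 o'clock" may be derived from positions obtained in Step 1. The sketch below assumes screen coordinates (x increasing to the right, y increasing downward) and assumes that 12 o'clock corresponds to the top of the screen. The function names `clock_direction` and `describe_enemy` are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def clock_direction(player: Point, target: Point) -> int:
    """Return the clock position (1 to 12) of target as seen from player.

    Assumes screen coordinates and that "up" on the screen is 12 o'clock.
    """
    dx = target[0] - player[0]
    dy = target[1] - player[1]
    # Angle measured clockwise from "up"; -dy flips the downward y axis.
    angle = math.degrees(math.atan2(dx, -dy)) % 360
    hour = round(angle / 30) % 12  # each clock hour spans 30 degrees
    return 12 if hour == 0 else hour


def describe_enemy(player: Point, enemy: Point) -> str:
    """Build the guide text that would then be passed to speech synthesis."""
    return f"There is an enemy at {clock_direction(player, enemy)} o'clock"


message = describe_enemy((100, 100), (200, 100))
```

A game with a rotating viewpoint would additionally need the player's facing direction, which this sketch does not model.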
[0143] (Example 2)
[0144] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0145] It is difficult for visually impaired users to effectively enjoy computer games that typically rely on visual interaction. In particular, they require interfaces that appropriately adjust according to the game's progress and the player's emotional state, but no existing systems can achieve this. Therefore, there is a need for a system that considers the user's emotions, provides real-time in-game information, and allows the user to operate it intuitively.
[0146] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0147] In this invention, the server includes an image recognition means for analyzing visual information in real time, a means for generating audio guides and performing emotion recognition based on the analyzed visual information, and a means for receiving and analyzing voice instructions from the user. This allows visually impaired users to immerse themselves in the game while reducing emotional burden, and enables the user interface to flexibly adapt to the user's instantaneous emotions and instructions.
[0148] A "user interface" refers to the means or methods by which a user interacts with a system or application.
[0149] "Visual information" refers to all data that can be represented as images or videos, and this includes characters, backgrounds, and actions within a game.
[0150] "Real-time" refers to processing or responding within the actual time frame, with virtually no delay.
[0151] "Image recognition means" refers to technology that detects specific features from digital images and videos and identifies them.
[0152] "Voice guide generation means" refers to the technology and equipment used to generate necessary voice guidance based on analyzed information.
[0153] "Emotion recognition means" refers to methods and technologies for identifying an emotional state by analyzing a user's facial expressions, tone of voice, and reactions.
[0154] "Voice input analysis means" refers to technology that receives voice instructions from a user, analyzes the content, and understands its meaning.
[0155] "Automated operation means" refers to devices or technologies that automatically perform certain procedures or operations based on input instructions.
[0156] "Audio playback means" refers to the technology or device used to play back generated audio to the user.
[0157] "Haptic feedback means" refers to technologies that provide users with tactile responses and stimuli, complementing the interaction experience.
[0158] This invention provides a system that enables visually impaired users to comfortably enjoy interactive computer games. Specific embodiments of this system are described below.
[0159] The system's core consists of servers and terminals.
[0160] First, the server acquires real-time screen data from the game and processes this visual information with image recognition algorithms (e.g., OpenCV, TensorFlow®). The server analyzes the image using the image recognition means and identifies important elements within the game. Based on this analysis information, the server uses the audio guide generation means to generate situational audio guides. Speech synthesis technology (e.g., Google Cloud Text-to-Speech) is used to generate the audio.
[0161] The generated audio guide is converted from text to audio and sent from the server to the terminal. The terminal then provides this to the user through an audio playback device. The audio is played back through a headset or speaker. The terminal also uses haptic feedback to provide feedback to the user according to specific game situations. For example, if an enemy is approaching, the terminal uses vibration to warn the user.
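As a non-limiting illustration of the vibration warning described above, the vibration strength may be made to depend on how close the enemy is. The function `vibration_intensity`, the linear mapping, and the default warning radius of 300 are assumptions introduced for this sketch only.

```python
def vibration_intensity(distance: float, warn_radius: float = 300.0) -> float:
    """Map the distance to an enemy onto a vibration strength in [0.0, 1.0].

    Enemies at or beyond warn_radius produce no vibration; the strength
    grows linearly as the enemy gets closer. Units are assumed to be pixels.
    """
    if warn_radius <= 0:
        raise ValueError("warn_radius must be positive")
    if distance >= warn_radius:
        return 0.0
    # Negative distances are treated as zero (enemy on top of the player).
    return round(1.0 - max(distance, 0.0) / warn_radius, 2)


half_strength = vibration_intensity(150)
```

The server could send the resulting value to the terminal together with the audio guide, and the terminal could drive its vibration motor accordingly.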
[0162] Furthermore, the system incorporates an emotion engine, allowing the terminal to recognize emotions in real time from the user's facial expressions and voice tone. This recognition utilizes facial recognition and voice analysis technologies (e.g., Face API, IBM Watson® Tone Analyzer). The recognized emotion data is sent to the server, which uses this information to adaptively adjust the tone of the voice guide and the haptic feedback.
[0163] For example, if a user is feeling stressed, the tone of the voice guidance can be softened, and the haptic feedback can be adjusted to be more gradual. Users can also give voice commands for in-game actions. These voice commands are analyzed by the terminal's voice input analysis system, and the instructions are sent to the server. The server then uses automated control systems to execute the in-game actions based on these commands.
[0164] A concrete example of a prompt message could be, "Generate an audio guide that matches the user's current emotional state to reduce the stress they experience." Following this prompt, the generating AI model provides an interface suitable for the user. As a result, implementing this system makes it possible for visually impaired users to enjoy an immersive gaming experience.
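As a non-limiting illustration, the prompt given above may be combined with the estimated emotion and the recognized game situation before being input to the generative AI model. The function `build_guide_prompt` and the field labels in the prompt are hypothetical; the disclosure does not specify a prompt layout.

```python
def build_guide_prompt(emotion: str, scene_summary: str) -> str:
    """Assemble a text prompt from the example instruction and context data.

    The instruction sentence is the example from the description; the
    "Estimated emotion" and "Game situation" labels are assumptions.
    """
    instruction = (
        "Generate an audio guide that matches the user's current emotional "
        "state to reduce the stress they experience."
    )
    return (
        f"{instruction}\n"
        f"Estimated emotion: {emotion}\n"
        f"Game situation: {scene_summary}"
    )


prompt = build_guide_prompt("stressed", "an enemy is approaching from the right")
```

The resulting string would then be passed to whichever generative AI model the server uses; that call is omitted here because it depends on the chosen service.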
[0165] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0166] Step 1:
[0167] The server takes real-time game screen data as input. This input data is analyzed by image recognition algorithms (e.g., OpenCV, TensorFlow) to identify important elements within the game (such as enemy characters and items). The output is a list of the identified elements and their locations. This information is then used in the next step within the server.
[0168] Step 2:
[0169] The server uses the analysis results obtained in Step 1 as input to create an audio guide using an audio guide generation system. Specifically, it generates instructions and explanations in text format that correspond to the in-game situation, and then converts this into speech using speech synthesis technology (e.g., Google Cloud Text-to-Speech). The output is the generated audio data, which is then ready to be presented to the user.
[0170] Step 3:
[0171] The server processes the generated audio data for transmission to the terminal. The audio data is transmitted in real time over a stable network. The input is the audio data generated in step 2, and the output is the audio communication data delivered to the terminal.
[0172] Step 4:
[0173] The terminal receives audio data from the server as input and provides audio guidance to the user via an audio playback device (e.g., a headset or speaker). Specifically, the terminal triggers the start of audio playback and automatically adjusts the volume and sound quality as needed. The output is the audio guidance that the user hears.
[0174] Step 5:
[0175] The terminal uses vibration motors and other mechanisms to provide haptic feedback to the user, depending on the game situation. For example, it vibrates when an enemy approaches. The input is game situation information from the server, and the output is the haptic stimulus felt by the user.
[0176] Step 6:
[0177] The terminal uses the user's face and voice as input and analyzes their emotions in real time using emotion recognition technology. Specifically, it collects data using a face recognition camera and microphone, and identifies the user's emotional state using an analysis algorithm (e.g., Face API, voice tone analysis technology). The output is the user's emotion data, which is used in the next step.
[0178] Step 7:
[0179] Based on the emotional data obtained in step 6, the server dynamically adjusts the content of the voice guidance and haptic feedback. For example, if the user is stressed, the tone of the voice guidance may be softened or the vibrations may be reduced. The input is the user's emotional data, and the output is the adjusted interface settings.
[0180] Step 8:
[0181] The user controls in-game actions by inputting voice commands. The input voice commands are analyzed by the terminal's voice input analysis system and sent to the server. Based on this information, the server uses an automated control system to execute in-game actions. The output is the action actually performed in the game based on the user's instructions.
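As a non-limiting illustration, one pass through the flow of Steps 1 to 8 may be sketched as a single function that receives each means as a callable. The function `run_cycle` and the stand-in lambdas are hypothetical; transmission, audio playback, and haptic output (Steps 3 to 5) are omitted, and real recognition, synthesis, and emotion models would replace the stand-ins.

```python
from typing import Callable, Dict, List, Optional

Element = Dict[str, str]


def run_cycle(
    frame: bytes,
    recognize: Callable[[bytes], List[Element]],
    make_guide: Callable[[List[Element], str], str],
    estimate_emotion: Callable[[], str],
    read_command: Callable[[], Optional[str]],
    execute: Callable[[str], None],
) -> str:
    """Run one iteration of the flow and return the guide text."""
    elements = recognize(frame)            # Step 1: image recognition
    emotion = estimate_emotion()           # Step 6: emotion recognition
    guide = make_guide(elements, emotion)  # Steps 2 and 7: emotion-adjusted guide
    command = read_command()               # Step 8: analyzed voice command, if any
    if command is not None:
        execute(command)                   # Step 8: automated in-game action
    return guide


# Stand-in components, used only to show how the pieces connect.
executed: List[str] = []
guide = run_cycle(
    frame=b"",
    recognize=lambda _frame: [{"kind": "enemy", "side": "right"}],
    make_guide=lambda elements, emotion: (
        ("Stay calm. " if emotion == "stressed" else "")
        + f"Enemy on the {elements[0]['side']}."
    ),
    estimate_emotion=lambda: "stressed",
    read_command=lambda: "jump",
    execute=executed.append,
)
```

Passing the means in as callables keeps the loop independent of any particular image recognition or speech library, which matches the way the description lists such libraries only as examples.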
[0182] (Application Example 2)
[0183] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0184] For visually impaired users to enjoy games and entertainment with a sense of immersion, there is a need for technology that goes beyond simply supplementing audiovisual information and provides rich, interactive experiences that take into account the user's emotional state. However, current technologies do not make real-time adjustments in response to the user's emotions, resulting in a limited user experience.
[0185] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0186] In this invention, the server includes image recognition means for analyzing visual information in real time via a user interface, voice guide generation means for generating voice guides based on the analyzed visual information, and emotion analysis means for analyzing the user's emotional state and adjusting the voice guides and feedback based on the analysis results. This makes it possible to provide appropriate voice guides and haptic feedback according to the user's emotional state, realizing an immersive and intuitive interaction.
[0187] A "user interface" is a mechanism that provides a point of contact for users to interact with a system and exchange information.
[0188] "Visual information" refers to data related to vision, such as images and videos.
[0189] "Image recognition means" refers to technologies for automatically identifying specific features or objects from visual information.
[0190] A "voice guide generation means" is a method for providing instructions and guidance to a user through voice, based on analyzed visual information.
[0191] "Emotion analysis means" refers to technologies that analyze a user's facial expressions and voice to estimate the user's emotional state.
[0192] "Voice input analysis means" refers to technology that analyzes voice commands received from a user and converts them into an appropriate digital format.
[0193] An "automatic operation means" is a mechanism for automatically performing specific actions or operations based on voice instructions or other input information.
[0194] "Audio playback means" refers to devices or technologies that allow users to listen to the generated audio guide.
[0195] A "haptic feedback means" is a method of conveying information to a user through physical tactile stimuli.
[0196] This invention is a system for visually impaired users to comfortably enjoy games and entertainment. The system consists of a server and terminals, which exchange information with each other to provide the user with an immersive experience. Specifically, the invention will be implemented in the following form.
[0197] Server role
[0198] The server acquires visual information from external sources via the user interface and analyzes this information using image recognition. Based on this analysis, the voice guide generation system creates a situation-appropriate voice guide and sends it to the terminal. Furthermore, the server is equipped with emotion analysis capabilities, which analyze the user's voice and video sent from the terminal to understand the user's emotions. Based on this emotion data, it is possible to adjust the content of the voice guide and haptic feedback.
[0199] Terminal role
[0200] The terminal is equipped with audio playback means and haptic feedback means, and plays the audio guides sent from the server to the user. The audio guides are produced by converting text to speech and deliver information to the user in real time. Furthermore, the terminal can provide feedback through the haptic feedback means depending on specific game situations. For example, it can issue a warning using vibration when an enemy approaches.
[0201] User interaction
[0202] Users can speak instructions to the terminal, which processes them with the voice input analysis means. These instructions are sent to the server, and the automatic operation means executes the appropriate in-game actions. This allows users to experience flexible and intuitive interactions that respond to their emotional state.
[0203] Specific example
[0204] For example, if a user says to their device, "I want to play a fictional adventure game," the server receives this instruction and starts a suitable game. Also, if the user shows signs of anxiety, the server uses emotion analysis to detect this and sends a voice guide such as, "Take a deep breath and enjoy this scene."
[0205] Example of a prompt
[0206] "Please generate voice messages to provide emotionally resonant and gentle advice to visually impaired users."
[0207] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0208] Step 1:
[0209] The server acquires visual information from an external source. The acquired data is sent to an image recognition system to identify specific features or objects. The input is visual information, and the output is the identified in-game elements.
[0210] Step 2:
[0211] The server generates an audio guide based on the information identified in step 1. Using the audio guide generation means, it creates a guide in text format corresponding to the identified elements. This text data is converted into audio, and the generated audio is sent to the terminal. The input is the in-game elements, and the output is the audio guide.
[0212] Step 3:
[0213] The terminal provides the user with audio guides transmitted from the server via an audio playback device. The audio is played back through a speaker or headset, conveying information to the user in real time. The input is the audio guide, and the output is audio information.
[0214] Step 4:
[0215] The terminal activates haptic feedback mechanisms based on instructions from the server. This provides feedback to the user by generating vibrations or pressures appropriate to specific game situations. The input is a situation-specific feedback command, and the output is haptic information.
[0216] Step 5:
[0217] The user sends voice instructions to the terminal. Through a voice input analysis system, the voice data is converted to text and sent to the server. The input is the user's voice, and the output is the textualized instructions.
[0218] Step 6:
[0219] The server analyzes the user's instructions and automatically executes the corresponding in-game actions. This process ensures that user instructions are quickly reflected in the game. Input is text instructions, and output is game actions.
[0220] Step 7:
[0221] The server analyzes user emotion data acquired from the terminal. Using emotion analysis tools, it infers the user's emotions from voice and facial expression data and adjusts the content of voice guidance and haptic feedback accordingly. The input is emotion data, and the output is the adjusted guidance and feedback.
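As a non-limiting illustration of Steps 5 and 6 above, the text obtained from the user's voice may be mapped to an in-game action by a simple lookup. The table `COMMANDS`, the action identifiers, and the function `parse_command` are hypothetical; speech-to-text conversion itself is assumed to have already taken place.

```python
from typing import Optional

# Hypothetical mapping from recognized phrases to game action identifiers.
COMMANDS = {
    "jump": "JUMP",
    "attack": "ATTACK",
    "move left": "MOVE_LEFT",
    "move right": "MOVE_RIGHT",
}


def parse_command(text: str) -> Optional[str]:
    """Return the action for a recognized phrase, or None if it is unknown.

    The text is lowercased, inner whitespace is collapsed, and surrounding
    spaces and sentence punctuation are removed before the lookup.
    """
    normalized = " ".join(text.lower().split()).strip(" .!?")
    return COMMANDS.get(normalized)


action = parse_command("Jump!")
```

Returning `None` for unknown phrases lets the server ignore the input or ask the user to repeat it, rather than executing an unintended action.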
[0222] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0223] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0224] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0225] [Second Embodiment]
[0226] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0227] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0228] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0229] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0230] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0231] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0232] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0233] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0234] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0235] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0236] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0237] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0238] This invention is a system that enables visually impaired individuals to enjoy games in the same way as sighted individuals, and it consists of multiple functional modules. This system operates in cooperation with a server and terminals.
[0239] First, the server is equipped with image recognition capabilities to analyze real-time game screen data. These image recognition capabilities identify important elements within the game, such as characters, obstacles, and enemies, and digitize the position and status of each element. For example, in an action game, the server identifies the movement of enemies around the player character and determines which direction the user should focus their attention.
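As a non-limiting illustration, once the position of each element has been digitized, the direction on which the user should focus may be chosen from the enemy closest to the player character. The class `Element` and the functions `nearest_enemy` and `attention_side` are hypothetical, and screen coordinates with x increasing to the right are assumed.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """A recognized in-game element with a digitized screen position."""
    kind: str  # e.g. "player", "enemy", "obstacle"
    x: float
    y: float


def nearest_enemy(player: Element, elements: List[Element]) -> Optional[Element]:
    """Return the enemy closest to the player, or None if there is none."""
    enemies = [e for e in elements if e.kind == "enemy"]
    if not enemies:
        return None
    return min(enemies, key=lambda e: math.hypot(e.x - player.x, e.y - player.y))


def attention_side(player: Element, elements: List[Element]) -> Optional[str]:
    """Return "left" or "right" depending on where the nearest enemy is."""
    enemy = nearest_enemy(player, elements)
    if enemy is None:
        return None
    return "right" if enemy.x >= player.x else "left"


player = Element("player", 100, 100)
scene = [Element("enemy", 400, 100), Element("enemy", 60, 110), Element("obstacle", 101, 100)]
side = attention_side(player, scene)
```

Here the obstacle next to the player is ignored and the enemy at (60, 110) is the nearest one, so the user's attention is directed to the left.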
[0240] Next, the server uses the voice guide generation mechanism to generate voice guides appropriate to each situation based on the acquired data. These voice guides contain important information for the user and serve as a guide for action. For example, when an enemy approaches, specific instructions such as "Move to the left" are generated using speech synthesis.
[0241] This audio guide is transmitted to the terminal and provided to the user through an audio playback device. The user listens to the audio guide using a headset or similar device. Additionally, haptic feedback devices are activated as needed, providing intuitive information to the user through vibrations and other means. For example, when a specific action is required, the terminal uses vibration to signal the timing.
[0242] Furthermore, users input game commands by voice using voice input analysis via a terminal. When a user speaks instructions such as "attack" or "move right," the terminal recognizes them and sends the analysis results to the server. Based on this information, the server uses automated control mechanisms to immediately execute the corresponding action in the game.
[0243] This allows visually impaired users to understand the gameplay through audio and haptic feedback, without relying on visual information, and to have a direct and interactive experience. This system is particularly useful in situations where real-time responsiveness is required, such as in action games, as it can respond to rapidly changing game situations, thereby improving the user's gameplay experience.
[0244] The following describes the processing flow.
[0245] Step 1:
[0246] The server receives real-time screen data from the game. It captures the game's rendering frames and begins analysis using image recognition. Here, it identifies objects such as enemy characters, player characters, obstacles, and items, and extracts their positional information and movement patterns.
[0247] Step 2:
[0248] The server constructs an audio guide based on the analysis results. The audio guide generation system creates appropriate instructions according to the game situation and formats them as text data. For example, if an enemy is approaching from the right, it will create a guide that says, "An enemy is coming from the right. Please dodge." This text is then prepared to be converted into audio.
[0249] Step 3:
[0250] The server generates the audio guide and sends it to the terminal as a digital audio file. Here, text-to-speech technology is used to convert the text into audio data. The terminal receives this audio data and prepares to output the audio to the user.
[0251] Step 4:
[0252] The terminal plays the received audio guidance to the user using an audio playback device. The user listens to the audio guidance through a headset or speaker to understand the game situation. Simultaneously, haptic feedback devices generate vibrations to convey urgent information or prompt specific actions.
[0253] Step 5:
[0254] The user issues voice commands through the terminal. When the user commands a specific action, such as "jump" or "attack," the terminal records the voice.
[0255] Step 6:
[0256] The terminal analyzes the user's voice commands using voice input analysis technology. The analysis results are sent to the server as digital commands. This converts the voice content into specific game instructions.
[0257] Step 7:
[0258] Based on the received commands, the server sends instructions to the game engine via automated control mechanisms to execute corresponding actions. This allows the in-game characters to immediately perform actions such as "jump" or "attack."
[0259] Step 8:
[0260] The server and terminal work together, repeating these steps in real time, enabling users to continuously and intuitively control and enjoy the game.
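As a non-limiting illustration of Step 6 above, the analysis result may be sent from the terminal to the server as a small JSON message. The message fields "type", "action", and "seq" and the functions `encode_command` and `decode_command` are assumptions made for this sketch; the disclosure does not define a message format.

```python
import json


def encode_command(action: str, sequence: int) -> bytes:
    """Terminal side: serialize an analyzed voice command for transmission."""
    message = {"type": "command", "action": action, "seq": sequence}
    return json.dumps(message, separators=(",", ":")).encode("utf-8")


def decode_command(payload: bytes) -> dict:
    """Server side: parse a received payload and check that it is a command."""
    message = json.loads(payload.decode("utf-8"))
    if message.get("type") != "command" or "action" not in message:
        raise ValueError("not a command message")
    return message


payload = encode_command("jump", 1)
received = decode_command(payload)
```

The sequence number is included so that the server could discard duplicated or out-of-order commands; whether this is needed depends on the transport actually used.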
[0261] (Example 1)
[0262] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0263] The challenge is to provide an environment where visually impaired individuals can enjoy games that require complex real-time interaction without relying on visual information. It is necessary to establish effective means for users who have difficulty directly obtaining visual information to grasp the game situation and take appropriate actions.
[0264] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0265] In this invention, the server includes data analysis means, information conversion means, and input analysis means. This converts visual information into audio and haptic feedback, enabling the user to play the game in real time.
[0266] "Data analysis means" refers to a device or software that has the function of acquiring visual information in real time, analyzing that information, and converting it into meaningful data.
[0267] An "information conversion means" is a device or software that generates information to be conveyed to the user based on analyzed data and provides it primarily in the form of sound or haptic feedback.
[0268] "Input analysis means" refers to a device or software that receives voice instructions from a user, analyzes their content, and converts them into a format suitable for control within the system.
[0269] "Control means" refers to a device or software that has the function of performing a specified action based on instructions from the user.
[0270] "Playback means" refers to a device or software that actually outputs the generated audio guide as audio.
[0271] A "feedback mechanism" is a device or software that has the function of providing feedback to the user through touch, conveying intuitive information to the user through vibrations or other means.
[0272] This invention is a system that enables visually impaired individuals to enjoy games without relying on their vision. The server and terminal work together to provide a comfortable interactive environment for the user.
[0273] The server acquires real-time screen data from the game and analyzes it using data analysis tools. Specifically, it uses image recognition software (e.g., a general-purpose library) to identify elements in the game such as characters, obstacles, and enemies. The obtained information is converted into audio guidance by an information conversion tool, and instructions are generated for the user using speech synthesis software such as Google Text-to-Speech. For example, if a character is in danger, it might create an audio guide saying, "Move to the left to avoid danger."
[0274] This audio guide is transmitted to the terminal and provided to the user through the terminal's playback means. The user listens to it through an audio device such as a headset. The terminal also has a feedback mechanism that vibrates according to instructions from the server to communicate in-game timing and actions to the user. For example, the terminal vibrates when a specific enemy approaches, allowing the user to intuitively understand which direction to focus their attention on.
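As a rough illustration of how a vibration cue could indicate direction, the sketch below maps an enemy's horizontal offset and distance to left and right vibration strengths. The name `vibration_for_enemy`, the two-motor layout, and the 0.0 to 1.0 intensity scale are assumptions made for this example; the disclosure does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Vibration:
    """Hypothetical command for a terminal with a left and a right motor."""
    left: float   # 0.0 (off) to 1.0 (strongest)
    right: float  # 0.0 (off) to 1.0 (strongest)


def vibration_for_enemy(dx: float, distance: float, max_distance: float = 10.0) -> Vibration:
    """Map an enemy position to a vibration command.

    dx is the enemy's horizontal offset from the player (negative = left).
    Closer enemies give stronger vibration; enemies at or beyond
    max_distance give none.
    """
    if distance >= max_distance:
        return Vibration(0.0, 0.0)
    strength = round(1.0 - distance / max_distance, 2)
    if dx < 0:
        return Vibration(left=strength, right=0.0)
    if dx > 0:
        return Vibration(left=0.0, right=strength)
    # Enemy straight ahead: drive both motors equally.
    return Vibration(left=strength, right=strength)
```

Driving only the motor on the enemy's side is one possible way to convey direction; a single-motor terminal could instead vary the vibration pattern.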
[0275] Users can give instructions to the terminal using voice commands. These voice commands are analyzed by the input analysis means and converted into text data. This text is then sent to the server, where the control means automatically executes the required in-game action. For example, if a user says "jump," the server immediately reflects that command in the game.
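The conversion from recognized speech to an in-game action might look like the following minimal sketch. It assumes speech recognition has already produced text; the `COMMANDS` vocabulary, the action names, and `parse_command` are hypothetical, with only "jump" taken from the example above.

```python
from typing import Optional

# Hypothetical vocabulary: spoken phrases mapped to in-game action names.
COMMANDS = {
    "jump": "JUMP",
    "attack": "ATTACK",
    "move left": "MOVE_LEFT",
    "move right": "MOVE_RIGHT",
}


def parse_command(recognized_text: str) -> Optional[str]:
    """Return the action for recognized speech, or None if nothing matches."""
    # Normalize case and collapse repeated whitespace.
    text = " ".join(recognized_text.lower().split())
    # Check longer phrases first so a short key cannot shadow a longer one.
    for phrase in sorted(COMMANDS, key=len, reverse=True):
        if phrase in text:
            return COMMANDS[phrase]
    return None
```

Returning `None` for unrecognized speech lets the caller decide whether to ignore the input or ask the user to repeat it.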
[0276] This system allows users to enjoy games in real time. The generative AI model generates prompts tailored to user needs, improving operational efficiency and the user experience. An example of a prompt might be, "Designing a real-time audio-guided game system for the visually impaired."
[0277] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0278] Step 1:
[0279] The server obtains screen data from the game in real time. As input, there is screen information through the game's API, and the output is image data that can be analyzed. The server receives this image data to use in the next step and stores it in the processing queue.
[0280] Step 2:
[0281] The server analyzes the acquired image data using data analysis means. The input is the image data obtained in Step 1, and the output is element information such as characters, obstacles, and enemies in the game. For this, an image recognition library (e.g., a general-purpose library) is used to identify each element and extract its position and movement as numerical data.
[0282] Step 3:
[0283] Based on the analysis results, the server uses information conversion means to generate a voice guide. The input is the element information obtained in Step 2, and the output is the voice guide text provided to the user. The server uses voice synthesis software such as Google Text-to-Speech to generate specific instructions in voice form, such as "You are approaching an obstacle. Please move to the left."
[0284] Step 4:
[0285] The terminal receives the voice guide sent from the server and provides it to the user through playback means. The input is the voice guide data from the server, and the output is the voice transmitted to the user. Here, the terminal performs the operation of playing the voice through a headset and also uses the vibration function of the terminal as needed to convey information as tactile feedback.
[0286] Step 5:
[0287] The user inputs commands into the terminal by voice. The input is the user's voice instructions, and the output is command data in text format. The terminal analyzes this voice using the input analysis means and prepares to send the result to the server. For example, a command such as "jump" is extracted from the voice.
[0288] Step 6:
[0289] The server receives the voice analysis results and executes in-game actions using its automated control function. The input is the text command data from step 5, and the output is the execution of the action in the game. The server processes this in real time and sends instructions to make the avatar move according to the user's commands.
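Steps 2 and 3 above can be sketched as a function that turns element information into guide text. The `Element` fields, the `warn_distance` threshold, and the rule of steering away from the nearest element are illustrative assumptions; speech synthesis itself is left out because the disclosure does not fix a particular API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """One recognized on-screen element (hypothetical output of Step 2)."""
    kind: str        # e.g. "obstacle" or "enemy"
    x: float         # horizontal position relative to the player (negative = left)
    distance: float  # distance from the player in game units


def guide_text(elements: List[Element], warn_distance: float = 5.0) -> Optional[str]:
    """Build guide text for the nearest element within warn_distance."""
    nearby = [e for e in elements if e.distance <= warn_distance]
    if not nearby:
        return None
    nearest = min(nearby, key=lambda e: e.distance)
    # Suggest moving away from the side the element is on.
    direction = "left" if nearest.x >= 0 else "right"
    article = "an" if nearest.kind[0] in "aeiou" else "a"
    return f"You are approaching {article} {nearest.kind}. Please move to the {direction}."
```

The returned text would then be passed to the speech synthesis software and sent to the terminal as in Steps 3 and 4.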
[0290] (Application Example 1)
[0291] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0292] There is a need to provide means of support that enable visually impaired people to move more safely and efficiently in their daily lives. In particular, systems that can respond quickly to changes in the environment and obstacles are necessary.
[0293] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0294] In this invention, the server includes an image analysis means for analyzing environmental information in real time via a user interface, a voice instruction generation means for generating voice instructions based on the analyzed visual information, and a voice analysis means for receiving, analyzing, and converting voice instructions from the user into action commands. This makes it possible for visually impaired people to understand their surroundings and select appropriate actions.
[0295] A "user interface" is a means for a user to interact with a system, and is a physical or logical component for inputting and outputting information.
[0296] "Environmental information" refers to information that influences user behavior, such as the surrounding environment, the location of obstacles, and the movement path.
[0297] "Real-time analysis" refers to a processing method that immediately handles the ongoing situation and makes the results immediately available.
[0298] "Image analysis means" refers to technical means for analyzing visual data obtained from cameras and sensors and extracting important information.
[0299] "Voice instruction generation means" refers to a technical means for generating voice messages that instruct the user to take action based on analyzed data.
[0300] "Voice analysis means" refers to the process of analyzing voice commands received from a user and understanding their content.
[0301] "Automatic control means" refers to technical means that automatically manage and control the operation of a system or device based on user input and environmental information.
[0302] The system necessary to implement this invention primarily involves the user, server, and terminal each playing a specific role. The server uses a 360-degree camera and a LiDAR sensor to capture information about the surrounding environment in real time. This allows it to acquire visual information about the environment in which the user is located, and analyze this data using OpenCV. Based on the information obtained from the analysis, the voice instruction generation means generates voice instructions using a TTS (Text-to-Speech) engine.
[0303] The terminal is a device for presenting this generated voice instruction to the user. The terminal is equipped with acoustic playback means, which is used to convey the generated voice instruction to the user in an easy-to-understand manner. Thus, the user can receive information immediately. In addition, the terminal has tactile feedback means and can provide tactile information to the user by using a vibration motor or the like.
[0304] The user can perform voice input through the terminal. Thereby, the voice analysis means analyzes the user's instruction and converts it into an appropriate action command. The command content is transmitted to the server, and the automatic control means executes an appropriate action based on the content. This makes it possible to assist visually impaired people in understanding their environment and moving quickly and safely.
[0305] As a specific example, a scenario where a visually impaired person walks safely in the street can be considered. When the server scans the surrounding situation and detects that the sidewalk is under construction, the voice instruction generation means generates a specific instruction such as "Let's enter the narrow path on the right" to assist the user in making a safe detour. In this way, information can be provided immediately and actions can be prompted.
[0306] Examples of prompt sentences include "Let the robot guide the optimal detour route in real time based on visual information" and "Let the robot execute the necessary actions based on the user's voice command". Thereby, the functions of the system can be utilized to the maximum extent.
[0307] The flow of the specific process in Application Example 1 will be described with reference to FIG. 12.
[0308] Step 1:
[0309] The server continuously captures information about the surrounding environment using a 360-degree camera and LiDAR sensors. This input data is sent to the server as visual data. The server uses OpenCV to analyze this data and identify important elements such as pedestrians, obstacles, and direction of movement. The analysis yields data about the position and movement of each element in the environment.
[0310] Step 2:
[0311] The server generates voice instructions using a TTS engine based on the analyzed visual information. For example, if it detects an obstacle ahead, it will generate a voice instruction such as "Move to the right." The input is the server's analyzed data, and the output is voice data presented to the user.
[0312] Step 3:
[0313] The terminal receives audio data transmitted from the server and uses an audio playback device to present voice instructions to the user. In this case, the terminal's role is to convey the voice instructions to the user in real time. The output is the voice instructions that the user actually hears.
[0314] Step 4:
[0315] The terminal simultaneously utilizes haptic feedback mechanisms, such as vibration motors, to provide tactile information to the user. This allows the terminal to warn the user of situations requiring particular attention. The input is a control command received from the server, and the output is the vibration action performed by the terminal.
[0316] Step 5:
[0317] The user makes decisions and chooses actions based on the voice instructions, and issues voice commands to the terminal as needed. The terminal receives the user's voice commands as input.
[0318] Step 6:
[0319] The terminal receives voice input from the user and interprets its content using voice analysis tools. The analysis results are sent to the server. The input is the user's voice command, and the output is communication data to the server.
[0320] Step 7:
[0321] The server updates the system's operation and instructions using automated control mechanisms based on the analysis results. This adjusts the overall system behavior according to user requests. The input is voice analysis data from the terminal, and the output is the updated control commands.
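One simple way to derive a voice instruction from the analyzed environment is sketched below. It assumes the camera and LiDAR analysis has been summarized as the nearest-obstacle distance in three sectors; the sector names, the 2.0 m threshold, and the "Please stop." fallback are assumptions for illustration only.

```python
from typing import Dict, Optional


def navigation_instruction(clearance_m: Dict[str, float], safe_m: float = 2.0) -> Optional[str]:
    """Choose a voice instruction from per-sector clearance in metres.

    clearance_m holds the nearest-obstacle distance for "left", "ahead",
    and "right" (a hypothetical summary of the camera and LiDAR analysis).
    Returns None when the path ahead is clear.
    """
    if clearance_m["ahead"] >= safe_m:
        return None
    left, right = clearance_m["left"], clearance_m["right"]
    if max(left, right) < safe_m:
        # Neither side is safe either (assumed fallback, not from the text).
        return "Please stop."
    return "Move to the right." if right >= left else "Move to the left."
```

The resulting string corresponds to the text that the TTS engine would convert to speech in Step 2.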
[0322] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0323] This invention provides a system that enables visually impaired users to enjoy games in an immersive way, and in particular, by combining it with an emotion engine, it provides adaptive interaction that takes the user's emotions into consideration.
[0324] The system operates primarily through a server and a terminal. First, the server acquires real-time screen data from the game, analyzes it using image recognition, and identifies important elements within the game. Based on this analysis, an audio guide generation system creates situation-appropriate audio guides. The generated guides are converted from text to audio and sent from the server to the terminal.
[0325] The terminal provides the received audio guide to the user through audio playback means. The user can receive information via a headset or speaker. The terminal also uses haptic feedback means to provide the user with feedback tailored to specific game situations. For example, if an enemy is approaching, the terminal vibrates to warn the user.
[0326] Furthermore, the system incorporates an emotion engine, allowing the terminal to recognize emotions in real time through the user's facial expressions and tone of voice. The recognized emotion data is sent to the server, which then adjusts the content of the voice guide and the settings of the haptic feedback based on that information. For example, if the user is feeling stressed, the tone of the voice guide can be made calmer, or the feedback can be made gentler.
[0327] Users can give voice commands for in-game actions, and a voice input analysis system analyzes these commands and sends them to the server. The server then uses an automated system to execute the in-game actions based on this information. This enables interaction that takes user emotions into account, resulting in comfortable and intuitive gameplay. The system iterates through these processes in real time, responding flexibly to user emotions and game progress.
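The emotion-based adjustment described above could be modeled as a small settings transform. In this sketch the `GuideSettings` fields, the "stressed" label, and the specific values (slower speech, halved vibration) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GuideSettings:
    """Hypothetical presentation settings for the voice guide and vibration."""
    speech_rate: float = 1.0       # 1.0 = normal speaking speed
    voice_style: str = "neutral"
    haptic_intensity: float = 1.0  # 0.0 to 1.0


def adjust_for_emotion(settings: GuideSettings, emotion: str) -> GuideSettings:
    """Return settings adapted to the estimated emotion label."""
    if emotion == "stressed":
        # Calmer voice and gentler feedback, as in the example in the text.
        return replace(
            settings,
            speech_rate=0.9,
            voice_style="calm",
            haptic_intensity=round(settings.haptic_intensity * 0.5, 2),
        )
    # Other emotions leave the settings unchanged in this sketch.
    return settings
```

Keeping the settings immutable makes each adjustment an explicit new value, which is convenient when the loop re-evaluates the user's emotion on every iteration.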
[0328] The following describes the processing flow.
[0329] Step 1:
[0330] The server captures real-time screen data from the game and begins analysis through image recognition. This process identifies characters, obstacles, background information, etc., within the game, and based on this, acquires location information and situational data.
[0331] Step 2:
[0332] The server activates the audio guide generation system based on the analyzed data and generates the necessary audio guides as text. These guides include practical instructions and feedback for the player. For example, instructions such as "There is an enemy at 3 o'clock" might be created.
[0333] Step 3:
[0334] The server converts the generated audio guide into audio data and sends it to the terminal as a digital audio file. The file is formatted for playback on the terminal.
[0335] Step 4:
[0336] The terminal provides the user with received audio data using an audio playback device. The user receives this data through a headset and can obtain verbal instructions. Simultaneously, haptic feedback devices provide additional information through vibrations as needed.
[0337] Step 5:
[0338] The terminal activates an emotion engine and analyzes the user's facial expressions and tone of voice to recognize the user's emotional state in real time. This information is used to further improve the user's gameplay experience.
[0339] Step 6:
[0340] The recognized emotion data is sent to the server, which then adjusts the tone and content of the voice guide based on that information. It also changes the intensity and pattern of haptic feedback as needed, customizing the experience to match the user's emotions.
[0341] Step 7:
[0342] The user inputs voice commands into the terminal. For example, instructions such as "jump" or "attack" are spoken aloud. The terminal receives the voice and converts it into text using a voice input analysis device.
[0343] Step 8:
[0344] The terminal sends the analyzed voice commands to the server, and upon receiving this information, the server uses automated control mechanisms to execute the corresponding actions within the game. The server's instructions enable the character's specific movements to be realized in real time.
[0345] Step 9:
[0346] By repeating these steps in sequence, users can intuitively control and enjoy the game while responding to emotional changes. An adaptive system based on user feedback ensures a comfortable gameplay experience.
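The clock-position phrasing used in Step 2 ("There is an enemy at 3 o'clock") can be computed from a relative offset. The sketch below assumes a coordinate convention in which `dx` points to the player's right and `dy` points straight ahead; the function names are hypothetical.

```python
import math


def clock_direction(dx: float, dy: float) -> int:
    """Convert an offset from the player into a clock position (1 to 12).

    dx is the offset to the right and dy the offset straight ahead, so
    (0, 1) is 12 o'clock and (1, 0) is 3 o'clock.
    """
    # Angle measured clockwise from straight ahead, in the range [0, 360).
    angle = math.degrees(math.atan2(dx, dy)) % 360
    hour = round(angle / 30) % 12
    return 12 if hour == 0 else hour


def enemy_guide(dx: float, dy: float) -> str:
    """Phrase the direction in the style of the example guide."""
    return f"There is an enemy at {clock_direction(dx, dy)} o'clock."
```

Each clock hour covers 30 degrees, so the offset is rounded to the nearest hour.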
[0347] (Example 2)
[0348] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0349] It is difficult for visually impaired users to effectively enjoy computer games that typically rely on visual interaction. In particular, they require interfaces that appropriately adjust according to the game's progress and the player's emotional state, but no existing systems can achieve this. Therefore, there is a need for a system that considers the user's emotions, provides real-time in-game information, and allows the user to operate it intuitively.
[0350] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0351] In this invention, the server includes an image recognition means for analyzing visual information in real time, a means for generating audio guides and performing emotion recognition based on the analyzed visual information, and a means for receiving and analyzing voice instructions from the user. This allows visually impaired users to immerse themselves in the game while reducing emotional burden, and enables the user interface to flexibly adapt to the user's instantaneous emotions and instructions.
[0352] A "user interface" refers to the means or methods by which a user interacts with a system or application.
[0353] "Visual information" refers to all data that can be represented as images or videos, and this includes characters, backgrounds, and actions within a game.
[0354] "Real-time" refers to processing or responding within the actual time frame, with virtually no delay.
[0355] "Image recognition means" refers to technology that detects specific features from digital images and videos and identifies them.
[0356] "Voice guide generation means" refers to the technology and equipment used to generate necessary voice guidance based on analyzed information.
[0357] "Emotion recognition means" refers to methods and technologies for identifying an emotional state by analyzing a user's facial expressions, tone of voice, and reactions.
[0358] "Voice input analysis means" refers to technology that receives voice instructions from a user, analyzes the content, and understands its meaning.
[0359] "Automated operation means" refers to devices or technologies that automatically perform certain procedures or operations based on input instructions.
[0360] "Audio playback means" refers to the technology or device used to play back generated audio to the user.
[0361] "Haptic feedback means" refers to technologies that provide users with tactile responses and stimuli, complementing the interaction experience.
[0362] This invention provides a system that enables visually impaired users to comfortably enjoy interactive computer games. Specific embodiments of this system are described below.
[0363] The system's core consists of servers and terminals.
[0364] First, the server acquires real-time screen data from the game. This uses image recognition algorithms (e.g., OpenCV, TensorFlow, etc.) to process visual information. The server analyzes the image using the image recognition means and identifies important elements within the game. Based on this analysis, the server uses an audio guide generation means to generate situational audio guides. Speech synthesis technology (e.g., Google Cloud Text-to-Speech) is used to generate the audio.
[0365] The generated audio guide is converted from text to audio and sent from the server to the terminal. The terminal then provides this to the user through an audio playback device. The audio is played back through a headset or speaker. The terminal also uses haptic feedback to provide feedback to the user according to specific game situations. For example, if an enemy is approaching, the terminal uses vibration to warn the user.
[0366] Furthermore, the system incorporates an emotion engine, allowing the terminal to recognize emotions in real time through the user's facial expressions and voice tone. This recognition utilizes facial recognition and voice analysis technologies (e.g., Face API, IBM Watson Tone Analyzer). The recognized emotion data is sent to the server, which uses this information to adaptively adjust the tone of the voice guide and the haptic feedback.
[0367] For example, if a user is feeling stressed, the tone of the voice guidance can be softened, and the haptic feedback can be adjusted to be more gradual. Users can also give voice commands for in-game actions. These voice commands are analyzed by the terminal's voice input analysis means, and the instructions are sent to the server. The server then uses the automated operation means to execute the in-game actions based on these commands.
[0368] A concrete example of a prompt could be, "Generate an audio guide that matches the user's current emotional state to reduce the stress they experience." Following this prompt, the generative AI model provides an interface suitable for the user. As a result, implementing this system makes it possible for visually impaired users to enjoy an immersive gaming experience.
[0369] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0370] Step 1:
[0371] The server takes real-time game screen data as input. This input data is analyzed by image recognition algorithms (e.g., OpenCV, TensorFlow) to identify important elements within the game (such as enemy characters and items). The output is a list of the identified elements and their locations. This information is then used in the next step within the server.
[0372] Step 2:
[0373] The server uses the analysis results obtained in Step 1 as input to create an audio guide using an audio guide generation system. Specifically, it generates instructions and explanations in text format that correspond to the in-game situation, and then converts this into speech using speech synthesis technology (e.g., Google Cloud Text-to-Speech). The output is the generated audio data, which is then ready to be presented to the user.
[0374] Step 3:
[0375] The server processes the generated audio data for transmission to the terminal. The audio data is transmitted in real time over a stable network. The input is the audio data generated in step 2, and the output is the audio communication data delivered to the terminal.
[0376] Step 4:
[0377] The terminal receives audio data from the server as input and provides audio guidance to the user via an audio playback device (e.g., a headset or speaker). Specifically, the terminal triggers the start of audio playback and automatically adjusts the volume and sound quality as needed. The output is the audio guidance that the user hears.
[0378] Step 5:
[0379] The terminal uses vibration motors and other mechanisms to provide haptic feedback to the user, depending on the game situation. For example, it vibrates when an enemy approaches. The input is game situation information from the server, and the output is the haptic stimulus felt by the user.
[0380] Step 6:
[0381] The terminal uses the user's face and voice as input and analyzes their emotions in real time using emotion recognition technology. Specifically, it collects data using a face recognition camera and microphone, and identifies the user's emotional state using an analysis algorithm (e.g., Face API, voice tone analysis technology). The output is the user's emotion data, which is used in the next step.
[0382] Step 7:
[0383] Based on the emotional data obtained in step 6, the server dynamically adjusts the content of the voice guidance and haptic feedback. For example, if the user is stressed, the tone of the voice guidance may be softened or the vibrations may be reduced. The input is the user's emotional data, and the output is the adjusted interface settings.
[0384] Step 8:
[0385] The user controls in-game actions by inputting voice commands. The input voice commands are analyzed by the terminal's voice input analysis system and sent to the server. Based on this information, the server uses an automated control system to execute in-game actions. The output is the action actually performed in the game based on the user's instructions.
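The server-to-terminal exchange in Steps 3 to 5 needs some message format for audio data and haptic commands. The disclosure does not define one, so the following is only a sketch under assumptions: a JSON envelope with a base64-encoded `audio_b64` field and a `vibrate_ms` field, both hypothetical names.

```python
import base64
import json
from typing import List, Optional, Tuple


def encode_update(audio: Optional[bytes] = None, vibrate_ms: Optional[int] = None) -> str:
    """Serialize one server-to-terminal update as a JSON string."""
    message = {}
    if audio is not None:
        # JSON cannot carry raw bytes, so the audio is base64-encoded.
        message["audio_b64"] = base64.b64encode(audio).decode("ascii")
    if vibrate_ms is not None:
        message["vibrate_ms"] = vibrate_ms
    return json.dumps(message)


def decode_update(raw: str) -> List[Tuple[str, object]]:
    """Decode an update and list the terminal actions it triggers, in order."""
    message = json.loads(raw)
    actions: List[Tuple[str, object]] = []
    if "audio_b64" in message:
        actions.append(("play_audio", base64.b64decode(message["audio_b64"])))
    if "vibrate_ms" in message:
        actions.append(("vibrate", message["vibrate_ms"]))
    return actions
```

Carrying both fields in one update lets the terminal start playback and vibration together; a real deployment would more likely stream audio over a dedicated channel.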
[0386] (Application Example 2)
[0387] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0388] For visually impaired users to enjoy games and entertainment with a sense of immersion, there is a need for technology that goes beyond simply supplementing audiovisual information and provides rich, interactive experiences that take into account the user's emotional state. However, current technologies do not make real-time adjustments in response to the user's emotions, resulting in a limited user experience.
[0389] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0390] In this invention, the server includes image recognition means for analyzing visual information in real time via a user interface, voice guide generation means for generating voice guides based on the analyzed visual information, and emotion analysis means for analyzing the user's emotional state and adjusting the voice guides and feedback based on the analysis results. This makes it possible to provide appropriate voice guides and haptic feedback according to the user's emotional state, realizing an immersive and intuitive interaction.
[0391] A "user interface" is a mechanism that provides a point of contact for users to interact with a system and exchange information.
[0392] "Visual information" refers to data related to vision, such as images and videos.
[0393] "Image recognition means" refers to technologies for automatically identifying specific features or objects from visual information.
[0394] A "voice guide generation means" is a method for providing instructions and guidance to a user through voice, based on analyzed visual information.
[0395] "Emotion analysis means" refers to technologies that estimate a user's emotional state from their facial expressions and voice, and then analyze that information.
[0396] "Voice input analysis means" refers to technology that analyzes voice commands received from a user and converts them into an appropriate digital format.
[0397] An "automatic operation means" is a mechanism for automatically performing specific actions or operations based on voice instructions or other input information.
[0398] "Audio playback means" refers to devices or technologies that allow users to listen to the generated audio guide.
[0399] A "tactile feedback means" is a method of providing physical tactile stimuli to a user and transmitting information.
[0400] This invention is a system for visually impaired users to comfortably enjoy games and entertainment. The system consists of a server and terminals, which exchange information with each other to provide the user with an immersive experience. Specifically, the invention will be implemented in the following form.
[0401] Server Role
[0402] The server acquires visual information from external sources via the user interface and analyzes this information using image recognition. Based on this analysis, the voice guide generation system creates a situation-appropriate voice guide and sends it to the terminal. Furthermore, the server is equipped with emotion analysis capabilities, which analyze the user's voice and video sent from the terminal to understand the user's emotions. Based on this emotion data, it is possible to adjust the content of the voice guide and haptic feedback.
[0403] Terminal role
[0404] The terminal is equipped with audio playback and haptic feedback mechanisms, and provides the user with the audio guides sent from the server. The audio guides are produced by converting text to speech and deliver information to the user in real time. Furthermore, the terminal can provide feedback through the haptic feedback mechanism depending on specific game situations. For example, it can issue a warning using vibration when an enemy approaches.
[0405] User interaction
[0406] Users can give spoken instructions to the terminal, where they are processed by the voice input analysis means. These instructions are sent to the server, and the automatic operation means executes the appropriate in-game actions. This allows users to experience flexible and intuitive interactions that respond to their emotional state.
[0407] Specific example
[0408] For example, if a user says to their device, "I want to play a fictional adventure game," the server receives this instruction and starts a suitable game. Also, if the user shows signs of anxiety, the server uses emotion analysis to detect this and sends a voice guide such as, "Take a deep breath and enjoy this scene."
[0409] Example of a prompt
[0410] "Please generate voice messages to provide emotionally resonant and gentle advice to visually impaired users."
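A prompt like the one above would typically be combined with the current inference data before being passed to the generative AI model. The sketch below shows one way to assemble such a prompt as plain text; `build_prompt`, the field labels, and the line layout are assumptions, and the call to the model itself is omitted.

```python
from typing import List

# Instruction taken from the example prompt in the text.
BASE_INSTRUCTION = (
    "Please generate voice messages to provide emotionally resonant and "
    "gentle advice to visually impaired users."
)


def build_prompt(emotion: str, elements: List[str]) -> str:
    """Combine the instruction with the current emotion and scene elements."""
    scene = ", ".join(elements) if elements else "none detected"
    return (
        f"{BASE_INSTRUCTION}\n"
        f"Estimated user emotion: {emotion}\n"
        f"Recognized elements: {scene}"
    )
```

For instance, `build_prompt("anxious", ["enemy", "bridge"])` yields the instruction followed by one line for the emotion and one for the recognized elements.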
[0411] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0412] Step 1:
[0413] The server acquires visual information from an external source. The acquired data is sent to an image recognition system to identify specific features or objects. The input is visual information, and the output is the identified in-game elements.
[0414] Step 2:
[0415] The server generates an audio guide based on the information identified in step 1. Using the audio guide generation means, it creates a guide in text format corresponding to the identified elements. This text data is converted into audio, and the generated audio is sent to the terminal. The input is the in-game elements, and the output is the audio guide.
[0416] Step 3:
[0417] The terminal provides the user with audio guides transmitted from the server via an audio playback device. The audio is played back through a speaker or headset, conveying information to the user in real time. The input is the audio guide, and the output is audio information.
[0418] Step 4:
[0419] The terminal activates haptic feedback mechanisms based on instructions from the server. This provides feedback to the user by generating vibrations or pressures appropriate to specific game situations. The input is a situation-specific feedback command, and the output is haptic information.
[0420] Step 5:
[0421] The user sends voice instructions to the terminal. Through a voice input analysis system, the voice data is converted to text and sent to the server. The input is the user's voice, and the output is the textualized instructions.
[0422] Step 6:
[0423] The server analyzes the user's instructions and automatically executes the corresponding in-game actions. This process ensures that user instructions are quickly reflected in the game. Input is text instructions, and output is game actions.
[0424] Step 7:
[0425] The server analyzes user emotion data acquired from the terminal. Using emotion analysis tools, it infers the user's emotions from voice and facial expression data and adjusts the content of voice guidance and haptic feedback accordingly. The input is emotion data, and the output is the adjusted guidance and feedback.
[0426] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0427] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0428] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0429] [Third Embodiment]
[0430] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0431] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0432] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0433] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0434] The microphone 238 receives instructions from the user 20 by capturing the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0435] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0436] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0437] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0438] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0439] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0440] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0441] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0442] This invention is a system that enables visually impaired individuals to enjoy games in the same way as sighted individuals, and it consists of multiple functional modules. This system operates in cooperation with a server and terminals.
[0443] First, the server is equipped with image recognition capabilities to analyze real-time game screen data. These image recognition capabilities identify important elements within the game, such as characters, obstacles, and enemies, and digitize the position and status of each element. For example, in an action game, the server identifies the movement of enemies around the player character and determines which direction the user should focus their attention.
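As a non-limiting illustration, the determination of which direction the user should focus on could be sketched as follows. The sketch assumes that image recognition has already produced screen coordinates for each element; the `Element` type, the `attention_direction` function, and the rule of picking the nearest enemy are hypothetical and are not taken from this disclosure.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    kind: str   # e.g. "player", "enemy", "obstacle" (hypothetical labels)
    x: float    # screen x in pixels, increasing to the right
    y: float    # screen y in pixels, increasing downward


def attention_direction(player: Element, elements: list[Element]) -> Optional[str]:
    """Return the coarse direction of the nearest enemy, or None if there is none."""
    enemies = [e for e in elements if e.kind == "enemy"]
    if not enemies:
        return None
    # Assumption: the nearest enemy is the one that deserves attention.
    nearest = min(enemies, key=lambda e: math.hypot(e.x - player.x, e.y - player.y))
    dx, dy = nearest.x - player.x, nearest.y - player.y
    # Report the dominant axis of the offset as the direction.
    if abs(dx) >= abs(dy):
        return "right" if dx >= 0 else "left"
    return "below" if dy >= 0 else "above"
```

A real system would likely also weigh enemy speed and heading, which this sketch ignores.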
[0444] Next, the server uses the voice guide generation mechanism to generate voice guides appropriate to each situation based on the acquired data. These voice guides contain important information for the user and serve as a guide for action. For example, when an enemy approaches, specific instructions such as "Move to the left" are generated using speech synthesis.
[0445] This audio guide is transmitted to the terminal and provided to the user through an audio playback device. The user listens to the audio guide using a headset or similar device. Additionally, haptic feedback devices are activated as needed, providing intuitive information to the user through vibrations and other means. For example, when a specific action is required, the terminal uses vibration to indicate the timing.
[0446] Furthermore, users input game commands by voice using voice input analysis via a terminal. When a user speaks instructions such as "attack" or "move right," the terminal recognizes them and sends the analysis results to the server. Based on this information, the server uses automated control mechanisms to immediately execute the corresponding action in the game.
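As a non-limiting illustration, the conversion of a recognized utterance into a game command could be sketched as follows. The sketch assumes that speech recognition has already produced a text transcript; the `COMMANDS` table, the command codes, and the `parse_command` function are hypothetical.

```python
from typing import Optional

# Hypothetical mapping from spoken phrases to command codes sent to the server.
COMMANDS = {
    "attack": "ATTACK",
    "jump": "JUMP",
    "move right": "MOVE_RIGHT",
    "move left": "MOVE_LEFT",
}


def parse_command(transcript: str) -> Optional[str]:
    """Return the command code for the first matching phrase, or None."""
    # Normalize case and collapse repeated whitespace.
    text = " ".join(transcript.lower().split())
    # Check longer phrases first so a longer phrase is never shadowed by a shorter one.
    for phrase in sorted(COMMANDS, key=len, reverse=True):
        if phrase in text:
            return COMMANDS[phrase]
    return None
```

Returning `None` for unrecognized speech lets the terminal ignore the utterance or ask the user to repeat it, rather than sending a guessed command.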
[0447] This allows visually impaired users to understand the gameplay through audio and haptic feedback, without relying on visual information, and to have a direct and interactive experience. This system is particularly useful in situations where real-time responsiveness is required, such as in action games, as it can respond to rapidly changing game situations, thereby improving the user's gameplay experience.
[0448] The following describes the processing flow.
[0449] Step 1:
[0450] The server receives real-time screen data from the game. It captures the game's rendering frames and begins analysis using image recognition. Here, it identifies objects such as enemy characters, player characters, obstacles, and items, and extracts their positional information and movement patterns.
[0451] Step 2:
[0452] The server constructs an audio guide based on the analysis results. The audio guide generation system creates appropriate instructions according to the game situation and formats them as text data. For example, if an enemy is approaching from the right, it will create a guide that says, "An enemy is coming from the right. Please dodge." This text is then prepared to be converted into audio.
[0453] Step 3:
[0454] The server generates the audio guide and sends it to the terminal as a digital audio file. Here, text-to-speech technology is used to convert the text into audio data. The terminal receives this audio data and prepares to output the audio to the user.
[0455] Step 4:
[0456] The terminal plays the received audio guide to the user through an audio playback device. The user listens to the audio guide through a headset or speaker to understand the game situation. Simultaneously, haptic feedback devices generate vibrations to convey urgent information or prompt specific actions.
[0457] Step 5:
[0458] The user issues voice commands through the terminal. If the user commands a specific action, such as "jump" or "attack," the voice is recorded by the terminal.
[0459] Step 6:
[0460] The terminal analyzes the user's voice commands using voice input analysis technology. The analysis results are sent to the server as digital commands. This converts the voice content into specific game instructions.
[0461] Step 7:
[0462] Based on the received commands, the server sends instructions to the game engine via automated control mechanisms to execute corresponding actions. This allows the in-game characters to immediately perform actions such as "jump" or "attack."
[0463] Step 8:
[0464] The server and terminal work together, repeating these steps in real time, enabling users to continuously and intuitively control and enjoy the game.
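As a non-limiting illustration of Steps 2 and 4 above, the construction of the guide text and the decision to add vibration could be sketched as follows. The `Guidance` type, the `build_guidance` function, and the distance threshold `urgent_within` are hypothetical; the disclosure does not specify how urgency is determined.

```python
from dataclasses import dataclass


@dataclass
class Guidance:
    text: str       # sentence to be handed to a text-to-speech engine
    vibrate: bool   # whether haptic feedback should accompany the audio


def build_guidance(direction: str, distance: float, urgent_within: float = 150.0) -> Guidance:
    """Build guide text for an approaching enemy and flag urgent cases for vibration."""
    text = f"An enemy is coming from the {direction}. Please dodge."
    # Assumption: an enemy closer than the threshold counts as urgent.
    return Guidance(text=text, vibrate=distance <= urgent_within)
```

For example, `build_guidance("right", 100.0)` yields the sentence quoted in Step 2 with vibration enabled, while a distant enemy yields the same sentence without vibration.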
[0465] (Example 1)
[0466] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0467] The challenge is to provide an environment where visually impaired individuals can enjoy games that require complex real-time interaction without relying on visual information. It is necessary to establish effective means for users who have difficulty directly obtaining visual information to grasp the game situation and take appropriate actions.
[0468] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0469] In this invention, the server includes data analysis means, information conversion means, and input analysis means. This converts visual information into audio and haptic feedback, enabling the user to play the game in real time.
[0470] "Data analysis means" refers to a device or software that has the function of acquiring visual information in real time, analyzing that information, and converting it into meaningful data.
[0471] An "information conversion means" is a device or software that generates information to be conveyed to the user based on analyzed data and provides it primarily in the form of sound or haptic feedback.
[0472] "Input analysis means" refers to a device or software that receives voice instructions from a user, analyzes their content, and converts them into a format suitable for control within the system.
[0473] "Control means" refers to a device or software that has the function of performing a specified action based on instructions from the user.
[0474] "Playback means" refers to a device or software that actually outputs the generated audio guide as audio.
[0475] A "feedback mechanism" is a device or software that has the function of providing feedback to the user through touch, conveying intuitive information to the user through vibrations or other means.
[0476] This invention is a system that enables visually impaired individuals to enjoy games without relying on their vision. The server and terminal work together to provide a comfortable interactive environment for the user.
[0477] The server acquires real-time screen data from the game and analyzes it using data analysis tools. Specifically, it uses image recognition software (e.g., a general-purpose library) to identify elements in the game such as characters, obstacles, and enemies. The obtained information is converted into audio guidance by an information conversion tool, and instructions are generated for the user using speech synthesis software such as Google Text-to-Speech. For example, if a character is in danger, it might create an audio guide saying, "Move to the left to avoid danger."
[0478] This audio guide is transmitted to the terminal and provided to the user through the terminal's playback means. The user listens to it through an audio device such as a headset. The terminal also has a feedback mechanism that vibrates according to instructions from the server to communicate in-game timing and actions to the user. For example, the terminal vibrates when a specific enemy approaches, allowing the user to intuitively understand which direction to focus their attention.
[0479] Users can give instructions to the terminal using voice commands. These voice commands are analyzed by the input analysis means and converted into text data. This text is then sent to the server, where the required actions in the game are executed through an automated control function. For example, if a user says "jump," the server immediately causes that command to be reflected in the game.
[0480] This system allows users to enjoy games in real time. The generative AI model generates prompts tailored to user needs, improving operational efficiency and the user experience. An example of a prompt might be, "Designing a real-time audio-guided game system for the visually impaired."
[0481] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0482] Step 1:
[0483] The server acquires screen data from the game in real time. The input is screen information received through the game's API, and the output is image data that can be analyzed. The server receives this image data for use in the next step and stores it in a processing queue.
[0484] Step 2:
[0485] The server uses data analysis tools to analyze the acquired image data. The input is the image data acquired in step 1, and the output is elemental information such as characters, obstacles, and enemies in the game. An image recognition library (e.g., a general-purpose library) is used to identify each element and extract its position and movement as numerical data.
[0486] Step 3:
[0487] Based on the analysis results, the server generates an audio guide using an information conversion mechanism. The input is the elemental information obtained in step 2, and the output is the audio guide text provided to the user. The server uses speech synthesis software such as Google Text-to-Speech to generate specific instructions in audio format, such as "You are approaching an obstacle. Move to the left."
[0488] Step 4:
[0489] The terminal receives audio guidance transmitted from the server and provides it to the user via a playback mechanism. The input is audio guidance data from the server, and the output is the audio transmitted to the user. Here, the terminal plays the audio through the headset and, if necessary, uses its vibration function to convey information as haptic feedback.
[0490] Step 5:
[0491] The user inputs commands into the terminal by voice. The input is the user's voice instructions, and the output is command data in text format. The terminal analyzes this voice using the input analysis means and prepares to send the result to the server. For example, a command such as "jump" is extracted from the voice.
[0492] Step 6:
[0493] The server receives the voice analysis results and executes in-game actions using its automated control function. The input is the text command data from step 5, and the output is the execution of the action in the game. The server processes this in real time and sends instructions to make the avatar move according to the user's commands.
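As a non-limiting illustration of Steps 1 to 3 above, the processing queue and the hand-off from image analysis to guide text could be sketched as follows. The `process_next_frame` function is hypothetical, and the `analyze` and `to_guide_text` callables stand in for the image recognition library and the information conversion means, whose actual interfaces are not specified in this disclosure.

```python
from queue import Empty, Queue
from typing import Callable, Optional

Frame = bytes  # placeholder type for one captured game frame


def process_next_frame(
    frames: "Queue[Frame]",
    analyze: Callable[[Frame], list],
    to_guide_text: Callable[[list], Optional[str]],
) -> Optional[str]:
    """Take one frame from the queue and return guide text, or None if there is nothing to say."""
    try:
        frame = frames.get_nowait()   # Step 1: frames wait in the processing queue
    except Empty:
        return None                   # no frame available yet
    elements = analyze(frame)         # Step 2: image data -> element information
    return to_guide_text(elements)    # Step 3: element information -> guide text
```

Passing the analysis and conversion stages in as callables keeps the sketch independent of any particular recognition or speech synthesis library.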
[0494] (Application Example 1)
[0495] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0496] There is a need to provide means of support that enable visually impaired people to move more safely and efficiently in their daily lives. In particular, systems that can respond quickly to changes in the environment and obstacles are necessary.
[0497] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0498] In this invention, the server includes an image analysis means for analyzing environmental information in real time via a user interface, a voice instruction generation means for generating voice instructions based on the analyzed visual information, and a voice analysis means for receiving, analyzing, and converting voice instructions from the user into action commands. This makes it possible for visually impaired people to understand their surroundings and select appropriate actions.
[0499] A "user interface" is a means for a user to interact with a system, and is a physical or logical component for inputting and outputting information.
[0500] "Environmental information" refers to information that influences user behavior, such as the surrounding environment, the location of obstacles, and the movement path.
[0501] "Real-time analysis" refers to a processing method that immediately handles the ongoing situation and makes the results immediately available.
[0502] "Image analysis means" refers to technical means for analyzing visual data obtained from cameras and sensors and extracting important information.
[0503] "Voice instruction generation means" refers to a technical means for generating voice messages that instruct the user to take action based on analyzed data.
[0504] "Voice analysis means" refers to the process of analyzing voice commands received from a user and understanding their content.
[0505] "Automatic control means" refers to technical means that automatically manage and control the operation of a system or device based on user input and environmental information.
[0506] The system necessary to implement this invention primarily involves the user, server, and terminal each playing a specific role. The server uses a 360-degree camera and a LiDAR sensor to capture information about the surrounding environment in real time. This allows it to acquire visual information about the environment in which the user is located, and analyze this data using OpenCV. Based on the information obtained from the analysis, the voice instruction generation means generates voice instructions using a TTS (Text-to-Speech) engine.
[0507] The terminal is a device for presenting these generated voice instructions to the user. The terminal is equipped with sound playback means, which are used to clearly convey the generated voice instructions to the user. Therefore, the user can receive the information immediately. In addition, the terminal has haptic feedback means, which can provide tactile information to the user using a vibration motor or similar device.
[0508] Users can input voice commands through a terminal. A voice analysis system then analyzes the user's instructions and converts them into appropriate action commands. These commands are sent to a server, and an automated control system executes the appropriate actions based on that information. This enables visually impaired individuals to understand their environment and move quickly and safely.
[0509] A concrete example is a scenario where a visually impaired person needs to walk safely in a city. When the server scans the surroundings and detects that the sidewalk is under construction, the voice instruction generation system generates specific instructions such as "Take the narrow path to the right," helping the user to safely detour. In this way, information can be provided immediately, prompting action.
[0510] Examples of prompts include, "Please have the robot guide you to the optimal detour route in real time based on visual information," and "Please have the robot perform the necessary actions based on the user's voice commands." This allows for maximum utilization of the system's capabilities.
[0511] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0512] Step 1:
[0513] The server continuously captures information about the surrounding environment using a 360-degree camera and LiDAR sensors. This input data is sent to the server as visual data. The server uses OpenCV to analyze this data and identify important elements such as pedestrians, obstacles, and direction of movement. The analysis yields data about the position and movement of each element in the environment.
[0514] Step 2:
[0515] The server generates voice instructions using a TTS engine based on the analyzed visual information. For example, if it detects an obstacle ahead, it will generate a voice instruction such as "Move to the right." The input is the server's analyzed data, and the output is voice data presented to the user.
[0516] Step 3:
[0517] The terminal receives audio data transmitted from the server and uses an audio playback device to present voice instructions to the user. In this case, the terminal's role is to convey the voice instructions to the user in real time. The output is the voice instructions that the user actually hears.
[0518] Step 4:
[0519] The terminal simultaneously utilizes haptic feedback mechanisms, such as vibration motors, to provide tactile information to the user. This allows the terminal to warn the user of situations requiring particular attention. The input is a control command received from the server, and the output is the vibration action performed by the terminal.
[0520] Step 5:
[0521] The user makes decisions and chooses actions based on the voice instructions, and issues voice commands to the terminal as needed. The user's voice commands are received by the terminal as input.
[0522] Step 6:
[0523] The terminal receives voice input from the user and interprets its content using voice analysis tools. The analysis results are sent to the server. The input is the user's voice command, and the output is communication data to the server.
[0524] Step 7:
[0525] The server updates the system's operation and instructions using automated control mechanisms based on the analysis results. This adjusts the overall system behavior according to user requests. The input is voice analysis data from the terminal, and the output is the updated control commands.
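As a non-limiting illustration of Step 2 above, the selection of a walking instruction from the analyzed environment could be sketched as follows. The sketch assumes that the analysis has already been reduced to the free distance, in meters, in three sectors (left, front, right); the `walking_instruction` function, the `safe_m` threshold, and the "Stop." fallback are hypothetical and are not taken from this disclosure.

```python
from typing import Optional


def walking_instruction(left_m: float, front_m: float, right_m: float,
                        safe_m: float = 2.0) -> Optional[str]:
    """Return a spoken instruction when the path ahead is blocked, else None."""
    if front_m >= safe_m:
        return None  # path ahead is clear; no instruction is needed
    # Prefer the side with more free space, provided it is itself safe.
    if right_m >= safe_m and right_m >= left_m:
        return "Move to the right."
    if left_m >= safe_m:
        return "Move to the left."
    # Assumption: when no direction is safe, the user is told to stop.
    return "Stop."
```

The returned sentence would then be passed to the TTS engine, and the same condition could trigger the vibration described in Step 4.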
[0526] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0527] This invention provides a system that enables visually impaired users to enjoy games in an immersive way, and in particular, by combining it with an emotion engine, it provides adaptive interaction that takes the user's emotions into consideration.
[0528] The system operates primarily through a server and a terminal. First, the server acquires real-time screen data from the game, analyzes it using image recognition, and identifies important elements within the game. Based on this analysis, an audio guide generation system creates situation-appropriate audio guides. The generated guides are converted from text to audio and sent from the server to the terminal.
[0529] The terminal provides the received audio guide to the user through audio playback means. The user can receive information via a headset or speaker. The terminal also uses haptic feedback means to provide the user with feedback tailored to specific game situations. For example, if an enemy is approaching, the terminal will vibrate to warn the user.
[0530] Furthermore, the system incorporates an emotion engine, allowing the terminal to recognize emotions in real time through the user's facial expressions and tone of voice. The recognized emotion data is sent to the server, which then adjusts the content of the voice guide and the settings of the haptic feedback based on that information. For example, if the user is feeling stressed, the tone of the voice guide can be made calmer, or the feedback can be made gentler.
[0531] Users can give voice commands for in-game actions, and a voice input analysis system analyzes these commands and sends them to the server. The server then uses an automated system to execute the in-game actions based on this information. This enables interaction that takes user emotions into account, resulting in comfortable and intuitive gameplay. The system iterates through these processes in real time, responding flexibly to user emotions and game progress.
[0532] The following describes the processing flow.
[0533] Step 1:
[0534] The server captures real-time screen data from the game and begins analysis through image recognition. This process identifies characters, obstacles, background information, etc., within the game, and based on this, acquires location information and situational data.
[0535] Step 2:
[0536] The server activates the audio guide generation system based on the analyzed data and generates the necessary audio guides as text. These guides include practical instructions and feedback for the player. For example, instructions such as "There is an enemy at 3 o'clock" might be created.
[0537] Step 3:
[0538] The server converts the generated audio guide into audio data and sends it to the terminal as a digital audio file. The file is formatted for playback on the terminal.
[0539] Step 4:
[0540] The terminal provides the user with received audio data using an audio playback device. The user receives this data through a headset and can obtain verbal instructions. Simultaneously, haptic feedback devices provide additional information through vibrations as needed.
[0541] Step 5:
[0542] The terminal activates an emotion engine and analyzes the user's facial expressions and tone of voice to recognize the user's emotional state in real time. This information is used to further improve the user's gameplay experience.
[0543] Step 6:
[0544] The recognized emotion data is sent to the server, which then adjusts the tone and content of the voice guide based on that information. It also changes the intensity and pattern of haptic feedback as needed, customizing the experience to match the user's emotions.
[0545] Step 7:
[0546] The user inputs voice commands into the terminal. For example, instructions such as "jump" or "attack" are spoken aloud. The terminal receives the voice and converts it into text using a voice input analysis device.
[0547] Step 8:
[0548] The terminal sends the analyzed voice commands to the server, and upon receiving this information, the server uses automated control mechanisms to execute the corresponding actions within the game. The server's instructions enable the character's specific movements to be realized in real time.
[0549] Step 9:
[0550] By repeating these steps in sequence, users can intuitively control and enjoy the game while responding to emotional changes. An adaptive system based on user feedback ensures a comfortable gameplay experience.
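As a non-limiting illustration of the Step 2 example "There is an enemy at 3 o'clock," the conversion of a screen offset into a clock position could be sketched as follows. The sketch assumes screen coordinates in which x increases to the right and y increases downward, with 12 o'clock pointing up the screen; the function names are hypothetical.

```python
import math


def clock_position(dx: float, dy: float) -> int:
    """Convert an offset from the player (screen coordinates) to a clock hour, 1 to 12."""
    # Angle measured clockwise from "up"; dy is negated because screen y grows downward.
    angle = math.degrees(math.atan2(dx, -dy)) % 360.0
    hour = round(angle / 30.0) % 12   # each clock hour spans 30 degrees
    return 12 if hour == 0 else hour


def clock_guide(dx: float, dy: float) -> str:
    """Build the guide sentence for an enemy at the given offset."""
    return f"There is an enemy at {clock_position(dx, dy)} o'clock."
```

An enemy directly to the right of the player (positive `dx`, zero `dy`) is reported at 3 o'clock, and one directly above at 12 o'clock.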
[0551] (Example 2)
[0552] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0553] It is difficult for visually impaired users to effectively enjoy computer games that typically rely on visual interaction. In particular, they require interfaces that appropriately adjust according to the game's progress and the player's emotional state, but no existing systems can achieve this. Therefore, there is a need for a system that considers the user's emotions, provides real-time in-game information, and allows the user to operate it intuitively.
[0554] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0555] In this invention, the server includes an image recognition means for analyzing visual information in real time, a means for generating audio guides and performing emotion recognition based on the analyzed visual information, and a means for receiving and analyzing voice instructions from the user. This allows visually impaired users to immerse themselves in the game while reducing emotional burden, and enables the user interface to flexibly adapt to the user's instantaneous emotions and instructions.
[0556] A "user interface" refers to the means or methods by which a user interacts with a system or application.
[0557] "Visual information" refers to all data that can be represented as images or videos, and this includes characters, backgrounds, and actions within a game.
[0558] "Real-time" refers to processing or responding within the actual time frame, with virtually no delay.
[0559] "Image recognition means" refers to technology that detects specific features from digital images and videos and identifies them.
[0560] "Voice guide generation means" refers to the technology and equipment used to generate necessary voice guidance based on analyzed information.
[0561] "Emotion recognition means" refers to methods and technologies for identifying an emotional state by analyzing a user's facial expressions, tone of voice, and reactions.
[0562] "Voice input analysis means" refers to technology that receives voice instructions from a user, analyzes the content, and understands its meaning.
[0563] "Automated operation means" refers to devices or technologies that automatically perform certain procedures or operations based on input instructions.
[0564] "Audio playback means" refers to the technology or device used to play back generated audio to the user.
[0565] "Haptic feedback means" refers to technologies that provide users with tactile responses and stimuli, complementing the interaction experience.
[0566] This invention provides a system that enables visually impaired users to comfortably enjoy interactive computer games. Specific embodiments of this system are described below.
[0567] The system's core consists of servers and terminals.
[0568] First, the server acquires real-time screen data from the game. This uses image recognition algorithms (e.g., OpenCV, TensorFlow, etc.) to process visual information. The server analyzes the image using the image recognition means and identifies important elements within the game. Based on this analysis, the server uses an audio guide generation means to generate situational audio guides. Speech synthesis technology (e.g., Google Cloud Text-to-Speech) is used to generate the audio.
[0569] The generated audio guide is converted from text to audio and sent from the server to the terminal. The terminal then provides this to the user through an audio playback device. The audio is played back through a headset or speaker. The terminal also uses haptic feedback to provide feedback to the user according to specific game situations. For example, if an enemy is approaching, the terminal uses vibration to warn the user.
[0570] Furthermore, the system incorporates an emotion engine, allowing the terminal to recognize emotions in real time through the user's facial expressions and voice tone. This recognition utilizes facial recognition and voice analysis technologies (e.g., Face API, IBM Watson Tone Analyzer). The recognized emotion data is sent to the server, which uses this information to adaptively adjust the tone of the voice guide and the haptic feedback.
[0571] For example, if a user is feeling stressed, the tone of the voice guidance can be softened, and the haptic feedback can be adjusted to be more gradual. Users can also give voice commands for in-game actions. These voice commands are analyzed by the device's voice input analysis system, and the instructions are sent to the server. The server then uses automated control systems to execute the in-game actions based on these commands.
[0572] A concrete example of a prompt could be, "Generate an audio guide that matches the user's current emotional state to reduce the stress they experience." Following this prompt, the generative AI model provides an interface suitable for the user. As a result, implementing this system makes it possible for visually impaired users to enjoy an immersive gaming experience.
[0573] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0574] Step 1:
[0575] The server takes real-time game screen data as input. This input data is analyzed by image recognition algorithms (e.g., OpenCV, TensorFlow) to identify important elements within the game (such as enemy characters and items). The output is a list of the identified elements and their locations. This information is then used in the next step within the server.
[0576] Step 2:
[0577] The server uses the analysis results obtained in Step 1 as input to create an audio guide using an audio guide generation system. Specifically, it generates instructions and explanations in text format that correspond to the in-game situation, and then converts this into speech using speech synthesis technology (e.g., Google Cloud Text-to-Speech). The output is the generated audio data, which is then ready to be presented to the user.
[0578] Step 3:
[0579] The server processes the generated audio data for transmission to the terminal. The audio data is transmitted in real time over a stable network. The input is the audio data generated in step 2, and the output is the audio communication data delivered to the terminal.
[0580] Step 4:
[0581] The terminal receives audio data from the server as input and provides audio guidance to the user via an audio playback device (e.g., a headset or speaker). Specifically, the terminal triggers the start of audio playback and automatically adjusts the volume and sound quality as needed. The output is the audio guidance that the user hears.
[0582] Step 5:
[0583] The terminal uses vibration motors and other mechanisms to provide haptic feedback to the user, depending on the game situation. For example, it vibrates when an enemy approaches. The input is game situation information from the server, and the output is the haptic stimulus felt by the user.
[0584] Step 6:
[0585] The terminal uses the user's face and voice as input and analyzes the user's emotions in real time using emotion recognition technology. Specifically, it collects data using a face recognition camera and a microphone, and identifies the user's emotional state using an analysis algorithm (e.g., Face API, voice tone analysis technology). The output is the user's emotion data, which is used in the next step.
[0586] Step 7:
[0587] Based on the emotional data obtained in step 6, the server dynamically adjusts the content of the voice guidance and haptic feedback. For example, if the user is stressed, the tone of the voice guidance may be softened or the vibrations may be reduced. The input is the user's emotional data, and the output is the adjusted interface settings.
[0588] Step 8:
[0589] The user controls in-game actions by inputting voice commands. The input voice commands are analyzed by the terminal's voice input analysis system and sent to the server. Based on this information, the server uses an automated control system to execute in-game actions. The output is the action actually performed in the game based on the user's instructions.
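The adjustment described in Step 7 can be illustrated with a minimal sketch. The class `GuideSettings`, the function `adjust_for_emotion`, the emotion label "stressed", the confidence threshold, and all scaling factors are hypothetical values chosen for illustration; the disclosure does not specify them.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GuideSettings:
    """Hypothetical interface settings sent from the server to the terminal."""
    speaking_rate: float = 1.0       # 1.0 = normal speech speed
    volume: float = 1.0              # 0.0 (silent) to 1.0 (full)
    vibration_strength: float = 1.0  # 0.0 (off) to 1.0 (strongest)


def adjust_for_emotion(settings: GuideSettings, emotion: str,
                       confidence: float) -> GuideSettings:
    """Soften the voice guide and vibration when the user appears stressed.

    The 0.6 threshold and the scaling factors are assumptions, not values
    taken from the disclosure.
    """
    if emotion == "stressed" and confidence >= 0.6:
        return replace(
            settings,
            speaking_rate=max(0.7, settings.speaking_rate * 0.85),
            volume=max(0.3, settings.volume * 0.8),
            vibration_strength=max(0.2, settings.vibration_strength * 0.5),
        )
    # Any other emotion, or a low-confidence estimate, leaves settings as-is.
    return settings
```

Ignoring low-confidence estimates is one possible way to keep the interface from changing abruptly when the emotion recognition result is uncertain.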
[0590] (Application Example 2)
[0591] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0592] For visually impaired users to enjoy games and entertainment with a sense of immersion, there is a need for technology that goes beyond simply supplementing audiovisual information and provides rich, interactive experiences that take into account the user's emotional state. However, current technologies do not make real-time adjustments in response to the user's emotions, resulting in a limited user experience.
[0593] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0594] In this invention, the server includes image recognition means for analyzing visual information in real time via a user interface, voice guide generation means for generating voice guides based on the analyzed visual information, and emotion analysis means for analyzing the user's emotional state and adjusting the voice guides and feedback based on the analysis results. This makes it possible to provide appropriate voice guides and haptic feedback according to the user's emotional state, realizing an immersive and intuitive interaction.
[0595] A "user interface" is a mechanism that provides a point of contact for users to interact with a system and exchange information.
[0596] "Visual information" refers to data related to vision, such as images and videos.
[0597] "Image recognition means" refers to technologies for automatically identifying specific features or objects from visual information.
[0598] A "voice guide generation means" is a method for providing instructions and guidance to a user through voice, based on analyzed visual information.
[0599] "Emotion analysis means" refers to technologies that estimate a user's emotional state from their facial expressions and voice, and then analyze that information.
[0600] "Voice input analysis means" refers to technology that analyzes voice commands received from a user and converts them into an appropriate digital format.
[0601] An "automatic operation means" is a mechanism for automatically performing specific actions or operations based on voice instructions or other input information.
[0602] "Audio playback means" refers to devices or technologies that allow users to listen to the generated audio guide.
[0603] A "tactile feedback means" is a method of providing physical tactile stimuli to a user and transmitting information.
[0604] This invention is a system for visually impaired users to comfortably enjoy games and entertainment. The system consists of a server and terminals, which exchange information with each other to provide the user with an immersive experience. Specifically, the invention will be implemented in the following form.
[0605] Server role
[0606] The server acquires visual information from external sources via the user interface and analyzes this information using image recognition. Based on this analysis, the voice guide generation system creates a situation-appropriate voice guide and sends it to the terminal. Furthermore, the server is equipped with emotion analysis capabilities, which analyze the user's voice and video sent from the terminal to understand the user's emotions. Based on this emotion data, it is possible to adjust the content of the voice guide and haptic feedback.
[0607] Terminal role
[0608] The terminal is equipped with audio playback and haptic feedback mechanisms, and provides the user with the audio guides sent from the server. The audio guides are produced by converting text to speech and deliver information to the user in real time. Furthermore, the terminal can provide feedback through the haptic feedback mechanism depending on specific game situations. For example, it can issue a warning using vibration when an enemy approaches.
[0609] User interaction
[0610] Users can give spoken instructions to the terminal, which processes them with a voice input analysis system. These instructions are sent to the server, and appropriate in-game actions are executed by an automated system. This allows users to experience flexible and intuitive interactions that respond to their emotional state.
[0611] Specific example
[0612] For example, if a user says to their device, "I want to play a fictional adventure game," the server receives this instruction and starts a suitable game. Also, if the user shows signs of anxiety, the server uses emotion analysis to detect this and sends a voice guide such as, "Take a deep breath and enjoy this scene."
[0613] Example of a prompt
[0614] "Please generate voice messages to provide emotionally resonant and gentle advice to visually impaired users."
[0615] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0616] Step 1:
[0617] The server acquires visual information from an external source. The acquired data is sent to an image recognition system to identify specific features or objects. The input is visual information, and the output is the identified in-game elements.
[0618] Step 2:
[0619] The server generates an audio guide based on the information identified in step 1. Using the audio guide generation means, it creates a guide in text format corresponding to the identified elements. This text data is converted into audio, and the generated audio is sent to the terminal. The input is the in-game elements, and the output is the audio guide.
[0620] Step 3:
[0621] The terminal provides the user with audio guides transmitted from the server via an audio playback device. The audio is played back through a speaker or headset, conveying information to the user in real time. The input is the audio guide, and the output is audio information.
[0622] Step 4:
[0623] The terminal activates haptic feedback mechanisms based on instructions from the server. This provides feedback to the user by generating vibrations or pressures appropriate to specific game situations. The input is a situation-specific feedback command, and the output is haptic information.
[0624] Step 5:
[0625] The user sends voice instructions to the terminal. Through a voice input analysis system, the voice data is converted to text and sent to the server. The input is the user's voice, and the output is the textualized instructions.
[0626] Step 6:
[0627] The server analyzes the user's instructions and automatically executes the corresponding in-game actions. This process ensures that user instructions are quickly reflected in the game. Input is text instructions, and output is game actions.
[0628] Step 7:
[0629] The server analyzes user emotion data acquired from the terminal. Using emotion analysis tools, it infers the user's emotions from voice and facial expression data and adjusts the content of voice guidance and haptic feedback accordingly. The input is emotion data, and the output is the adjusted guidance and feedback.
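The situation-specific feedback command of Step 4 could be represented as follows. This is a minimal sketch: the situation labels, pattern names, intensities, and durations in `HAPTIC_TABLE`, as well as the `HapticCommand` structure and the `haptic_for` function, are hypothetical and are not defined in the disclosure. The `scale` argument stands in for the emotion-based adjustment of Step 7.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HapticCommand:
    """Hypothetical feedback command sent from the server to the terminal."""
    pattern: str      # name of the vibration pattern
    intensity: float  # 0.0 (off) to 1.0 (strongest)
    duration_ms: int  # length of the vibration in milliseconds


# Assumed mapping from game situations to haptic commands.
HAPTIC_TABLE = {
    "enemy_approaching": HapticCommand("double_pulse", 0.9, 300),
    "item_nearby": HapticCommand("single_pulse", 0.4, 150),
    "action_timing": HapticCommand("short_tick", 0.6, 80),
}


def haptic_for(situation: str, scale: float = 1.0) -> Optional[HapticCommand]:
    """Return the command for a situation, scaled by an emotion-based factor.

    Returns None when no haptic feedback is defined for the situation.
    """
    base = HAPTIC_TABLE.get(situation)
    if base is None:
        return None
    scale = min(max(scale, 0.0), 1.0)  # clamp the factor to [0.0, 1.0]
    return HapticCommand(base.pattern, round(base.intensity * scale, 2),
                         base.duration_ms)
```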
[0630] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0631] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0632] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset-type terminal 314.
[0633] [Fourth Embodiment]
[0634] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0635] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0636] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0637] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0638] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0639] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0640] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0641] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0642] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0643] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0644] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0645] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0646] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0647] This invention is a system that enables visually impaired individuals to enjoy games in the same way as sighted individuals, and it consists of multiple functional modules. This system operates in cooperation with a server and terminals.
[0648] First, the server is equipped with image recognition capabilities to analyze real-time game screen data. These image recognition capabilities identify important elements within the game, such as characters, obstacles, and enemies, and digitize the position and status of each element. For example, in an action game, the server identifies the movement of enemies around the player character and determines which direction the user should focus their attention.
[0649] Next, the server uses the voice guide generation mechanism to generate voice guides appropriate to each situation based on the acquired data. These voice guides contain important information for the user and serve as a guide for action. For example, when an enemy approaches, specific instructions such as "Move to the left" are generated using speech synthesis.
[0650] This audio guide is transmitted to the terminal and provided to the user through an audio playback device. The user listens to the audio guide using a headset or similar device. Additionally, haptic feedback devices are activated as needed, providing intuitive information to the user through vibrations and other means. For example, when a specific action is required, the terminal uses vibration to indicate the timing.
[0651] Furthermore, users input game commands by voice via the terminal, which applies voice input analysis. When a user speaks instructions such as "attack" or "move right," the terminal recognizes them and sends the analysis results to the server. Based on this information, the server uses automated control mechanisms to immediately execute the corresponding action in the game.
[0652] This allows visually impaired users to understand the gameplay through audio and haptic feedback, without relying on visual information, and to have a direct and interactive experience. This system is particularly useful in situations where real-time responsiveness is required, such as in action games, as it can respond to rapidly changing game situations, thereby improving the user's gameplay experience.
[0653] The following describes the processing flow.
[0654] Step 1:
[0655] The server receives real-time screen data from the game. It captures the game's rendering frames and begins analysis using image recognition. Here, it identifies objects such as enemy characters, player characters, obstacles, and items, and extracts their positional information and movement patterns.
[0656] Step 2:
[0657] The server constructs an audio guide based on the analysis results. The audio guide generation system creates appropriate instructions according to the game situation and formats them as text data. For example, if an enemy is approaching from the right, it will create a guide that says, "An enemy is coming from the right. Please dodge." This text is then prepared to be converted into audio.
[0658] Step 3:
[0659] The server generates the audio guide and sends it to the terminal as a digital audio file. Here, text-to-speech technology is used to convert the text into audio data. The terminal receives this audio data and prepares to output the audio to the user.
[0660] Step 4:
[0661] The terminal conveys the received audio guidance to the user using an audio playback device. The user listens to the audio guidance through a headset or speaker to understand the game situation. Simultaneously, haptic feedback devices generate vibrations to convey urgent information or prompt specific actions.
[0662] Step 5:
[0663] The user issues voice commands through the terminal. If the user commands a specific action, such as "jump" or "attack," the voice is recorded by the terminal.
[0664] Step 6:
[0665] The terminal analyzes the user's voice commands using voice input analysis technology. The analysis results are sent to the server as digital commands. This converts the voice content into specific game instructions.
[0666] Step 7:
[0667] Based on the received commands, the server sends instructions to the game engine via automated control mechanisms to execute corresponding actions. This allows the in-game characters to immediately perform actions such as "jump" or "attack."
[0668] Step 8:
[0669] The server and terminal work together, repeating these steps in real time, enabling users to continuously and intuitively control and enjoy the game.
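Steps 1 and 2 above can be illustrated by a minimal sketch that compares an enemy's position in two consecutive frames and produces guide text only when the enemy is getting closer. The `Element` structure, the function names, and the use of screen coordinates with the origin at the top-left are assumptions made for illustration; the image recognition itself is outside the scope of this sketch.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    """Hypothetical recognition result for one on-screen element."""
    kind: str  # e.g. "player", "enemy", "item", "obstacle"
    x: float   # screen coordinates, origin assumed at top-left
    y: float


# Phrases used in the guide text for each relative direction.
PHRASES = {"right": "from the right", "left": "from the left",
           "above": "from above", "below": "from below"}


def direction_of(player: Element, other: Element) -> str:
    """Return the dominant direction of `other` as seen from `player`."""
    dx = other.x - player.x
    dy = other.y - player.y
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "below" if dy > 0 else "above"  # y grows downward on screen


def guide_for_enemy(player: Element, prev_enemy: Element,
                    curr_enemy: Element) -> Optional[str]:
    """Return guide text if the enemy moved closer between two frames."""
    before = math.dist((player.x, player.y), (prev_enemy.x, prev_enemy.y))
    now = math.dist((player.x, player.y), (curr_enemy.x, curr_enemy.y))
    if now >= before:
        return None  # not approaching, so no warning is generated
    phrase = PHRASES[direction_of(player, curr_enemy)]
    return f"An enemy is coming {phrase}. Please dodge."
```

The returned text would then be passed to speech synthesis in Step 3, and the same condition could be used to trigger the vibration in Step 4.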
[0670] (Example 1)
[0671] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0672] The challenge is to provide an environment where visually impaired individuals can enjoy games that require complex real-time interaction without relying on visual information. It is necessary to establish effective means for users who have difficulty directly obtaining visual information to grasp the game situation and take appropriate actions.
[0673] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0674] In this invention, the server includes data analysis means, information conversion means, and input analysis means. This converts visual information into audio and haptic feedback, enabling the user to play the game in real time.
[0675] "Data analysis means" refers to a device or software that has the function of acquiring visual information in real time, analyzing that information, and converting it into meaningful data.
[0676] An "information conversion means" is a device or software that generates information to be conveyed to the user based on analyzed data and provides it primarily in the form of sound or haptic feedback.
[0677] "Input analysis means" refers to a device or software that receives voice instructions from a user, analyzes their content, and converts them into a format suitable for control within the system.
[0678] "Control means" refers to a device or software that has the function of performing a specified action based on instructions from the user.
[0679] "Playback means" refers to a device or software that actually outputs the generated audio guide as audio.
[0680] A "feedback mechanism" is a device or software that has the function of providing feedback to the user through touch, conveying intuitive information to the user through vibrations or other means.
[0681] This invention is a system that enables visually impaired individuals to enjoy games without relying on their vision. The server and terminal work together to provide a comfortable interactive environment for the user.
[0682] The server acquires real-time screen data from the game and analyzes it using data analysis tools. Specifically, it uses image recognition software (e.g., a general-purpose library) to identify elements in the game such as characters, obstacles, and enemies. The obtained information is converted into audio guidance by an information conversion tool, and instructions are generated for the user using speech synthesis software such as Google Text-to-Speech. For example, if a character is in danger, it might create an audio guide saying, "Move to the left to avoid danger."
[0683] This audio guide is transmitted to the terminal and provided to the user through the terminal's playback means. The user listens to it through an audio device such as a headset. The terminal also has a feedback mechanism that vibrates according to instructions from the server to communicate in-game timing and actions to the user. The terminal vibrates when a specific enemy approaches, allowing the user to intuitively understand which direction to focus their attention.
[0684] Users can give instructions to the terminal using voice commands. These voice commands are analyzed by the input analysis means and converted into text data. This text is then sent to the server, where the required actions in the game are executed through an automated control function. For example, if a user says "jump," the server immediately reflects that command in the game.
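The conversion from recognized text to an in-game action described above can be sketched as a simple phrase lookup. The vocabulary in `COMMANDS`, the action identifiers, and the function `parse_command` are hypothetical; speech recognition is assumed to have already produced the text.

```python
from typing import Optional

# Assumed command vocabulary mapping spoken phrases to action identifiers.
COMMANDS = {
    "jump": "JUMP",
    "attack": "ATTACK",
    "move left": "MOVE_LEFT",
    "move right": "MOVE_RIGHT",
}


def parse_command(recognized_text: str) -> Optional[str]:
    """Map recognized speech text to an action identifier, or None.

    Matching is by simple substring search, so it does not distinguish
    a command word from a longer word that happens to contain it.
    """
    # Lowercase and collapse repeated whitespace before matching.
    text = " ".join(recognized_text.lower().split())
    # Check longer phrases first so multi-word commands take priority.
    for phrase in sorted(COMMANDS, key=len, reverse=True):
        if phrase in text:
            return COMMANDS[phrase]
    return None
```

For example, `parse_command("Jump!")` yields `"JUMP"`, while text that contains no known phrase yields `None`, in which case the server would take no action.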
[0685] This system allows users to enjoy games in real time. The generative AI model generates prompts tailored to user needs, improving operational efficiency and the user experience. An example of a prompt might be, "Designing a real-time audio-guided game system for the visually impaired."
[0686] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0687] Step 1:
[0688] The server acquires screen data from the game in real time. The input is screen information received through the game's API, and the output is image data that can be analyzed. The server receives this image data for use in the next step and stores it in a processing queue.
[0689] Step 2:
[0690] The server uses data analysis tools to analyze the acquired image data. The input is the image data acquired in step 1, and the output is elemental information such as characters, obstacles, and enemies in the game. An image recognition library (e.g., a general-purpose library) is used to identify each element and extract its position and movement as numerical data.
[0691] Step 3:
[0692] Based on the analysis results, the server generates an audio guide using an information conversion mechanism. The input is the elemental information obtained in step 2, and the output is the audio guide text provided to the user. The server uses speech synthesis software such as Google Text-to-Speech to generate specific instructions in audio format, such as "You are approaching an obstacle. Move to the left."
[0693] Step 4:
[0694] The terminal receives audio guidance transmitted from the server and provides it to the user via a playback mechanism. The input is audio guidance data from the server, and the output is the audio transmitted to the user. Here, the terminal plays the audio through the headset and, if necessary, uses its vibration function to convey information as haptic feedback.
[0695] Step 5:
[0696] The user inputs commands into the terminal by voice. The input is the user's voice instructions, and the output is command data in text format. The terminal analyzes this voice using the input analysis means and prepares to send the result to the server. For example, a command such as "jump" is extracted from the voice.
[0697] Step 6:
[0698] The server receives the voice analysis results and executes in-game actions using its automated control function. The input is the text command data from step 5, and the output is the execution of the action in the game. The server processes this in real time and sends instructions to make the avatar move according to the user's commands.
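Step 1 of this flow stores image data in a processing queue. One possible form of such a queue is sketched below, under the assumption (not stated in the disclosure) that the queue is bounded and that stale frames are discarded so that analysis always works on the newest frame. The class `FrameQueue` and its methods are hypothetical.

```python
from collections import deque
from typing import Optional, Tuple


class FrameQueue:
    """Bounded queue of (frame_id, image_bytes) pairs for real-time analysis."""

    def __init__(self, max_frames: int = 3) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._frames: deque = deque(maxlen=max_frames)

    def push(self, frame_id: int, image: bytes) -> None:
        """Store a newly captured frame."""
        self._frames.append((frame_id, image))

    def pop_latest(self) -> Optional[Tuple[int, bytes]]:
        """Return the newest frame and discard older ones, or None if empty."""
        if not self._frames:
            return None
        latest = self._frames[-1]
        self._frames.clear()
        return latest

    def __len__(self) -> int:
        return len(self._frames)
```

Discarding older frames trades completeness for latency, which seems consistent with the real-time requirement described here, but other queueing policies are equally possible.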
[0699] (Application Example 1)
[0700] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0701] There is a need to provide means of support that enable visually impaired people to move more safely and efficiently in their daily lives. In particular, systems that can respond quickly to changes in the environment and obstacles are necessary.
[0702] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0703] In this invention, the server includes an image analysis means for analyzing environmental information in real time via a user interface, a voice instruction generation means for generating voice instructions based on the analyzed visual information, and a voice analysis means for receiving, analyzing, and converting voice instructions from the user into action commands. This makes it possible for visually impaired people to understand their surroundings and select appropriate actions.
[0704] A "user interface" is a means for a user to interact with a system, and is a physical or logical component for inputting and outputting information.
[0705] "Environmental information" refers to information that influences user behavior, such as the surrounding environment, the location of obstacles, and the movement path.
[0706] "Real-time analysis" refers to a processing method that immediately handles the ongoing situation and makes the results immediately available.
[0707] "Image analysis means" refers to technical means for analyzing visual data obtained from cameras and sensors and extracting important information.
[0708] "Voice instruction generation means" refers to a technical means for generating voice messages that instruct the user to take action based on analyzed data.
[0709] "Voice analysis means" refers to the process of analyzing voice commands received from a user and understanding their content.
[0710] "Automatic control means" refers to technical means that automatically manage and control the operation of a system or device based on user input and environmental information.
[0711] The system necessary to implement this invention primarily involves the user, server, and terminal each playing a specific role. The server uses a 360-degree camera and a LiDAR sensor to capture information about the surrounding environment in real time. This allows it to acquire visual information about the environment in which the user is located, and analyze this data using OpenCV. Based on the information obtained from the analysis, the voice instruction generation means generates voice instructions using a TTS (Text-to-Speech) engine.
[0712] The terminal is a device for presenting these generated voice instructions to the user. The terminal is equipped with sound playback means, which are used to clearly convey the generated voice instructions to the user. Therefore, the user can receive the information immediately. In addition, the terminal has haptic feedback means, which can provide tactile information to the user using a vibration motor or similar device.
[0713] Users can input voice commands through a terminal. A voice analysis system then analyzes the user's instructions and converts them into appropriate action commands. These commands are sent to a server, and an automated control system executes the appropriate actions based on that information. This enables visually impaired individuals to understand their environment and move quickly and safely.
[0714] A concrete example is a scenario where a visually impaired person needs to walk safely in a city. When the server scans the surroundings and detects that the sidewalk is under construction, the voice instruction generation system generates specific instructions such as "Take the narrow path to the right," helping the user to safely detour. In this way, information can be provided immediately, prompting action.
[0715] Examples of prompts include, "Please have the robot guide you to the optimal detour route in real time based on visual information," and "Please have the robot perform the necessary actions based on the user's voice commands." This allows for maximum utilization of the system's capabilities.
[0716] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0717] Step 1:
[0718] The server continuously captures information about the surrounding environment using a 360-degree camera and LiDAR sensors. This input data is sent to the server as visual data. The server uses OpenCV to analyze this data and identify important elements such as pedestrians, obstacles, and direction of movement. The analysis yields data about the position and movement of each element in the environment.
[0719] Step 2:
[0720] The server generates voice instructions using a TTS engine based on the analyzed visual information. For example, if it detects an obstacle ahead, it will generate a voice instruction such as "Move to the right." The input is the server's analyzed data, and the output is voice data presented to the user.
[0721] Step 3:
[0722] The terminal receives audio data transmitted from the server and uses an audio playback device to present voice instructions to the user. In this case, the terminal's role is to convey the voice instructions to the user in real time. The output is the voice instructions that the user actually hears.
[0723] Step 4:
[0724] The terminal simultaneously utilizes haptic feedback mechanisms, such as vibration motors, to provide tactile information to the user. This allows the terminal to warn the user of situations requiring particular attention. The input is a control command received from the server, and the output is the vibration action performed by the terminal.
[0725] Step 5:
[0726] The user makes decisions and chooses actions based on the voice instructions, and issues voice commands to the terminal as needed. These voice commands become the input to the terminal in the next step.
[0727] Step 6:
[0728] The terminal receives voice input from the user and interprets its content using voice analysis tools. The analysis results are sent to the server. The input is the user's voice command, and the output is communication data to the server.
[0729] Step 7:
[0730] The server updates the system's operation and instructions using automated control mechanisms based on the analysis results. This adjusts the overall system behavior according to user requests. The input is voice analysis data from the terminal, and the output is the updated control commands.
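Steps 1 to 4 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Detection` record, the 2-metre warning distance, and the rule for choosing a side are all hypothetical and introduced only for this sketch. The actual image analysis (for example with OpenCV) and the TTS engine are outside its scope.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Detection:
    """One analyzed element (hypothetical record; not defined in the disclosure)."""
    label: str          # e.g. "obstacle", "pedestrian"
    offset_m: float     # lateral offset from the user's path; negative = left
    distance_m: float   # distance ahead of the user


WARN_DISTANCE_M = 2.0   # assumed threshold for triggering a vibration warning


def make_guidance(detections: List[Detection]) -> Tuple[Optional[str], bool]:
    """Return (instruction text for the TTS engine, whether to vibrate)."""
    obstacles = [d for d in detections if d.label == "obstacle"]
    if not obstacles:
        return None, False
    nearest = min(obstacles, key=lambda d: d.distance_m)
    # Steer away from the side on which the nearest obstacle lies.
    side = "right" if nearest.offset_m <= 0 else "left"
    text = f"Move to the {side}."
    vibrate = nearest.distance_m <= WARN_DISTANCE_M
    return text, vibrate
```

For example, an obstacle slightly to the left and 1.5 m ahead yields the text "Move to the right." together with a vibration request, which corresponds to the voice output of Step 3 and the haptic output of Step 4.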
[0731] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0732] This invention provides a system that enables visually impaired users to enjoy games in an immersive way, and in particular, by combining it with an emotion engine, it provides adaptive interaction that takes the user's emotions into consideration.
[0733] The system operates primarily through a server and a terminal. First, the server acquires real-time screen data from the game, analyzes it using image recognition, and identifies important elements within the game. Based on this analysis, an audio guide generation system creates situation-appropriate audio guides. The generated guides are converted from text to audio and sent from the server to the terminal.
[0734] The terminal presents the received audio guidance to the user through its audio playback means, so the user can receive information via a headset or speaker. The terminal also uses haptic feedback means to provide feedback tailored to specific game situations. For example, if an enemy is approaching, the terminal vibrates to warn the user.
[0735] Furthermore, the system incorporates an emotion engine, allowing the device to recognize emotions in real time through the user's facial expressions and tone of voice. The recognized emotion data is sent to a server, which then adjusts the content of the voice guide and the settings of the haptic feedback based on that information. For example, if the user is feeling stressed, the tone of the voice guide can be made calmer, or the feedback can be made gentler.
[0736] Users can give voice commands for in-game actions, and a voice input analysis system analyzes these commands and sends them to the server. The server then uses an automated system to execute the in-game actions based on this information. This enables interaction that takes user emotions into account, resulting in comfortable and intuitive gameplay. The system iterates through these processes in real time, responding flexibly to user emotions and game progress.
[0737] The following describes the processing flow.
[0738] Step 1:
[0739] The server captures real-time screen data from the game and begins analysis through image recognition. This process identifies characters, obstacles, background information, etc., within the game, and based on this, acquires location information and situational data.
[0740] Step 2:
[0741] The server activates the audio guide generation system based on the analyzed data and generates the necessary audio guides as text. These guides include practical instructions and feedback for the player. For example, instructions such as "There is an enemy at 3 o'clock" might be created.
[0742] Step 3:
[0743] The server converts the generated audio guide into audio data and sends it to the terminal as a digital audio file. The file is formatted for playback on the terminal.
[0744] Step 4:
[0745] The terminal provides the user with received audio data using an audio playback device. The user receives this data through a headset and can obtain verbal instructions. Simultaneously, haptic feedback devices provide additional information through vibrations as needed.
[0746] Step 5:
[0747] The device activates an emotion engine and analyzes the user's facial expressions and tone of voice to recognize the user's emotional state in real time. This information is used to further improve the user's gameplay experience.
[0748] Step 6:
[0749] The recognized emotion data is sent to a server, which then adjusts the tone and content of the voice guide based on that information. It also changes the intensity and pattern of haptic feedback as needed, customizing the experience to match the user's emotions.
[0750] Step 7:
[0751] The user inputs voice commands into the terminal. For example, instructions such as "jump" or "attack" are spoken aloud. The terminal receives the voice and converts it into text using a voice input analysis device.
[0752] Step 8:
[0753] The terminal sends the analyzed voice commands to the server, and upon receiving this information, the server uses automated control mechanisms to execute the corresponding actions within the game. The server's instructions enable the character's specific movements to be realized in real time.
[0754] Step 9:
[0755] By repeating these steps in sequence, users can intuitively control and enjoy the game while responding to emotional changes. An adaptive system based on user feedback ensures a comfortable gameplay experience.
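A guide such as "There is an enemy at 3 o'clock" (Step 2) implies converting an element's recognized position into a clock direction. The following is a minimal sketch of one way to do this. It assumes screen coordinates in which x grows rightward and y grows downward, and it assumes that 12 o'clock means "up the screen". The function names are hypothetical and do not come from the disclosure.

```python
import math


def clock_direction(player_xy, target_xy):
    """Convert a target's position relative to the player into a clock hour (1-12).

    Assumes screen coordinates (x grows rightward, y grows downward) and that
    12 o'clock means "up the screen"; both are assumptions of this sketch.
    """
    dx = target_xy[0] - player_xy[0]
    dy = player_xy[1] - target_xy[1]          # flip so that up is positive
    angle = math.degrees(math.atan2(dx, dy))  # 0 deg = up, clockwise positive
    hour = round((angle % 360) / 30) % 12
    return 12 if hour == 0 else hour


def guide_text(label, player_xy, target_xy):
    """Build the guide sentence that would be handed to speech synthesis."""
    return f"There is {label} at {clock_direction(player_xy, target_xy)} o'clock."
```

With the player at (100, 100) and an enemy at (150, 100), `guide_text("an enemy", (100, 100), (150, 100))` produces the example sentence from Step 2.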
[0756] (Example 2)
[0757] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0758] It is difficult for visually impaired users to effectively enjoy computer games that typically rely on visual interaction. In particular, they require interfaces that appropriately adjust according to the game's progress and the player's emotional state, but no existing systems can achieve this. Therefore, there is a need for a system that considers the user's emotions, provides real-time in-game information, and allows the user to operate it intuitively.
[0759] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0760] In this invention, the server includes an image recognition means for analyzing visual information in real time, a means for generating audio guides and performing emotion recognition based on the analyzed visual information, and a means for receiving and analyzing voice instructions from the user. This allows visually impaired users to immerse themselves in the game while reducing emotional burden, and enables the user interface to flexibly adapt to the user's instantaneous emotions and instructions.
[0761] A "user interface" refers to the means or methods by which a user interacts with a system or application.
[0762] "Visual information" refers to all data that can be represented as images or videos, and this includes characters, backgrounds, and actions within a game.
[0763] "Real-time" refers to processing or responding within the actual time frame, with virtually no delay.
[0764] "Image recognition means" refers to technology that detects specific features from digital images and videos and identifies them.
[0765] "Voice guide generation means" refers to the technology and equipment used to generate necessary voice guidance based on analyzed information.
[0766] "Emotion recognition means" refers to methods and technologies for identifying an emotional state by analyzing a user's facial expressions, tone of voice, and reactions.
[0767] "Voice input analysis means" refers to technology that receives voice instructions from a user, analyzes the content, and understands its meaning.
[0768] "Automated operation means" refers to devices or technologies that automatically perform certain procedures or operations based on input instructions.
[0769] "Audio playback means" refers to the technology or device used to play back generated audio to the user.
[0770] "Haptic feedback means" refers to technologies that provide users with tactile responses and stimuli, complementing the interaction experience.
[0771] This invention provides a system that enables visually impaired users to comfortably enjoy interactive computer games. Specific embodiments of this system are described below.
[0772] The system's core consists of servers and terminals.
[0773] First, the server acquires real-time screen data from the game and processes this visual information using image recognition libraries (e.g., OpenCV, TensorFlow). The server analyzes the image using the image recognition means and identifies important elements within the game. Based on this analysis, the server uses an audio guide generation means to generate situational audio guides. Speech synthesis technology (e.g., Google Cloud Text-to-Speech) is used to generate the audio.
[0774] The generated audio guide is converted from text to audio and sent from the server to the terminal. The terminal then provides this to the user through an audio playback device. The audio is played back through a headset or speaker. The terminal also uses haptic feedback to provide feedback to the user according to specific game situations. For example, if an enemy is approaching, the terminal uses vibration to warn the user.
[0775] Furthermore, the system incorporates an emotion engine, allowing the device to recognize emotions in real time through the user's facial expressions and voice tone. This recognition utilizes facial recognition and voice analysis technologies (e.g., Face API, IBM Watson Tone Analyzer). The recognized emotion data is sent to a server, which uses this information to adaptively adjust the tone of the voice guide and haptic feedback.
[0776] For example, if a user is feeling stressed, the tone of the voice guidance can be softened, and the haptic feedback can be adjusted to be more gradual. Users can also give voice commands for in-game actions. These voice commands are analyzed by the device's voice input analysis system, and the instructions are sent to the server. The server then uses automated control systems to execute the in-game actions based on these commands.
[0777] A concrete example of a prompt is, "Generate an audio guide that matches the user's current emotional state to reduce the stress they experience." Following this prompt, the generative AI model provides an interface suitable for the user. As a result, this system makes it possible for visually impaired users to enjoy an immersive gaming experience.
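As an illustration of how such a prompt might be assembled before being passed to the generative AI model, the sketch below combines the example instruction with the estimated emotion and the recognized game events. The function name, the field labels, and the layout of the prompt are assumptions made for this sketch. The disclosure does not specify a prompt format.

```python
from typing import List

# Instruction text taken from the example prompt in the description.
BASE_INSTRUCTION = (
    "Generate an audio guide that matches the user's current emotional state "
    "to reduce the stress they experience."
)


def build_guide_prompt(emotion: str, game_events: List[str]) -> str:
    """Combine the instruction, the estimated emotion, and recognized events."""
    lines = [
        BASE_INSTRUCTION,
        f"Estimated user emotion: {emotion}",
        "Recognized game events:",
    ]
    lines += [f"- {event}" for event in game_events]
    return "\n".join(lines)
```

The resulting string would then be supplied to the model together with any inference data. How the model is invoked is left open here.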
[0778] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0779] Step 1:
[0780] The server takes real-time game screen data as input. This input data is analyzed with image recognition libraries (e.g., OpenCV, TensorFlow) to identify important elements within the game (such as enemy characters and items). The output is a list of the identified elements and their locations. This information is then used in the next step within the server.
[0781] Step 2:
[0782] The server uses the analysis results obtained in Step 1 as input to create an audio guide using an audio guide generation system. Specifically, it generates instructions and explanations in text format that correspond to the in-game situation, and then converts this into speech using speech synthesis technology (e.g., Google Cloud Text-to-Speech). The output is the generated audio data, which is then ready to be presented to the user.
[0783] Step 3:
[0784] The server processes the generated audio data for transmission to the terminal. The audio data is transmitted in real time over a stable network. The input is the audio data generated in step 2, and the output is the audio communication data delivered to the terminal.
[0785] Step 4:
[0786] The terminal receives audio data from the server as input and provides audio guidance to the user via an audio playback device (e.g., a headset or speaker). Specifically, the terminal triggers the start of audio playback and automatically adjusts the volume and sound quality as needed. The output is the audio guidance that the user hears.
[0787] Step 5:
[0788] The device uses vibration motors and other mechanisms to provide haptic feedback to the user, depending on the game situation. Specifically, it will vibrate when an enemy approaches, for example. The input is game situation information from the server, and the output is the haptic stimulus felt by the user.
[0789] Step 6:
[0790] The device uses the user's face and voice as input and analyzes their emotions in real time using emotion recognition technology. Specifically, it collects data using a face recognition camera and microphone, and identifies the user's emotional state using an analysis algorithm (e.g., Face API, voice tone analysis technology). The output is the user's emotion data, which is used in the next step.
[0791] Step 7:
[0792] Based on the emotional data obtained in step 6, the server dynamically adjusts the content of the voice guidance and haptic feedback. For example, if the user is stressed, the tone of the voice guidance may be softened or the vibrations may be reduced. The input is the user's emotional data, and the output is the adjusted interface settings.
[0793] Step 8:
[0794] The user controls in-game actions by inputting voice commands. The input voice commands are analyzed by the terminal's voice input analysis system and sent to the server. Based on this information, the server uses an automated control system to execute in-game actions. The output is the action actually performed in the game based on the user's instructions.
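The adjustment in Step 7 can be pictured as a function from the current interface settings and the estimated emotion to new settings. The sketch below is illustrative only: the `InterfaceSettings` fields, the emotion label "stressed", and the scaling factors are assumptions, chosen to mirror the example of softening the voice tone and reducing the vibrations.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InterfaceSettings:
    """Hypothetical settings record; field names are not from the disclosure."""
    voice_tone: str = "neutral"      # passed on to the speech synthesizer
    speech_rate: float = 1.0         # 1.0 = normal speed
    vibration_strength: float = 1.0  # 0.0 (off) to 1.0 (full)


def adjust_for_emotion(settings: InterfaceSettings, emotion: str) -> InterfaceSettings:
    """Return settings adapted to the estimated emotion (rules are illustrative)."""
    if emotion == "stressed":
        return replace(
            settings,
            voice_tone="calm",
            speech_rate=settings.speech_rate * 0.9,
            vibration_strength=settings.vibration_strength * 0.5,
        )
    # Other emotions leave the settings unchanged in this sketch.
    return settings
```

Returning a new record rather than mutating the old one makes it easy for the server to keep the previous settings and revert once the user's emotional state changes.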
[0795] (Application Example 2)
[0796] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0797] For visually impaired users to enjoy games and entertainment with a sense of immersion, there is a need for technology that goes beyond simply supplementing audiovisual information and provides rich, interactive experiences that take into account the user's emotional state. However, current technologies do not make real-time adjustments in response to the user's emotions, resulting in a limited user experience.
[0798] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0799] In this invention, the server includes image recognition means for analyzing visual information in real time via a user interface, voice guide generation means for generating voice guides based on the analyzed visual information, and emotion analysis means for analyzing the user's emotional state and adjusting the voice guides and feedback based on the analysis results. This makes it possible to provide appropriate voice guides and haptic feedback according to the user's emotional state, realizing an immersive and intuitive interaction.
[0800] A "user interface" is a mechanism that provides a point of contact for users to interact with a system and exchange information.
[0801] "Visual information" refers to data related to vision, such as images and videos.
[0802] "Image recognition means" refers to technologies for automatically identifying specific features or objects from visual information.
[0803] A "voice guide generation means" is a method for providing instructions and guidance to a user through voice, based on analyzed visual information.
[0804] "Emotion analysis means" refers to technology that estimates a user's emotional state by analyzing their facial expressions and voice.
[0805] "Voice input analysis means" refers to technology that analyzes voice commands received from a user and converts them into an appropriate digital format.
[0806] An "automatic operation means" is a mechanism for automatically performing specific actions or operations based on voice instructions or other input information.
[0807] "Audio playback means" refers to devices or technologies that allow users to listen to the generated audio guide.
[0808] A "haptic feedback means" is a method of conveying information to a user through physical tactile stimuli.
[0809] This invention is a system for visually impaired users to comfortably enjoy games and entertainment. The system consists of a server and terminals, which exchange information with each other to provide the user with an immersive experience. Specifically, the invention will be implemented in the following form.
[0810] Server role
[0811] The server acquires visual information from external sources via the user interface and analyzes this information using image recognition. Based on this analysis, the voice guide generation system creates a situation-appropriate voice guide and sends it to the terminal. Furthermore, the server is equipped with emotion analysis capabilities, which analyze the user's voice and video sent from the terminal to understand the user's emotions. Based on this emotion data, it is possible to adjust the content of the voice guide and haptic feedback.
[0812] Terminal role
[0813] The terminal is equipped with audio playback means and haptic feedback means, and presents the audio guides sent from the server to the user. The audio guides are produced by converting text to speech and are delivered to the user in real time. Furthermore, the terminal can provide feedback through the haptic feedback means depending on specific game situations. For example, it can issue a warning using vibration when an enemy approaches.
[0814] User interaction
[0815] Users can input their voices as instructions into the device through a voice input analysis system. These instructions are sent to a server, and appropriate in-game actions are executed by an automated system. This allows users to experience flexible and intuitive interactions that respond to their emotional state.
[0816] Specific example
[0817] For example, if a user says to their device, "I want to play a fictional adventure game," the server receives this instruction and starts a suitable game. Also, if the user shows signs of anxiety, the server uses emotion analysis to detect this and sends a voice guide such as, "Take a deep breath and enjoy this scene."
[0818] Example of a prompt
[0819] "Please generate voice messages to provide emotionally resonant and gentle advice to visually impaired users."
[0820] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0821] Step 1:
[0822] The server acquires visual information from an external source. The acquired data is sent to an image recognition system to identify specific features or objects. The input is visual information, and the output is the identified in-game elements.
[0823] Step 2:
[0824] The server generates an audio guide based on the information identified in step 1. Using the audio guide generation means, it creates a guide in text format corresponding to the identified elements. This text data is converted into audio, and the generated audio is sent to the terminal. The input is the in-game elements, and the output is the audio guide.
[0825] Step 3:
[0826] The terminal provides the user with audio guides transmitted from the server via an audio playback device. The audio is played back through a speaker or headset, conveying information to the user in real time. The input is the audio guide, and the output is audio information.
[0827] Step 4:
[0828] The terminal activates haptic feedback mechanisms based on instructions from the server. This provides feedback to the user by generating vibrations or pressures appropriate to specific game situations. The input is a situation-specific feedback command, and the output is haptic information.
[0829] Step 5:
[0830] The user sends voice instructions to the terminal. Through a voice input analysis system, the voice data is converted to text and sent to the server. The input is the user's voice, and the output is the textualized instructions.
[0831] Step 6:
[0832] The server analyzes the user's instructions and automatically executes the corresponding in-game actions. This process ensures that user instructions are quickly reflected in the game. The input is the text instructions, and the output is the game actions.
[0833] Step 7:
[0834] The server analyzes user emotion data acquired from the terminal. Using emotion analysis tools, it infers the user's emotions from voice and facial expression data and adjusts the content of voice guidance and haptic feedback accordingly. The input is emotion data, and the output is the adjusted guidance and feedback.
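Steps 5 and 6 (voice converted to text, then text mapped to an in-game action) can be sketched as a simple keyword dispatch. This is a minimal sketch and not the disclosed method. `GameController` is a hypothetical stand-in for the automatic operation means that only records actions, and the commands "jump" and "attack" are borrowed from the examples given elsewhere in the description. Speech-to-text itself is assumed to have already happened.

```python
from typing import Callable, Dict, List, Optional


class GameController:
    """Stand-in for the automatic operation means; it only records actions."""

    def __init__(self) -> None:
        self.performed: List[str] = []

    def jump(self) -> None:
        self.performed.append("jump")

    def attack(self) -> None:
        self.performed.append("attack")


def execute_command(text: str, game: GameController) -> Optional[str]:
    """Map recognized command text to an in-game action; return its name or None."""
    actions: Dict[str, Callable[[], None]] = {
        "jump": game.jump,
        "attack": game.attack,
    }
    for word in text.lower().split():
        word = word.strip(".,!?")   # drop punctuation added by the recognizer
        if word in actions:
            actions[word]()         # execute the first matching action
            return word
    return None                     # no known command in the utterance
```

A real system would likely need more robust language understanding than keyword matching. The table-driven form is used here only because it makes the mapping from instruction to action explicit.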
[0835] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0836] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0837] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0838] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0839] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions located there. Toward the outside of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0840] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0841] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion is located, the more visible it becomes (that is, the more it is expressed in actions).
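The layout described for the emotion map 400 (left versus right, upper versus lower, inside versus outside) can be expressed as a small classification over coordinates. The sketch below assumes a coordinate system centered on the map with x increasing to the right and y increasing upward, and a normalized outer radius. These conventions, the returned field names, and the "neutral" label on the horizontal axis are assumptions of this sketch and are not given in the disclosure.

```python
import math
from typing import Dict


def describe_map_position(x: float, y: float, radius: float = 1.0) -> Dict[str, object]:
    """Classify a point on an emotion map centered at (0, 0).

    x > 0 is the right half, y > 0 the upper half; `radius` is the outermost
    circle. The coordinate convention is an assumption of this sketch.
    """
    r = min(math.hypot(x, y) / radius, 1.0)
    return {
        # Left: reactions in the brain; right: situational judgment;
        # on the vertical axis (above/below the center): both.
        "origin": "situation" if x > 0 else "reaction" if x < 0 else "both",
        # Upper side: pleasure; lower side: displeasure.
        "valence": "pleasure" if y > 0 else "displeasure" if y < 0 else "neutral",
        # 0.0 = inner thought, 1.0 = fully expressed in action.
        "expression": r,
    }
```

For instance, a point on the right edge of the map at the 3 o'clock position is classified as situation-driven and fully expressed in action.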
[0842] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
[0843] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0844] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
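The disclosure states only that the network is trained so that emotions located close together have similar values. It does not say how the training targets are built. One possible construction, shown purely as an assumption, is to let the target value of each emotion decay with its map distance from the labeled emotion using a Gaussian kernel. The coordinates below are hypothetical placeholders, since no numeric positions are given for the emotion map.

```python
import math
from typing import Dict, Tuple

# Hypothetical map coordinates; the disclosure does not give numeric positions.
EMOTION_POSITIONS: Dict[str, Tuple[float, float]] = {
    "reassured": (0.50, 0.10),
    "calm": (0.55, 0.05),
    "confident": (0.60, 0.15),
    "anxious": (0.50, -0.60),
}


def soft_targets(labeled: str, sigma: float = 0.2) -> Dict[str, float]:
    """Emotion values that decay with map distance from the labeled emotion.

    The Gaussian kernel and `sigma` are assumptions of this sketch, not a
    technique named in the disclosure.
    """
    cx, cy = EMOTION_POSITIONS[labeled]
    return {
        name: math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in EMOTION_POSITIONS.items()
    }
```

Under these placeholder coordinates, a sample labeled "calm" receives high target values for "reassured" and "confident" as well, and a value near zero for "anxious", which matches the qualitative behavior described for Figure 10.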
[0845] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0846] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0847] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0848] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0849] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0850] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0851] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0852] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0853] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0854] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of those aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements may be added, or elements may be replaced in the descriptions and illustrations presented above, as long as doing so does not deviate from the essence of the technical aspects of this disclosure. Furthermore, to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that require no special explanation to enable implementation of these aspects have been omitted from the descriptions and illustrations presented above.
[0855] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0856] The following is further disclosed regarding the embodiments described above.
[0857] (Claim 1)
[0858] An image recognition means that analyzes visual information in real time via a user interface,
[0859] An audio guide generation means that generates an audio guide based on the analyzed visual information,
[0860] A voice input analysis means that receives voice instructions from the user, analyzes them, and converts them into in-game actions, and
[0861] An automatic control means that performs the in-game actions based on the voice instructions,
[0862] A system comprising all of the above means.
[0863] (Claim 2)
[0864] The system according to claim 1, further comprising an audio playback means for presenting the generated audio guide to the user.
[0865] (Claim 3)
[0866] The system according to claim 1, comprising haptic feedback means for providing tactile feedback to the user.
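The flow recited in claims 1 to 3 above (recognize the screen, generate an audio guide, analyze the user's voice instruction, perform the resulting in-game action) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in and not part of the disclosure: the `SceneInfo` structure, the `COMMANDS` phrase table, the frame dictionary, and the keyword matching are assumptions made only to show how the four means could be chained. A real system would use an image recognition model and a speech recognizer in their place.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SceneInfo:
    """Hypothetical result of analyzing one game frame."""
    objects: list[str]


def recognize_image(frame: dict) -> SceneInfo:
    # Stand-in for the image recognition means: here the "frame" is a
    # dict that already lists the objects a vision model would detect.
    return SceneInfo(objects=list(frame.get("objects", [])))


def generate_audio_guide(scene: SceneInfo) -> str:
    # Stand-in for the audio guide generation means: returns the text
    # that an audio playback means would speak to the user.
    if not scene.objects:
        return "Nothing detected."
    return "Detected: " + ", ".join(scene.objects) + "."


# Hypothetical mapping from spoken phrases to in-game actions.
COMMANDS = {"jump": "JUMP", "attack": "ATTACK", "move left": "MOVE_LEFT"}


def analyze_voice_input(utterance: str) -> Optional[str]:
    # Stand-in for the voice input analysis means: simple keyword
    # matching on already-transcribed speech.
    text = utterance.lower().strip()
    for phrase, action in COMMANDS.items():
        if phrase in text:
            return action
    return None


def run_step(frame: dict, utterance: str,
             perform_action: Callable[[str], None]) -> str:
    # One pass through the pipeline: guide the user, then act on the
    # voice instruction via the automatic control callback.
    guide = generate_audio_guide(recognize_image(frame))
    action = analyze_voice_input(utterance)
    if action is not None:
        perform_action(action)
    return guide
```

For example, calling `run_step({"objects": ["enemy", "ledge"]}, "Please jump now", actions.append)` with an empty list `actions` returns the guide text for the two detected objects and appends `"JUMP"` to the list. A haptic feedback means as in claim 3 could be attached in the same way, as a second callback.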
[0867] "Example 1"
[0868] (Claim 1)
[0869] A data analysis means that analyzes visual information in real time via a user interface,
[0870] An information conversion means that generates an audio guide based on the analyzed information,
[0871] An input analysis means that receives voice instructions from the user, analyzes them, and converts them into actions, and
[0872] A control means that performs the actions based on the voice instructions,
[0873] A system comprising all of the above means.
[0874] (Claim 2)
[0875] The system according to claim 1, further comprising a playback means for presenting the generated audio guide.
[0876] (Claim 3)
[0877] The system according to claim 1, comprising a feedback means for providing tactile feedback to the user.
[0878] "Application Example 1"
[0879] (Claim 1)
[0880] An image analysis means that analyzes environmental information in real time via a user interface,
[0881] A voice instruction generation means that generates voice instructions based on analyzed visual information,
[0882] A voice analysis means that receives voice instructions from the user, analyzes them, and converts them into action commands, and
[0883] An automatic control means that automatically controls actions based on the action commands,
[0884] A system comprising all of the above means.
[0885] (Claim 2)
[0886] The system according to claim 1, further comprising an audio playback means for presenting generated voice instructions to the user.
[0887] (Claim 3)
[0888] The system according to claim 1, comprising haptic feedback means for providing tactile feedback to the user.
[0889] "Example 2 (combined with an emotion engine)"
[0890] (Claim 1)
[0891] An image recognition means that analyzes visual information in real time via a user interface,
[0892] An audio guide generation means that generates an audio guide based on the analyzed visual information,
[0893] An emotion recognition means that recognizes and analyzes the user's emotional state,
[0894] A voice input analysis means that receives voice instructions from the user, analyzes them, and converts them into in-game actions, and
[0895] An automatic control means that performs the in-game actions based on the voice instructions,
[0896] A system comprising all of the above means.
[0897] (Claim 2)
[0898] The system according to claim 1, further comprising an audio playback means for adjusting and presenting the generated audio guide according to the user's emotional state.
[0899] (Claim 3)
[0900] The system according to claim 1, comprising a haptic feedback means that provides tactile feedback to the user and adjusts the content of the feedback according to the user's emotional state.
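The emotion-dependent adjustment in claims 2 and 3 above can be pictured as scaling the presentation parameters of the audio guide and the haptic feedback according to the recognized emotional state. The sketch below is only an illustration under stated assumptions: the emotion labels, the scale factors in `ADJUSTMENTS`, and the `FeedbackSettings` fields are hypothetical values chosen for the example and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackSettings:
    speech_rate: float       # 1.0 means normal speaking speed
    haptic_intensity: float  # 0.0 (off) to 1.0 (maximum)


# Hypothetical baseline and per-emotion scale factors
# (speech rate scale, haptic intensity scale).
BASE = FeedbackSettings(speech_rate=1.0, haptic_intensity=0.5)
ADJUSTMENTS = {
    "calm": (1.0, 1.0),
    "stressed": (0.8, 0.6),  # slower speech, gentler vibration
    "excited": (1.1, 1.2),   # slightly faster speech, stronger vibration
}


def adjust_feedback(emotion: str,
                    base: FeedbackSettings = BASE) -> FeedbackSettings:
    # Unknown emotion labels leave the baseline unchanged.
    rate_scale, haptic_scale = ADJUSTMENTS.get(emotion, (1.0, 1.0))
    # Clamp the intensity so it stays within the valid 0.0 to 1.0 range.
    intensity = min(1.0, max(0.0, base.haptic_intensity * haptic_scale))
    return FeedbackSettings(
        speech_rate=round(base.speech_rate * rate_scale, 2),
        haptic_intensity=round(intensity, 2),
    )
```

Keeping the adjustment in a lookup table separates the emotion recognition step, which only has to output a label, from the playback and haptic means, which only consume the resulting settings.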
[0901] "Application Example 2 (combined with an emotion engine)"
[0902] (Claim 1)
[0903] An image recognition means that analyzes visual information in real time via a user interface,
[0904] An audio guide generation means that generates an audio guide based on the analyzed visual information,
[0905] An emotion analysis means that analyzes the user's emotional state and adjusts the audio guide and feedback based on the analysis results,
[0906] A voice input analysis means that receives voice instructions from the user, analyzes them, and converts them into in-game actions, and
[0907] An automatic control means that performs the in-game actions based on the voice instructions,
[0908] A system comprising all of the above means.
[0909] (Claim 2)
[0910] The system according to claim 1, further comprising an audio playback means for presenting the generated audio guide to the user.
[0911] (Claim 3)
[0912] The system according to claim 1, comprising haptic feedback means that provides tactile feedback to the user and adjusts the feedback characteristics based on the user's emotional state.
[Explanation of Symbols]
[0913]
10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots
Claims
1. A system comprising:
an image analysis means that analyzes environmental information in real time via a user interface;
a voice instruction generation means that generates voice instructions based on analyzed visual information;
a voice analysis means that receives voice instructions from the user, analyzes them, and converts them into action commands; and
an automatic control means that automatically controls actions based on the action commands.
2. The system according to claim 1, further comprising an audio playback means for presenting generated voice instructions to the user.
3. The system according to claim 1, comprising haptic feedback means for providing tactile feedback to the user.