system

A system that analyzes ambient sound data to generate sign language videos for drivers with hearing impairments, facilitating safe driving by visually indicating emergency vehicle approach and actions.

JP2026096484A · Pending · Publication Date: 2026-06-15 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-03
Publication Date: 2026-06-15

AI Technical Summary

Technical Problem

Drivers with hearing impairments cannot hear the sirens of emergency vehicles, making it difficult to recognize their presence and direction, which can lead to delayed and inappropriate driving actions, compromising traffic safety and the rapid passage of emergency vehicles.

Method used

A system that acquires ambient sound data, analyzes it to identify emergency vehicle patterns, and generates sign language videos to visually instruct drivers on appropriate actions, displayed on the vehicle's screen.

Benefits of technology

Enables hearing-impaired drivers to quickly and accurately recognize the approach of emergency vehicles, allowing them to take appropriate driving actions without relying on hearing, thereby ensuring safe and smooth vehicle operation.

✦ Generated by Eureka AI based on patent content.

Abstract

[Problem] To provide a system. [Solution] A system including: an acquisition means for acquiring surrounding audio data; an analysis means for analyzing the audio data to identify the audio pattern of a specific emergency vehicle; a generation means for generating a sign language video containing instruction information based on the identified audio pattern; and a display means for displaying the sign language video on the vehicle's display.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of the chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Drivers with hearing impairments cannot hear the sirens of emergency vehicles, so it is difficult for them to grasp the presence and approach direction of an emergency vehicle. As a result, they may be slow to take appropriate avoidance actions, which may hinder traffic safety and the rapid passage of emergency vehicles. To solve these problems, there is a need for a means that lets the driver visually recognize the approach of an emergency vehicle without relying on hearing and that prompts quick and accurate action.

Means for Solving the Problems

[0005] The present invention comprises an acquisition means for acquiring ambient sound data and an analysis means for analyzing the sound data to identify the sound pattern of an emergency vehicle. Furthermore, it provides a generation means for generating a sign language video containing appropriate instructions based on the identified sound pattern, and a display means for displaying the video on the vehicle's display. This configuration enables drivers with hearing impairments to visually recognize the approach of an emergency vehicle and respond quickly.

[0006] "Acquisition means" refers to devices or functions for collecting audio data from the external environment.

[0007] "Analysis means" refers to devices or algorithms used to process collected audio data and identify specific patterns or features.

[0008] "Generation means" refers to a device or program that creates sign language videos containing appropriate instructional information based on the results of analyzing audio data.

[0009] "Display means" refers to devices or functions for visually presenting the generated sign language video on a display inside the vehicle.

[0010] "Instructional information" refers to information designed to prompt drivers to take action in response to the approach of emergency vehicles, and consists of specific instructions provided through sign language videos.

[0011] "Emergency vehicles" are vehicles that use sirens and have priority on public roads, and include police cars, ambulances, and fire trucks.

[0012] A "sign language video" is a video created to visually represent instructions using sign language and to prompt drivers to take necessary actions.

[0013] "Approach direction" refers to information indicating the position from which the emergency vehicle is approaching the vehicle.

Brief Description of the Drawings

[0014]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map to which a plurality of emotions are mapped.
[Figure 10] An emotion map to which a plurality of emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

Modes for Carrying Out the Invention

[0015] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0016] First, the terms used in the following description will be explained.

[0017] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0018] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0019] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0020] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0021] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same interpretation applies when three or more items are linked by "and/or."

[0022] [First Embodiment]

[0023] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0024] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0025] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0026] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0027] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0028] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0029] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0030] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0031] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0032] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0033] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0034] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0035] This invention is built as an assistance system to help drivers with hearing impairments recognize the approach of emergency vehicles and drive safely. The system is designed to function smoothly in actual anticipated use cases by having the terminal, server, and user each play their respective roles.

[0036] Terminal role:

[0037] The terminal first acquires ambient audio data using a microphone mounted in the vehicle. This audio data is collected in real time, buffered, and sent to the server at regular intervals. The terminal also shows the sign language videos containing instructional information received from the server on the vehicle's display. This display gives the user quick visual feedback, helping them respond immediately in emergencies.

[0038] Server role:

[0039] The server receives audio data transmitted from the terminal and analyzes the audio waveform using signal processing techniques such as FFT. As a result of the analysis, it identifies audio patterns associated with a specific emergency vehicle and evaluates its approach direction. Based on this information, the server generates an appropriate sign language video. The generated sign language video includes instructional information based on the siren audio pattern and approach direction, prompting the driver to take specific action. The server transmits this sign language video to the terminal, enabling real-time display.

[0040] User roles:

[0041] The driver, as the user, checks the sign language video visually displayed on the terminal and decides on the appropriate driving action according to the current traffic situation. For example, if the video displays the instruction to "turn left," the driver can immediately change direction to avoid obstructing the passage of emergency vehicles. This system makes it possible to continue safe driving without relying on hearing.

[0042] In this way, through a series of processes comprising the acquisition of audio data on the terminal, audio analysis and sign language video generation on the server, and visual feedback to the user, the present invention supports safe and smooth driving. For example, when an ambulance is approaching, a video instruction such as "Emergency vehicle approaching: Stop and yield the right of way" is displayed, allowing the user to respond to the situation safely and quickly.

[0043] The following describes the processing flow.

[0044] Step 1:

[0045] The terminal uses a microphone installed in the vehicle to acquire ambient audio data. The audio data is collected in real time and temporarily stored in a buffer within the terminal. At regular intervals, this audio data is converted into packet format and sent to the server.
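The buffering and packetizing in Step 1 can be sketched as follows. This is a minimal illustration only: the sampling rate, the packet size, the header layout, and the class name AudioBuffer are all assumptions made for the example and are not specified in the source.

```python
import struct
from collections import deque

SAMPLE_RATE = 16_000    # assumed sampling rate in Hz (not given in the source)
PACKET_SAMPLES = 4_000  # assumed packet size: 0.25 s of audio per packet


class AudioBuffer:
    """Buffers 16-bit PCM samples and emits fixed-size packets (hypothetical)."""

    def __init__(self, packet_samples=PACKET_SAMPLES):
        self.packet_samples = packet_samples
        self.samples = deque()
        self.sequence = 0

    def push(self, new_samples):
        """Add samples to the buffer and return any packets that became complete."""
        self.samples.extend(new_samples)
        packets = []
        while len(self.samples) >= self.packet_samples:
            chunk = [self.samples.popleft() for _ in range(self.packet_samples)]
            # Assumed header: sequence number and sample count, both little-endian uint32.
            header = struct.pack("<II", self.sequence, len(chunk))
            payload = struct.pack(f"<{len(chunk)}h", *chunk)
            packets.append(header + payload)
            self.sequence += 1
        return packets
```

Samples that do not yet fill a packet stay in the buffer until the next call, which approximates sending "at regular intervals" when the microphone delivers samples at a constant rate.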

[0046] Step 2:

[0047] The server receives audio data transmitted from the terminal. The received data is analyzed by a signal processing algorithm. Specifically, the frequency components of the audio signal are extracted using FFT (Fast Fourier Transform), and characteristic frequency patterns corresponding to emergency vehicle sirens are detected.
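As a rough sketch of the frequency analysis in Step 2, the code below implements a textbook radix-2 FFT using only the Python standard library and checks whether the strongest frequency falls inside a siren band. The band limits in SIREN_BAND_HZ and the function names are assumptions for illustration; the source does not state which frequencies or decision rule the system uses, and a production system would more likely rely on an optimized FFT library.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate):
    """Return the frequency in Hz of the strongest non-DC bin below Nyquist."""
    spectrum = fft([complex(s) for s in samples])
    half = len(samples) // 2
    magnitudes = [abs(c) for c in spectrum[1:half]]
    peak_bin = 1 + max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return peak_bin * sample_rate / len(samples)


# Assumed siren band; real sirens differ by country and vehicle type.
SIREN_BAND_HZ = (600.0, 1600.0)


def looks_like_siren(samples, sample_rate):
    """Crude check: is the dominant frequency inside the assumed siren band?"""
    low, high = SIREN_BAND_HZ
    return low <= dominant_frequency(samples, sample_rate) <= high
```

A single dominant frequency is a deliberately simple criterion. Sirens typically sweep or alternate in pitch, so a more realistic detector would track how the spectrum changes across successive frames.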

[0048] Step 3:

[0049] The server identifies the approach of an emergency vehicle based on the detected frequency pattern. It also evaluates the direction of the audio signal to determine the direction of the approaching emergency vehicle (e.g., left, right, rear). Based on this information, it determines the instructions to provide to the user.
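One simple way to picture how Step 3 turns a detection result into an instruction is a lookup table keyed by approach direction. The sketch below is hypothetical: the Detection type, the table contents, and the fallback message are assumptions, and appropriate instructions depend on local traffic rules that the source does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Detection:
    """Hypothetical result of the siren analysis."""
    siren_detected: bool
    direction: Optional[str]  # "left", "right", "rear", or None if unknown


# Assumed instruction table; real instructions depend on local traffic law.
INSTRUCTIONS = {
    "left": "Emergency vehicle approaching from the left: slow down and yield",
    "right": "Emergency vehicle approaching from the right: slow down and yield",
    "rear": "Emergency vehicle approaching from behind: pull over and stop",
}
DEFAULT_INSTRUCTION = "Emergency vehicle approaching: Stop and yield the right of way"


def choose_instruction(detection):
    """Return the instruction text to sign, or None when no siren was detected."""
    if not detection.siren_detected:
        return None
    return INSTRUCTIONS.get(detection.direction, DEFAULT_INSTRUCTION)
```

When the direction is unknown, the sketch falls back to the generic message quoted in paragraph [0042].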

[0050] Step 4:

[0051] The server generates a sign language video corresponding to the selected instruction. This video is formatted using machine learning algorithms and existing video libraries to include appropriate instruction information (e.g., turn left, stop). The sign language video is then encoded and converted into a transmittable data format.

[0052] Step 5:

[0053] The terminal receives sign language video data transmitted from the server. It verifies the integrity of the video data and prepares it for display on the vehicle's display. When displayed, it is positioned appropriately so as not to obstruct the driver's view.
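The source does not say how the terminal verifies the integrity of the video data. One common approach, shown here purely as an assumption, is for the server to attach a SHA-256 digest and for the terminal to recompute it on receipt. The message layout and function names are hypothetical.

```python
import hashlib
import hmac


def package_video(video_bytes):
    """Server side (assumed): attach a SHA-256 digest to the encoded video."""
    return {
        "sha256": hashlib.sha256(video_bytes).hexdigest(),
        "payload": video_bytes,
    }


def verify_video(message):
    """Terminal side (assumed): recompute the digest and compare it."""
    actual = hashlib.sha256(message["payload"]).hexdigest()
    return hmac.compare_digest(actual, message["sha256"])
```

A plain digest detects accidental corruption in transit. It does not protect against deliberate tampering, which would require a keyed MAC or a signature.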

[0054] Step 6:

[0055] The driver, as the user, visually confirms the sign language video displayed on the terminal's screen. Based on the information presented, and considering the current traffic conditions and instructions on the display, they decide on appropriate driving actions. This enables safe and swift driving.

[0056] (Example 1)

[0057] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0058] Drivers with hearing impairments may have difficulty auditorily detecting the approach of emergency vehicles, which can hinder them from taking appropriate driving actions. Therefore, there is a need to recognize the approach of emergency vehicles in a way that does not rely on hearing, and to provide appropriate driving assistance.

[0059] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0060] In this invention, the server includes acquisition means for acquiring ambient sounds, analysis means for analyzing the sounds to identify the audio features of a specific emergency vehicle, and generation means for generating sign language expressions including instruction information based on the identified audio features. This enables drivers with hearing impairments to visually recognize the approach of an emergency vehicle and take appropriate driving actions.

[0061] "Acquisition means" refers to devices and methods for collecting ambient sounds, such as microphones, to obtain audio data.

[0062] "Analysis means" refers to a device or method for processing collected audio data and extracting information according to a specific purpose, and includes the step of identifying audio features.

[0063] "Generation means" refers to devices and methods for creating visual information based on the analyzed results, and specifically includes the process of generating sign language expressions.

[0064] "Display means" refers to devices or methods for outputting generated visual information and presenting it to the user for confirmation, and primarily involves the use of displays.

[0065] "Direction evaluation means" refers to devices or methods for determining the position and direction of an approaching emergency vehicle, and may utilize differences in sound arrival time or changes in volume.

[0066] "Audio features" refers to the characteristics of patterns and signals contained within audio data, and they are used as information to identify a specific sound source.

[0067] This invention provides a system installed in a vehicle to visually notify the user of the approach of an emergency vehicle, in order to assist drivers with hearing impairments. The following describes the configuration for implementing the system.

[0068] Device configuration:

[0069] The terminal is installed inside the vehicle and includes a series of hardware devices and software programs. Specifically, it uses a microphone to collect ambient sounds. The acquired audio is stored in an internal buffer. This audio data is transmitted to a server using wireless communication technology. In addition, the terminal is equipped with a high-resolution display that shows sign language expressions sent from the server.

[0070] Server functions:

[0071] The server analyzes the audio data received from the terminal. Here, it analyzes the audio waveform using signal processing techniques such as FFT (Fast Fourier Transform) to identify audio features related to emergency vehicles. Furthermore, it evaluates the time difference in sound arrival to estimate the direction of approach. Based on these analysis results, the server generates sign language expressions using a generative AI model. In the generation process, the prompt message used is "Generate sign language videos corresponding to siren patterns."
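The arrival-time comparison mentioned above can be sketched with two microphones, one on each side of the vehicle. The code estimates the delay between the channels by brute-force cross-correlation and converts it into a coarse direction. The two-microphone layout, the tolerance, the direction labels, and the far-field bearing formula are assumptions for illustration; the source does not describe the microphone arrangement.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at about 20 degrees C


def best_lag(left, right, max_lag):
    """Lag in samples that maximizes the cross-correlation of the two channels.

    A positive lag means the right channel is delayed relative to the left,
    i.e. the sound reached the left microphone first.
    """
    def corr(lag):
        total = 0.0
        for i, a in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                total += a * right[j]
        return total

    return max(range(-max_lag, max_lag + 1), key=corr)


def direction_from_lag(lag, tolerance=1):
    """Coarse direction label; two side microphones cannot tell front from rear."""
    if lag > tolerance:
        return "left"
    if lag < -tolerance:
        return "right"
    return "ahead-or-behind"


def bearing_degrees(lag, sample_rate, mic_spacing_m):
    """Far-field bearing estimate from the delay; positive values point left."""
    delay = lag / sample_rate
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * delay / mic_spacing_m))
    return math.degrees(math.asin(ratio))
```

Distinguishing "rear" from "front", as the direction examples in the source suggest, would need at least one additional microphone or another cue such as the change in volume over time.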

[0072] User actions:

[0073] The driver, as the user, visually confirms the sign language expression displayed on the terminal's screen and takes appropriate driving actions according to the traffic situation. For example, if the displayed instruction is "pull over to the right and stop," the user will move the vehicle to the right to give priority to emergency vehicles.

[0074] This system will enable drivers with hearing impairments to properly recognize approaching emergency vehicles and continue driving safely.

[0075] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0076] Step 1:

[0077] The terminal acquires ambient sound data using a microphone mounted on the vehicle. This microphone senses various sounds received from outside the vehicle and stores them as audio signals in the terminal's buffer. The input is ambient sound, and the output is the stored digital audio data.

[0078] Step 2:

[0079] The terminal processes the stored audio data and transmits it to the server via wireless communication. In this transmission process, the audio data is packetized and delivered to the server over the network. The input is the audio data stored in the terminal, and the output is the packets of audio data sent to the server.

[0080] Step 3:

[0081] The server analyzes the received audio data and identifies specific audio features using FFT and peak detection algorithms. The data analysis performed here aims to convert the audio signal from the time domain to the frequency domain and extract feature patterns associated with emergency vehicle sirens. The input is the audio data sent to the server, and the output is the identified audio features and their associated information.
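The peak detection mentioned in this step can be illustrated with a simple local-maximum search over a magnitude spectrum. This is a sketch under stated assumptions: the relative threshold of 0.5 and the function names are invented for the example, and the source does not describe the actual peak detection algorithm.

```python
def find_peaks(magnitudes, threshold):
    """Indices of local maxima whose magnitude is at least `threshold`.

    A flat plateau is reported once, at its first bin.
    """
    peaks = []
    for i in range(1, len(magnitudes) - 1):
        m = magnitudes[i]
        if m >= threshold and m > magnitudes[i - 1] and m >= magnitudes[i + 1]:
            peaks.append(i)
    return peaks


def peak_frequencies(magnitudes, sample_rate, fft_size, relative_threshold=0.5):
    """Convert peak bins to Hz, keeping peaks above a fraction of the maximum."""
    if not magnitudes:
        return []
    threshold = relative_threshold * max(magnitudes)
    return [i * sample_rate / fft_size for i in find_peaks(magnitudes, threshold)]
```

For example, with a 16-point FFT at a sampling rate of 8000 Hz, each bin spans 500 Hz, so a peak in bin 2 corresponds to 1000 Hz.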

[0082] Step 4:

[0083] The server generates sign language expressions using a generative AI model based on the analysis results of the audio features. In this process, the analysis results are supplied to the AI model as a prompt, and the model outputs a sign language video containing appropriate action instructions for the driver. The content of this video corresponds to the siren pattern and the direction of approach. The input is a prompt based on the audio features, and the output is video data of the generated sign language expression.
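How the analysis results might be turned into a prompt can be sketched as plain string construction. Everything here is hypothetical: the AudioFeatures fields, the prompt wording, and the function name are assumptions, and no particular generative model or API is implied. The sketch only builds the text that would be passed to such a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFeatures:
    """Hypothetical summary of the analysis results."""
    siren_pattern: str  # e.g. "ambulance" (assumed label)
    direction: str      # e.g. "rear"
    peak_hz: float      # dominant siren frequency in Hz


def build_prompt(features, instruction):
    """Assemble a text prompt for a sign language video generator (assumed format)."""
    return (
        "Generate a sign language video for a driver.\n"
        f"Detected siren pattern: {features.siren_pattern}\n"
        f"Approach direction: {features.direction}\n"
        f"Dominant frequency: {features.peak_hz:.0f} Hz\n"
        f"Instruction to sign: {instruction}"
    )
```

Keeping the prompt builder separate from the model call makes it straightforward to log and review exactly what was requested for each detection.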

[0084] Step 5:

[0085] The server sends the generated sign language video to the terminal. The terminal receives this video data and displays it on the in-vehicle display. This allows the user to visually recognize the approach of an emergency vehicle and act according to the instructions. The input is the sign language video data sent from the server, and the output is the sign language expression displayed on the terminal's display.

[0086] Step 6:

[0087] The user sees the sign language instruction displayed on the terminal's display and decides on the appropriate driving action based on the current situation. For example, if "move to the right" is displayed, the user follows the instruction and maneuvers the vehicle to allow the emergency vehicle to pass. The input is the information displayed on the terminal, and the output is the user's driving action.

[0088] (Application Example 1)

[0089] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0090] There is a problem in that hearing-impaired drivers have difficulty visually recognizing approaching emergency vehicles and responding appropriately. In particular, with autonomous vehicles, the lack of visual information provided to the occupants makes it difficult for hearing-impaired drivers to respond to emergencies.

[0091] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0092] In this invention, the server includes information gathering means for acquiring ambient acoustic data, analysis means for analyzing the acoustic data to identify the acoustic pattern of a specific emergency mobility device, and information generation means for generating visual information including operation instruction information based on the identified acoustic pattern. This enables the provision of visual information that allows hearing-impaired individuals to safely respond to emergencies inside an autonomous vehicle.

[0093] "Surroundings" refers to the environment and circumstances that exist outside the moving object in question.

[0094] "Acoustic data" refers to information acquired as sound or acoustic signals.

[0095] "Information gathering means" refers to devices and methods used to acquire necessary data.

[0096] "Specific emergency mobility device" refers to vehicles used in emergencies, such as ambulances and police vehicles.

[0097] An "acoustic pattern" refers to the characteristic waveform or signal arrangement of a particular sound or audio.

[0098] "Analysis means" refers to devices and methods used to analyze acquired data and extract necessary information.

[0099] "Operation instruction information" refers to information that indicates what action should be taken towards the target.

[0100] "Visual information" refers to information provided in a format that humans can see and recognize.

[0101] "Information generation means" refers to devices or methods that create necessary information based on analysis results.

[0102] "Information output means" refers to devices or methods for presenting generated information to users.

[0103] A system implementing this invention includes a program for analyzing ambient acoustic data and providing visual information to the user based on that data.

[0104] Server Role

[0105] The server begins by acquiring acoustic data. To fulfill this role, the server uses analytical tools to analyze the acoustic data. This analysis employs signal processing techniques such as FFT (Fast Fourier Transform). By analyzing the acoustic data, the server identifies acoustic patterns associated with specific emergency mobility devices and evaluates their approach direction. The data that forms the basis of the acoustic data is collected using speech recognition libraries such as Google® Cloud Speech-to-Text API.

[0106] Terminal role

[0107] The terminal displays visual information received from the server to the user. This visual information is presented in an easily recognizable format and provides the user with action instructions. Information is conveyed through the terminal's display, showing specific instructions such as "turn left" or "stop and yield." Machine learning frameworks such as TensorFlow® are used to generate this information.

[0108] User roles

[0109] Users review the visual information displayed on the device and take appropriate action according to the provided instructions. This plays a crucial role in helping them make sound judgments while driving. Visual information enables safe driving without relying on hearing.

[0110] Specific example

[0111] For example, when an ambulance is approaching, a message such as "Emergency vehicle approaching: Stop and yield the right of way" will be displayed. This allows the user to respond safely and quickly.

[0112] Example of a Prompt for the Generative AI Model

[0113] "Please describe the process of generating a sign language video and displaying it on the screen when an ambulance siren is detected."

[0114] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0115] Step 1:

[0116] The server receives ambient acoustic data from the terminal. This acoustic data is acquired by microphones mounted on the vehicle and transmitted to the server. The input is the raw audio waveform data, and the output is the same data passed on to the next analysis step.

[0117] Step 2:

[0118] The server uses FFT (Fast Fourier Transform) to analyze the received acoustic data. The input is the acoustic data obtained in step 1, and the frequency components are analyzed by applying FFT to it. The output is the frequency spectrum, which serves as the basis for identifying the acoustic patterns of specific emergency vehicles. This analysis allows for the detection of specific acoustic patterns of emergency vehicles.

[0119] Step 3:

[0120] Based on the analysis results, the server determines the operational instructions that correspond to the identified acoustic pattern of the emergency vehicle and its approach direction. The input for this step is the acoustic pattern and approach direction identified in step 2, and the output is information containing appropriate operational instructions. A machine learning model based on historical data is used to generate this information.

[0121] Step 4:

[0122] The terminal receives visual information transmitted from the server. The input is visual information, including operational instructions, generated by the server, and the output is the visual presentation of that information on the display. The terminal notifies the user by displaying this information on the screen.

[0123] Step 5:

[0124] The user confirms the visual information displayed on the terminal's screen and takes appropriate driving actions based on the presented instructions. The input is the visual instructions displayed by the terminal, and the output is the user's driving actions based on that information. This allows the user to continue driving safely and consciously.

[0125] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0126] This invention provides a system that enables drivers with hearing impairments to recognize the approach of emergency vehicles and that provides support tailored to the user's emotional state. The system is primarily built around a terminal, a server, and a user, and incorporates an emotion engine to achieve more personalized and effective support.

[0127] Terminal role:

[0128] The terminal acquires ambient audio data through microphones installed in the vehicle. It can also use facial recognition and biometric sensors to estimate the user's emotional state. This information is used to adjust the system to ensure safe and comfortable driving. The collected data is sent to the server for analysis.

[0129] Server role:

[0130] The server analyzes audio data transmitted from the terminal to identify the audio patterns of emergency vehicles. It uses FFT to analyze the frequency components of the audio signal and detect specific audio patterns. This information is also useful in determining the direction from which the emergency vehicle is approaching. Furthermore, the server uses an emotion engine to assess the user's emotional state. This emotional state is used to adjust the content and tone of the generated sign language video, providing instructions appropriate to the user's mental state.

[0131] User roles:

[0132] The driver, as the user, checks sign language videos displayed on the vehicle's screen via a terminal. For example, if the system detects that the user is experiencing stress, a sign language video with more detailed instructions in a gentle and positive tone is provided. Based on this feedback, the user can choose appropriate driving actions in response to the approach of an emergency vehicle.

[0133] In this way, a system that incorporates an emotion engine provides more flexible and effective driving assistance according to the driver's individual state. In this embodiment, by comprehensively analyzing the user's emotions and the surrounding sound environment, it is possible to reduce the driver's psychological burden and create a more comfortable and safe driving environment. For example, if the emotion engine detects that the user is tense, it can provide appropriate instructions such as, "Calm down and turn left."

[0134] The following describes the processing flow.

[0135] Step 1:

[0136] The terminal uses microphones and cameras mounted on the vehicle, or biometric sensors, to acquire ambient audio data and user emotion data. Audio data is used to detect sirens, and emotion data is obtained from the user's facial expressions and heart rate. This data is temporarily stored in a buffer and prepared to be sent to the server.

[0137] Step 2:

[0138] The terminal transmits the collected audio and emotion data to the server. The data is packetized and transmitted in real time, at a precision sufficient for detailed analysis.

[0139] Step 3:

[0140] The server analyzes the audio data transmitted from the terminal and extracts frequency components using FFT. This allows it to identify the siren of an emergency vehicle and determine its approach direction. The direction is determined from the intensity and phase difference of the audio signal.
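
The intensity-based part of this direction estimate can be sketched as follows. This is a minimal illustration, not the method of the specification: it assumes two microphones (left and right), which the text does not state, and the 3 dB threshold, the 960 Hz test tone, and the function names are all hypothetical. Phase differences are not used here.

```python
import math

def rms(samples):
    """Root-mean-square level of one channel."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def direction_from_levels(left, right, threshold_db=3.0):
    """Classify the louder side from the inter-channel level difference in dB.

    threshold_db is an assumed tuning value, not taken from the source.
    """
    eps = 1e-12  # avoids log10(0) on silent input
    diff_db = 20.0 * math.log10((rms(left) + eps) / (rms(right) + eps))
    if diff_db > threshold_db:
        return "left"
    if diff_db < -threshold_db:
        return "right"
    return "center"

# Synthetic tone that is twice as loud on the left microphone (about +6 dB).
rate = 8000
tone = [math.sin(2 * math.pi * 960 * n / rate) for n in range(800)]
left = [1.0 * s for s in tone]
right = [0.5 * s for s in tone]
print(direction_from_levels(left, right))
```

A level comparison alone cannot separate front from rear, so a practical system would likely combine it with arrival-time or phase information.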

[0141] Step 4:

[0142] The server uses an emotion engine to evaluate the user's emotional state. Based on the supplied facial expression data and biometric information, it determines whether the user is relaxed or stressed. This allows it to identify the instructions best suited to the user's emotional state.

[0143] Step 5:

[0144] The server generates appropriate sign language videos based on the results of the audio analysis and the emotion assessment. For example, if an emergency vehicle is approaching and the user is feeling anxious, the server will prepare a video that includes gentle instructions such as "Stay calm and change direction."

[0145] Step 6:

[0146] The terminal displays sign language videos received from the server on a screen inside the vehicle. This display is positioned where the user can easily see it and is configured to ensure visibility. Because it is updated in real time, timely information can be provided.

[0147] Step 7:

[0148] The user checks the sign language video displayed on the screen. Following the instructions, they decide on driving actions and respond to emergency vehicles. Emotionally sensitive instructions allow the user to continue driving with greater confidence.

[0149] (Example 2)

[0150] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0151] For drivers with hearing impairments, it is crucial to quickly and accurately recognize approaching emergency vehicles and respond safely. However, conventional technologies are limited to simple analysis of acoustic signals and do not provide individualized support based on the user's emotional state. As a result, even when users are experiencing stress or anxiety, only uniform information can be provided, which hinders effective driving assistance.

[0152] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0153] In this invention, the server includes an analysis means for analyzing acoustic data to identify the acoustic pattern of an emergency vehicle, a state determination means for evaluating the user's emotional state, and a generation means for generating sign language video based on the identified acoustic pattern and the evaluated emotional state. This makes it possible to provide flexible and optimal driving assistance information according to the user's emotional state.

[0154] "Acquisition means" refers to a function that uses equipment installed in the vehicle to collect acoustic data from the surrounding environment.

[0155] The "analysis means" refers to a function that analyzes acquired acoustic data and identifies the acoustic pattern of a specific emergency vehicle.

[0156] A "state determination means" is a function that evaluates the user's current emotional state based on the user's facial expressions and biometric information.

[0157] The "generation means" is a function for creating sign language video containing appropriate instructional information based on identified sound patterns and the user's emotional state.

[0158] "Display means" refers to a function for displaying the generated sign language video on a visual display device inside the vehicle.

[0159] The "direction estimation means" is a function for estimating the direction of approach of an emergency vehicle from the analyzed acoustic pattern.

[0160] This invention is a system designed to enable drivers with hearing impairments to recognize the approach of emergency vehicles and receive appropriate driving assistance. Its specific configuration and implementation method are described below.

[0161] Terminal role:

[0162] The terminal includes microphones and cameras installed in the vehicle, and through this hardware, it collects ambient acoustic data, user facial expressions, and biometric data. For example, the terminal can acquire external siren sounds in real time using the microphone, and capture data on the user's emotional state using the camera and heart rate sensor. The collected data is transmitted to the server using a secure communication method.

[0163] Server role:

[0164] The server analyzes the acoustic data transmitted from the terminal using algorithms such as FFT (Fast Fourier Transform) to identify the acoustic patterns of emergency vehicles. Furthermore, the server uses an emotion engine to evaluate the user's emotional state from the received facial expression data and biosignals. The necessary software for this is acoustic analysis software and an emotion analysis module. For example, FFT is used to extract specific frequency components and detect siren sounds. Deep learning techniques using AI models can be used for emotion analysis.

[0165] Generation means:

[0166] The server then uses a generative AI model to generate appropriate sign language videos in response to the approaching emergency vehicle. The generated videos include instructions tailored to the user's emotional state. Specifically, a user who is feeling anxious will receive instructions in a gentle tone to help them calm down. An example of a prompt might be, "Emergency vehicle approaching. The user is feeling anxious. Please generate a sign language video with gentle instructions to help them calm down."
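
As an illustration of how such a prompt might be assembled from the analysis results, the following sketch builds the prompt text from an event description and an emotion label. The function name, the emotion labels other than "anxious", and the tone phrases other than the one quoted above are hypothetical; the call to the generative AI model itself is omitted because the specification does not define its interface.

```python
# Tone phrases keyed by emotion label. Only the "anxious" entry mirrors the
# example prompt in the text; the other entries are assumptions.
TONE_BY_EMOTION = {
    "anxious": "gentle instructions to help them calm down",
    "relaxed": "concise instructions",
}

def build_prompt(event, emotion):
    """Compose a prompt string for the generative AI model (hypothetical helper)."""
    tone = TONE_BY_EMOTION.get(emotion, "clear instructions")
    return (
        f"{event}. The user is feeling {emotion}. "
        f"Please generate a sign language video with {tone}."
    )

prompt = build_prompt("Emergency vehicle approaching", "anxious")
print(prompt)
```

Keeping the tone phrases in a table makes it straightforward to add further emotional states without changing the prompt template.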

[0167] User roles:

[0168] The user views sign language video displayed inside the vehicle via a terminal and selects driving actions based on it. This video includes information such as the direction of approaching emergency vehicles and specific instructions that take the user's emotions into consideration. This allows the user to respond quickly and with confidence.

[0169] The implementation of this system will reduce the burden on drivers and support safe and smooth driving.

[0170] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0171] Step 1:

[0172] The terminal acquires ambient acoustic data using microphones installed in the vehicle. Specifically, it collects acoustic data in real time and temporarily stores it in a buffer. The input for this step is the ambient acoustic environment, and the output is the acoustic data prepared for use.

[0173] Step 2:

[0174] The terminal acquires the user's facial expressions and biometric data using cameras and heart rate sensors installed inside the vehicle. Specifically, it captures facial expressions with the camera and measures heart rate with the heart rate sensor. The input for this step is the user's facial expressions and biometric signals, and the output is facial expression data and biometric data prepared for analysis.

[0175] Step 3:

[0176] The terminal transmits the collected acoustic data, user facial expression data, and biometric data to the server. Specifically, it encrypts the data using a secure communication protocol and sends it to the server. The input for this step is the acoustic data and the user's facial expression and biometric data, and the output is the data that has reached the server.

[0177] Step 4:

[0178] The server uses FFT (Fast Fourier Transform) to analyze acoustic data and identify the acoustic patterns of emergency vehicles. Specifically, it analyzes the frequency components of the acoustic data to identify siren sounds. The input to this step is the acoustic data sent to the server, and the output is the identified acoustic pattern of the emergency vehicle.

[0179] Step 5:

[0180] The server uses an emotion engine to analyze the user's facial expression data and biometric data to evaluate the user's emotional state. Specifically, it uses a deep learning-based AI model to determine the emotional state. The input for this step is the user's facial expression data and biometric data, and the output is the determined emotional state of the user.

[0181] Step 6:

[0182] The server uses a generative AI model to generate sign language videos based on acoustic patterns and emotional states. Specifically, prompts are used to instruct the AI model, generating sign language videos that meet the specified conditions. The input for this step is the identified acoustic patterns and evaluated emotional states, and the output is the generated sign language video.

[0183] Step 7:

[0184] The terminal displays sign language video received from the server on the in-car display. Specifically, it uses the display's video playback function to provide the video to the user. The input for this step is the sign language video sent from the server, and the output is the visualized sign language video.

[0185] Step 8:

[0186] The user reviews the displayed sign language video and selects appropriate driving actions based on it. Specifically, they operate the car according to the information provided. The input for this step is the sign language video displayed on the terminal, and the output is the user's specific driving actions.

[0187] (Application Example 2)

[0188] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0189] There is a problem in that drivers with hearing impairments have difficulty quickly recognizing the approach of emergency vehicles and taking appropriate action based on their emotional state. Furthermore, conventional systems provide standardized warnings based only on audio information, which means they cannot respond in a way that is tailored to the emotional state of individual drivers.

[0190] The identification processing by the identification processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means. In this invention, the server includes means for acquiring ambient acoustic information, means for analyzing the acoustic information to identify the acoustic pattern of a specific emergency vehicle, and means for generating visual display data including instruction information based on the identified acoustic pattern and the user's emotional state. This allows drivers with hearing impairments to visually confirm the approach of an emergency vehicle, and allows the system to provide a personalized response according to their emotional state at that time.

[0191] "Ambient acoustic information" refers to all sound data acquired from the vehicle's surrounding environment, including emergency vehicle sirens and road noise.

[0192] "Acquisition means" refers to the function of capturing ambient acoustic information using sensors and microphones inside the vehicle.

[0193] The "analysis means" has the function of analyzing acoustic information obtained through the acquisition means and processing it to identify specific sound patterns of emergency vehicles.

[0194] "Generation means" refers to a function that creates visual display data and provides appropriate instructions based on information identified by the analysis means and the user's emotional state.

[0195] "Display means" refers to a function that displays the generated visual display data on a display inside the vehicle or on a similar device to visually communicate information to the driver.

[0196] "Emotional state" refers to the driver's psychological or physiological state, including internal conditions such as stress and tension.

[0197] "Visual display data" refers to digital content that presents instructions and warnings to the driver as visual information.

[0198] The system implementing this invention primarily consists of a server and a terminal. The terminal includes a microphone and sensors mounted on the vehicle, used to acquire ambient acoustic information and the driver's emotional state. This information is transmitted to the server via data communication. On the server, an audio analysis algorithm is first used to identify the siren sound of an emergency vehicle from the ambient acoustic information. This can be done using a speech recognition API such as Google Cloud Speech-to-Text.

[0199] Furthermore, to assess the driver's emotional state, biometric information acquired from the terminal is analyzed using an emotion recognition API such as Affectiva. This makes it possible to generate appropriate instructions based on the driver's stress level and attention level. The generated instruction information is provided as visual display data on the in-vehicle display. The system informs the driver of approaching emergency vehicles and provides action instructions tailored to their current psychological state through visual feedback.

[0200] For example, suppose a driver is driving on a highway. This system constantly monitors the surrounding sounds in the background, and when it detects an emergency vehicle siren, it takes into account the driver's emotional state and visually displays instructions such as "Please remain calm and drive slowly."

[0201] An example of a prompt to the generative AI model is, "Suggest what kind of AR notification would be appropriate to alert the user to the approach of an emergency vehicle, based on the user's emotional state." In this way, this invention provides a system that compensates for the driver's hearing impairment and improves safe driving assistance.

[0202] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0203] Step 1:

[0204] The terminal acquires ambient acoustic information using a microphone mounted on the vehicle. This input data is sent to the server as a raw audio signal.

[0205] Step 2:

[0206] The server uses FFT (Fast Fourier Transform) to decompose the received audio signal data into frequency components and identifies the siren sound of an emergency vehicle. The output provides information on the presence and approach direction of the emergency vehicle.

[0207] Step 3:

[0208] The terminal uses facial recognition of the driver and biometric sensors to acquire the driver's current emotional information. This information is sent to the server as data indicating the driver's stress level and concentration level.

[0209] Step 4:

[0210] The server utilizes an emotion recognition API, such as Affectiva, to analyze the received emotional state data. Based on the emotion evaluation, it determines the optimal driving instructions. The output of this process is the specific instructions.

[0211] Step 5:

[0212] The server generates visual display data, creating information that includes instructions best suited to the driver's current situation and emotional state. The generated visual display data is sent to the terminal.

[0213] Step 6:

[0214] The terminal displays the received visual display data on the vehicle's display. By visually confirming this, the driver can recognize the approach of an emergency vehicle and take appropriate driving actions in response.

[0215] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0216] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0217] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0218] [Second Embodiment]

[0219] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0220] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0221] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0222] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0223] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0224] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0225] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0226] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0227] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0228] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the identification processing unit 290.

[0229] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0230] Next, the identification processing performed by the identification processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0231] This invention is built as an assistance system to help drivers with hearing impairments recognize the approach of emergency vehicles and drive safely. The system is designed to function smoothly in actual anticipated use cases by having the terminal, server, and user each play their respective roles.

[0232] Terminal role:

[0233] The terminal first acquires ambient audio data using a microphone mounted in the vehicle. This audio data is collected in real time, buffered at regular intervals, and then sent to the server. The terminal also displays sign language videos containing instructional information received from the server on the vehicle's display. This display provides the user with quick visual feedback, helping them to respond immediately in emergencies.

[0234] Server role:

[0235] The server receives audio data transmitted from the terminal and analyzes the audio waveform using signal processing techniques such as FFT. As a result of the analysis, it identifies audio patterns associated with a specific emergency vehicle and evaluates its approach direction. Based on this information, the server generates an appropriate sign language video. The generated sign language video includes instructional information based on the siren audio pattern and approach direction, prompting the driver to take specific action. The server transmits this sign language video to the terminal, enabling real-time display.

[0236] User roles:

[0237] The driver, as the user, checks the sign language video visually displayed on the terminal and decides on the appropriate driving action according to the current traffic situation. For example, if the video displays the instruction to "turn left," the driver can immediately change direction to avoid obstructing the passage of emergency vehicles. This system makes it possible to continue safe driving without relying on hearing.

[0238] In this way, through a series of processes including the acquisition of audio data on the terminal, audio analysis and sign language video generation on the server, and visual feedback to the user, the present invention provides a safe and smooth vehicle driving experience. For example, in a situation where an ambulance is approaching, a video instruction such as "Emergency vehicle approaching: Stop and yield the right of way" is displayed, allowing the user to respond to the situation safely and quickly.

[0239] The following describes the processing flow.

[0240] Step 1:

[0241] The terminal uses a microphone installed in the vehicle to acquire ambient audio data. The audio data is collected in real time and temporarily stored in a buffer within the terminal. At regular intervals, this audio data is converted into packet format and sent to the server.
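
The buffering and packetization described in this step might look like the following sketch. The packet layout (a 32-bit sequence number and a 16-bit sample count followed by 16-bit PCM samples), the packet size, and all names are assumptions made for illustration; the specification does not define a packet format.

```python
import struct
from collections import deque

# Assumed header: sequence number (uint32) and sample count (uint16), little-endian.
HEADER = struct.Struct("<IH")

class AudioPacketizer:
    """Buffers 16-bit PCM samples and emits fixed-size packets (hypothetical helper)."""

    def __init__(self, samples_per_packet=160):
        self.buffer = deque()
        self.samples_per_packet = samples_per_packet
        self.sequence = 0

    def push(self, samples):
        """Append newly captured samples to the buffer."""
        self.buffer.extend(samples)

    def pop_packets(self):
        """Drain all full packets; leftover samples stay buffered for the next call."""
        packets = []
        while len(self.buffer) >= self.samples_per_packet:
            chunk = [self.buffer.popleft() for _ in range(self.samples_per_packet)]
            payload = struct.pack(f"<{len(chunk)}h", *chunk)
            packets.append(HEADER.pack(self.sequence, len(chunk)) + payload)
            self.sequence += 1
        return packets

def parse_packet(packet):
    """Server side: recover the sequence number and samples from one packet."""
    seq, count = HEADER.unpack_from(packet)
    samples = struct.unpack_from(f"<{count}h", packet, HEADER.size)
    return seq, list(samples)

packetizer = AudioPacketizer(samples_per_packet=4)
packetizer.push([1, 2, 3, 4, 5, 6])
for pkt in packetizer.pop_packets():
    print(parse_packet(pkt))
```

The sequence number lets the server detect lost or reordered packets, which matters if the data travels over an unreliable transport.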

[0242] Step 2:

[0243] The server receives audio data transmitted from the terminal. The received data is analyzed by a signal processing algorithm. Specifically, the frequency components of the audio signal are extracted using FFT (Fast Fourier Transform), and characteristic frequency patterns corresponding to emergency vehicle sirens are detected.

[0244] Step 3:

[0245] The server identifies the approach of an emergency vehicle based on the detected frequency pattern. It also evaluates the direction of the audio signal to determine the direction of the approaching emergency vehicle (e.g., left, right, rear). Based on this information, it determines the instructions to provide to the user.
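
The selection of an instruction from the detection result can be sketched as a simple lookup. The instruction wording for each direction is placeholder text, not taken from the specification, and a real system would need to follow local traffic rules; the fallback reuses the example message given earlier in this embodiment. The `Detection` type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Detection:
    siren_detected: bool
    direction: Optional[str] = None  # "left", "right", "rear", or None if unknown

# Placeholder wording (assumption); actual instructions depend on traffic rules.
INSTRUCTION_BY_DIRECTION = {
    "left": "Emergency vehicle approaching from the left. Slow down and yield.",
    "right": "Emergency vehicle approaching from the right. Slow down and yield.",
    "rear": "Emergency vehicle approaching from behind. Pull over and stop.",
}
FALLBACK = "Emergency vehicle approaching. Stop and yield the right of way."

def choose_instruction(detection: Detection) -> Optional[str]:
    """Return the instruction text to sign, or None when no siren was detected."""
    if not detection.siren_detected:
        return None
    return INSTRUCTION_BY_DIRECTION.get(detection.direction, FALLBACK)

print(choose_instruction(Detection(siren_detected=True, direction="rear")))
```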

[0246] Step 4:

[0247] The server generates a sign language video corresponding to the selected instruction. This video is formatted using machine learning algorithms and existing video libraries to include appropriate instruction information (e.g., turn left, stop). The sign language video is then encoded and converted into a transmittable data format.

[0248] Step 5:

[0249] The terminal receives sign language video data transmitted from the server. It verifies the integrity of the video data and prepares it for display on the vehicle's display. When displayed, it is positioned appropriately so as not to obstruct the driver's view.
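
One common way to perform such an integrity check is to compare a digest sent with the payload against a digest computed on receipt. The sketch below assumes the server prefixes the video bytes with a SHA-256 digest; this framing and the function names are assumptions, since the specification does not say how integrity is verified.

```python
import hashlib
import hmac

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes

def attach_digest(video_bytes):
    """Server side: prefix the payload with its SHA-256 digest."""
    return hashlib.sha256(video_bytes).digest() + video_bytes

def verify_and_extract(message):
    """Terminal side: return the payload if the digest matches, otherwise None."""
    digest, payload = message[:DIGEST_SIZE], message[DIGEST_SIZE:]
    if hmac.compare_digest(digest, hashlib.sha256(payload).digest()):
        return payload
    return None

message = attach_digest(b"encoded sign language video bytes")
print(verify_and_extract(message) is not None)
```

A plain digest detects accidental corruption in transit. Protecting against deliberate tampering would additionally require a keyed MAC or a signature.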

[0250] Step 6:

[0251] The driver, as the user, visually confirms the sign language video displayed on the terminal's screen. Based on the information presented, and considering the current traffic conditions and instructions on the display, they decide on appropriate driving actions. This enables safe and swift driving.

[0252] (Example 1)

[0253] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0254] Drivers with hearing impairments may have difficulty auditorily detecting the approach of emergency vehicles, which can hinder them from taking appropriate driving actions. Therefore, there is a need to recognize the approach of emergency vehicles in a way that does not rely on hearing, and to provide appropriate driving assistance.

[0255] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0256] In this invention, the server includes acquisition means for acquiring ambient sounds, analysis means for analyzing the sounds to identify the audio features of a specific emergency vehicle, and generation means for generating sign language expressions including instruction information based on the identified audio features. This enables drivers with hearing impairments to visually recognize the approach of an emergency vehicle and take appropriate driving actions.

[0257] "Acquisition means" refers to devices and methods for collecting ambient sounds, such as microphones, to obtain audio data.

[0258] "Analysis means" refers to a device or method for processing collected audio data and extracting information according to a specific purpose, and includes the step of identifying audio features.

[0259] "Generation means" refers to devices and methods for creating visual information based on the analyzed results, and specifically includes the process of generating sign language expressions.

[0260] "Display means" refers to devices or methods for outputting generated visual information and presenting it to the user for confirmation, and primarily involves the use of displays.

[0261] "Direction evaluation means" refers to devices or methods for determining the position and direction of an approaching emergency vehicle, and may utilize differences in sound arrival time or changes in volume.
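
Estimating direction from a difference in arrival time can be sketched with a brute-force cross-correlation between two channels. The two-microphone (left and right) arrangement, the lag search range, and the function names are assumptions for illustration. The example uses a short pulse because a steady periodic tone would make the best lag ambiguous.

```python
def best_lag(left, right, max_lag):
    """Lag in samples that maximizes sum(left[i] * right[i + lag])."""
    def corr(lag):
        total = 0.0
        for i, value in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                total += value * right[j]
        return total
    return max(range(-max_lag, max_lag + 1), key=corr)

def side_from_lag(lag):
    """A positive lag means the right channel is delayed, so the sound hit the left mic first."""
    if lag > 0:
        return "left"
    if lag < 0:
        return "right"
    return "center"

# Synthetic pulse that reaches the right microphone 3 samples later.
left = [0.0] * 10 + [1.0, 2.0, 3.0, 2.0, 1.0] + [0.0] * 25
right = [0.0] * 3 + left[:-3]
lag = best_lag(left, right, max_lag=5)
print(lag, side_from_lag(lag))
```

With a known microphone spacing and sampling rate, the lag could be converted to an angle, but that geometry is outside what the text specifies.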

[0262] "Audio features" refer to the characteristics of patterns and signals contained within audio data, and are used as information to identify a specific sound source.

[0263] This invention provides a system installed in a vehicle to visually notify the user of the approach of an emergency vehicle, in order to assist drivers with hearing impairments. The following describes the configuration for implementing the system.

[0264] Device configuration:

[0265] The terminal is installed inside the vehicle and includes a series of hardware devices and software programs. Specifically, it uses a microphone to collect ambient sounds. The acquired audio is stored in an internal buffer. This audio data is transmitted to a server using wireless communication technology. In addition, the terminal is equipped with a high-resolution display that shows sign language expressions sent from the server.

[0266] Server functions:

[0267] The server analyzes the audio data received from the terminal. Here, it analyzes the audio waveform using signal processing techniques such as FFT (Fast Fourier Transform) to identify audio features related to emergency vehicles. Furthermore, it evaluates the time difference in sound arrival to estimate the direction of approach. Based on these analysis results, the server generates sign language expressions using a generative AI model. In the generation process, the prompt message used is "Generate sign language videos corresponding to siren patterns."

[0268] User actions:

[0269] The driver, as the user, visually confirms the sign language expression displayed on the terminal's screen and takes appropriate driving actions according to the traffic situation. For example, if the displayed instruction is "pull over to the right and stop," the user will move the vehicle to the right to give priority to emergency vehicles.

[0270] This system will enable drivers with hearing impairments to properly recognize approaching emergency vehicles and continue driving safely.

[0271] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0272] Step 1:

[0273] The terminal acquires ambient sound data using a microphone mounted on the vehicle. This microphone senses various sounds received from outside the vehicle and stores them as audio signals in the terminal's buffer. The input is ambient sound, and the output is the stored digital audio data.
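The buffering in Step 1 could be sketched as a fixed-capacity ring buffer that retains only the most recent samples. The class name `AudioRingBuffer`, its methods, and the capacity values are hypothetical; the text does not state how the terminal's buffer is organized.

```python
from collections import deque


class AudioRingBuffer:
    """Fixed-capacity buffer that keeps only the most recent samples."""

    def __init__(self, sample_rate_hz: int, seconds: float) -> None:
        self.sample_rate_hz = sample_rate_hz
        # A deque with maxlen discards the oldest items once it is full.
        self._samples = deque(maxlen=int(sample_rate_hz * seconds))

    def write(self, samples) -> None:
        """Append newly captured samples (for example 16-bit PCM integers)."""
        self._samples.extend(samples)

    def snapshot(self) -> list:
        """Return the buffered samples, oldest first."""
        return list(self._samples)


# Tiny capacity for the demonstration: 8 samples per second, 1 second.
buf = AudioRingBuffer(sample_rate_hz=8, seconds=1.0)
buf.write(range(10))     # 10 samples written into an 8-sample buffer
latest = buf.snapshot()  # the two oldest samples have been dropped
```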

[0274] Step 2:

[0275] The terminal processes the stored audio data and transmits it to the server via wireless communication. In this transmission process, the audio data is packetized and delivered to the server over the network. The input is the audio data stored in the terminal, and the output is the packets of audio data sent to the server.
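One way the packetization in Step 2 might look is shown below. The header layout (a 32-bit sequence number followed by a 16-bit payload length), the 1024-byte payload size, and the function names `packetize` and `reassemble` are assumptions made for illustration; the text does not define a packet format.

```python
import struct

# Hypothetical header: sequence number (uint32) and payload length (uint16),
# both in network byte order.
HEADER = struct.Struct("!IH")


def packetize(pcm: bytes, payload_size: int = 1024) -> list[bytes]:
    """Split raw audio bytes into packets, each prefixed with a header."""
    packets = []
    for seq, start in enumerate(range(0, len(pcm), payload_size)):
        payload = pcm[start:start + payload_size]
        packets.append(HEADER.pack(seq, len(payload)) + payload)
    return packets


def reassemble(packets: list[bytes]) -> bytes:
    """Rebuild the audio bytes, ordering packets by sequence number."""
    parts = {}
    for packet in packets:
        seq, length = HEADER.unpack_from(packet)
        parts[seq] = packet[HEADER.size:HEADER.size + length]
    return b"".join(parts[seq] for seq in sorted(parts))


pcm = bytes(range(256)) * 10                    # 2560 bytes of dummy audio
packets = packetize(pcm)                        # payloads of 1024, 1024, 512 bytes
restored = reassemble(list(reversed(packets)))  # arrival order does not matter
```

Carrying a sequence number in each packet lets the server restore the original order even when packets arrive out of order.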

[0276] Step 3:

[0277] The server analyzes the received audio data and identifies specific audio features using FFT and peak detection algorithms. This analysis converts the audio signal from the time domain to the frequency domain and extracts the feature patterns associated with emergency vehicle sirens. The input is the audio data sent to the server, and the output is the identified audio features and their related information.
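The FFT and peak detection of Step 3 can be sketched in pure Python as follows. The recursive radix-2 FFT is a textbook formulation, not necessarily the one the system uses. The function names, the Hann window, the `min_ratio` threshold, and the 960 Hz and 768 Hz test tones are assumptions chosen for illustration.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectral_peaks(samples, sample_rate_hz, min_ratio=0.5):
    """Return (frequency_hz, magnitude) pairs for prominent local maxima.

    A bin counts as a peak when it is a local maximum and its magnitude is
    at least min_ratio times the strongest bin.
    """
    n = len(samples)
    # A Hann window reduces spectral leakage from the frame edges.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(samples)
    ]
    mags = [abs(c) for c in fft(windowed)[: n // 2]]
    threshold = min_ratio * max(mags)
    peaks = []
    for k in range(1, len(mags) - 1):
        if mags[k] >= threshold and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
            peaks.append((k * sample_rate_hz / n, mags[k]))
    return peaks


# Synthetic input: two simultaneous tones (illustrative values only).
RATE_HZ = 8192
N = 1024
signal = [
    math.sin(2 * math.pi * 960 * i / RATE_HZ)
    + 0.8 * math.sin(2 * math.pi * 768 * i / RATE_HZ)
    for i in range(N)
]
peaks = spectral_peaks(signal, RATE_HZ)
peak_frequencies = [round(f) for f, _ in peaks]
```

With these settings the frequency resolution is 8 Hz per bin (8192 / 1024), so both test tones fall exactly on a bin.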

[0278] Step 4:

[0279] Based on the analysis results of the audio features, the server generates sign language expressions using a generative AI model. In this process, the analysis results are used as prompt text for the AI model, and a sign language video containing appropriate action instructions for the driver is output. The content of this video corresponds to the siren pattern and the approach direction. The input is the prompt text based on the audio features, and the output is the video data of the generated sign language expressions.
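How the analysis results might be turned into prompt text in Step 4 is sketched below. The `SirenAnalysis` fields, the label values, and the prompt wording are hypothetical; the text says only that the analysis results are used as prompt text. The call to the generative AI model itself is omitted because no interface for it is described.

```python
from dataclasses import dataclass


@dataclass
class SirenAnalysis:
    """Hypothetical container for the output of the audio analysis step."""
    vehicle_type: str            # for example "ambulance"
    direction: str               # for example "rear left"
    peak_frequencies_hz: tuple   # dominant frequencies found by the FFT


def build_prompt(result: SirenAnalysis) -> str:
    """Format the analysis results as prompt text for a generative model."""
    freqs = ", ".join(f"{f:.0f} Hz" for f in result.peak_frequencies_hz)
    return (
        f"Detected siren: {result.vehicle_type} (dominant frequencies: {freqs}). "
        f"Approach direction: {result.direction}. "
        "Generate a sign language video that tells the driver which action to take."
    )


prompt = build_prompt(SirenAnalysis("ambulance", "rear left", (768.0, 960.0)))
```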

[0280] Step 5:

[0281] The server sends the generated sign language video to the terminal. The terminal receives this video data and displays it on the in-vehicle display. As a result, the user can visually recognize the approach of the emergency vehicle and act according to the instructions. The input is the sign language video data sent from the server, and the output is the sign language expression displayed on the terminal display.

[0282] Step 6:

[0283] The user looks at the sign language expression displayed on the terminal display and determines appropriate driving actions according to the current situation. For example, if "Avoid to the right" is displayed, the user follows the direction instruction and operates the vehicle appropriately to ensure the passage of the emergency vehicle. The input is the display information on the terminal, and the output is the user's driving action.

[0284] (Application Example 1)

[0285] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as a "server", and the smart glasses 214 are referred to as a "terminal".

[0286] When a hearing-impaired person drives, it is difficult for them to recognize the approach of an emergency vehicle and respond appropriately. In particular, in an autonomous vehicle, too little visual information is provided to the passengers, which makes it difficult for a hearing-impaired person to respond to an emergency.

[0287] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0288] In this invention, the server includes information collection means for acquiring ambient acoustic data, analysis means for analyzing the acoustic data to identify the acoustic pattern of a specific emergency moving means, and information generation means for generating visual information including operation instruction information based on the identified acoustic pattern. Thereby, it is possible to provide visual information that enables a hearing-impaired person to safely respond to an emergency situation inside an autonomous vehicle.

[0289] "Surroundings" refers to the environment and situation existing outside the target moving body.

[0290] "Acoustic data" refers to information acquired as voice or acoustic signals.

[0291] "Information collection means" refers to a device or method for acquiring necessary data.

[0292] "Specific emergency moving means" refers to moving equipment such as ambulances and police vehicles used in emergencies.

[0293] "Acoustic pattern" refers to the characteristic waveform or signal arrangement of a specific voice or sound.

[0294] "Analysis means" refers to devices and methods used to analyze acquired data and extract necessary information.

[0295] "Action instruction information" refers to information that indicates what action should be taken towards the target.

[0296] "Visual information" refers to information provided in a format that humans can see and recognize.

[0297] "Information generation means" refers to devices or methods that create necessary information based on analysis results.

[0298] "Information output means" refers to devices or methods for presenting generated information to users.

[0299] A system implementing this invention includes a program for analyzing ambient acoustic data and providing visual information to the user based on that data.

[0300] Server Role

[0301] The server begins by acquiring acoustic data. To fulfill this role, the server uses analytical tools to analyze the acoustic data. This analysis employs signal processing techniques such as FFT (Fast Fourier Transform). By analyzing the acoustic data, the server identifies acoustic patterns associated with specific emergency mobility devices and evaluates their approach direction. The data that forms the basis of the acoustic data is collected using speech recognition libraries such as the Google Cloud Speech-to-Text API.

[0302] Terminal role

[0303] The terminal displays the visual information received from the server to the user. The visual information is in a form that is easy to identify at a glance and provides operation instruction information to the user. The information is presented on the terminal's display, and specific instructions such as "Turn left" or "Stop and yield" are shown. A machine learning framework such as TensorFlow is used for information generation.

[0304] Role of the user

[0305] The user checks the visual information displayed on the terminal and takes appropriate actions according to the provided instructions. This plays an important role in assisting appropriate judgment during driving. With visual information, safe driving is possible without relying on hearing.

[0306] Specific example

[0307] As a specific example, when an ambulance is approaching, instruction information such as "Emergency vehicle approaching: Stop and yield" is displayed. This enables the user to respond safely and quickly.

[0308] Example of prompt text for the generative AI model

[0309] "Please describe the process of generating a sign language video and displaying it on the display when the siren of an ambulance is recognized."

[0310] The flow of the specific process in Application Example 1 will be described using FIG. 12.

[0311] Step 1:

[0312] The server receives ambient acoustic data from the terminal. The acoustic data is acquired by a microphone mounted on the vehicle and transmitted to the server. As input, raw audio waveform data is received, and as output, the data is used in the next analysis step.

[0313] Step 2:

[0314] The server uses FFT (Fast Fourier Transform) to analyze the received acoustic data. The input is the acoustic data obtained in step 1, and the frequency components are analyzed by applying FFT to it. The output is the frequency spectrum, which serves as the basis for identifying the acoustic patterns of specific emergency vehicles. This analysis allows for the detection of specific acoustic patterns of emergency vehicles.

[0315] Step 3:

[0316] Based on the analysis results, the server determines the acoustic pattern of the identified emergency vehicle and the corresponding operational instructions based on its approach direction. The input for this step is the acoustic pattern and approach direction identified in step 2, and the output is information containing appropriate operational instructions. A machine learning model based on historical data is used to generate this information.
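The mapping from an identified acoustic pattern and approach direction to an operational instruction can be illustrated with a simple lookup table. This is a stand-in only: the text states that a machine learning model based on historical data generates this information. The table keys, the default message, and the function name `select_instruction` are hypothetical, and the appropriate action in practice depends on local traffic rules.

```python
# Hypothetical lookup table standing in for the learned model.
INSTRUCTIONS = {
    ("ambulance", "behind"): "Emergency vehicle approaching: Stop and yield",
    ("ambulance", "ahead"): "Stop and yield",
    ("police", "behind"): "Pull over to the right and stop",
    ("police", "right"): "Turn left",
}

DEFAULT_INSTRUCTION = "Emergency vehicle nearby: Slow down and check surroundings"


def select_instruction(vehicle_type: str, direction: str) -> str:
    """Return the instruction for a pattern/direction pair, or a default."""
    return INSTRUCTIONS.get((vehicle_type, direction), DEFAULT_INSTRUCTION)


known = select_instruction("ambulance", "behind")
unknown = select_instruction("fire engine", "left")
```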

[0317] Step 4:

[0318] The terminal receives visual information transmitted from the server. The input is visual information, including operational instructions, generated by the server, and the output is the visual presentation of that information on the display. The terminal notifies the user by displaying this information on the screen.

[0319] Step 5:

[0320] The user confirms the visual information displayed on the terminal's screen and takes appropriate driving actions based on the presented instructions. The input is the visual instructions displayed by the terminal, and the output is the user's driving actions based on that information. This allows the user to continue driving safely and consciously.

[0321] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0322] This invention provides a system that enables drivers with hearing impairments to recognize the approach of emergency vehicles and that provides support tailored to the user's emotional state. The system is primarily built around a terminal, server, and user, and incorporates an emotion engine to achieve more personalized and effective support.

[0323] Terminal role:

[0324] The terminal acquires ambient audio data through microphones installed in the vehicle. It can also use facial recognition and biometric sensors to capture the user's emotional state. This information is used to adjust the system to ensure safe and comfortable driving. The collected data is sent to the server for analysis.

[0325] Server role:

[0326] The server analyzes audio data transmitted from the terminal to identify the audio patterns of emergency vehicles. It uses FFT to analyze the frequency components of the audio signal and detect specific audio patterns. This information is also useful in determining the direction from which the emergency vehicle is approaching. Furthermore, the server uses an emotion engine to assess the user's emotional state. This emotional state is used to adjust the content and tone of the generated sign language video, providing instructions appropriate to the user's mental state.

[0327] User roles:

[0328] The driver, as the user, checks sign language videos displayed on the vehicle's screen via a terminal. For example, if the system detects that the user is experiencing stress, a sign language video with more detailed instructions in a gentle and positive tone is provided. Based on this feedback, the user can choose appropriate driving actions in response to the approach of an emergency vehicle.

[0329] In this way, a system that incorporates an emotion engine provides more flexible and effective driving assistance according to the driver's individual state. In this embodiment, by comprehensively analyzing the user's emotions and the surrounding sound environment, it is possible to reduce the driver's psychological burden and create a more comfortable and safe driving environment. For example, if the emotion engine detects that the user is tense, it can provide appropriate instructions such as, "Calm down and turn left."

[0330] The following describes the processing flow.

[0331] Step 1:

[0332] The terminal uses microphones and cameras mounted on the vehicle, or biometric sensors, to acquire ambient audio data and user emotion data. Audio data is used to detect sirens, and emotion data is obtained from the user's facial expressions and heart rate. This data is temporarily stored in a buffer and prepared to be sent to the server.

[0333] Step 2:

[0334] The terminal transmits the collected audio and emotion data to the server. The data is packetized and transmitted in real time so that the server can analyze it in detail.

[0335] Step 3:

[0336] The server analyzes the audio data transmitted from the terminal and extracts frequency components using FFT. This allows it to identify the siren of an emergency vehicle and determine its approaching direction. The direction is determined using the intensity and phase difference of the audio.

[0337] Step 4:

[0338] The server uses an emotion engine to evaluate the user's emotional state. Based on the fed facial expression data and biometric information, it determines whether the user is relaxed or stressed. This allows it to identify the optimal instructions for the user's emotions.
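The relaxed/stressed decision in Step 4 can be illustrated with a simple threshold rule. This is a stand-in for the emotion engine, whose internals are not described here, and it is not a validated stress measure. The function name, the `tension_score` input (a 0 to 1 value assumed to come from facial expression analysis), the 30 % heart rate elevation scale, and the equal weighting are all assumptions.

```python
def assess_emotional_state(heart_rate_bpm: float,
                           resting_rate_bpm: float,
                           tension_score: float) -> str:
    """Classify the user as "stressed" or "relaxed" with a heuristic score.

    The score averages two components, each clipped to the range 0..1:
    heart rate elevation relative to rest (a 30 % elevation counts as 1.0)
    and a facial tension score.
    """
    elevation = (heart_rate_bpm - resting_rate_bpm) / resting_rate_bpm
    heart_component = min(max(elevation / 0.3, 0.0), 1.0)
    face_component = min(max(tension_score, 0.0), 1.0)
    stress = 0.5 * heart_component + 0.5 * face_component
    return "stressed" if stress >= 0.5 else "relaxed"


tense_driver = assess_emotional_state(95, 65, 0.8)
calm_driver = assess_emotional_state(66, 65, 0.1)
```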

[0339] Step 5:

[0340] The server generates appropriate sign language videos based on the results of voice analysis and emotion assessment. For example, if an emergency vehicle is approaching and the user is feeling anxious, the server will prepare a video that includes gentle instructions such as "Stay calm and change direction."

[0341] Step 6:

[0342] The terminal displays sign language videos received from the server on a screen inside the vehicle. This display is positioned in a location easily accessible to the user and configured to ensure visibility. Because it is updated in real time, timely information can be provided.

[0343] Step 7:

[0344] The user checks the sign language video displayed on the screen. Following the instructions, they decide on driving actions and respond to emergency vehicles. Emotionally sensitive instructions allow the user to continue driving with greater confidence.

[0345] (Example 2)

[0346] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0347] For drivers with hearing impairments, it is crucial to quickly and accurately recognize approaching emergency vehicles and respond safely. However, conventional technologies are limited to simple analysis of acoustic signals and do not provide individualized support based on the user's emotional state. As a result, even when users are experiencing stress or anxiety, only uniform information can be provided, which hinders effective driving assistance.

[0348] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0349] In this invention, the server includes an analysis means for analyzing acoustic data to identify the acoustic pattern of an emergency vehicle, a state determination means for evaluating the user's emotional state, and a generation means for generating sign language video based on the identified acoustic pattern and the evaluated emotional state. This makes it possible to provide flexible and optimal driving assistance information according to the user's emotional state.

[0350] "Acquisition means" refers to a function that uses equipment installed in the vehicle to collect acoustic data from the surrounding environment.

[0351] The "analysis means" refers to a function that analyzes acquired acoustic data and identifies the acoustic pattern of a specific emergency vehicle.

[0352] A "state determination means" is a function that evaluates the user's current emotional state based on the user's facial expressions and biometric information.

[0353] The "generation means" is a function for creating sign language video containing appropriate instructional information based on identified sound patterns and the user's emotional state.

[0354] "Display means" refers to a function for displaying the generated sign language video on a visual display device inside the vehicle.

[0355] The "direction estimation means" is a function for estimating the direction of approach of an emergency vehicle from the analyzed acoustic pattern.

[0356] This invention is a system designed to enable drivers with hearing impairments to recognize the approach of emergency vehicles and receive appropriate driving assistance. Its specific configuration and implementation method are described below.

[0357] Terminal role:

[0358] The terminal includes microphones and cameras installed in the vehicle, and through this hardware, it collects ambient acoustic data, user facial expressions, and biometric data. For example, the terminal can acquire external siren sounds in real time using the microphone, and analyze the user's emotional state using the camera and heart rate sensor. The collected data is transmitted to a server using a secure communication method.

[0359] Server role:

[0360] The server analyzes the acoustic data transmitted from the terminal using algorithms such as FFT (Fast Fourier Transform) to identify the acoustic patterns of emergency vehicles. Furthermore, the server uses an emotion engine to evaluate the user's emotional state from the received facial expression data and biosignals. The necessary software for this is acoustic analysis software and an emotion analysis module. For example, FFT is used to extract specific frequency components and detect siren sounds. Deep learning techniques using AI models can be used for emotion analysis.

[0361] Generation means:

[0362] The server then uses a generative AI model to generate appropriate sign language videos in response to the approaching emergency vehicle. The generated videos include instructions tailored to the user's emotional state. Specifically, a user who is feeling anxious will receive instructions in a gentle tone to help them calm down. An example of a prompt might be, "Emergency vehicle approaching. The user is feeling anxious. Please generate a sign language video with gentle instructions to help them calm down."
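A prompt of this form could be assembled from the detected event and the evaluated emotional state, as in the following sketch. The `TONE_BY_STATE` table, its entries other than the "anxious" wording quoted above, and the function name `build_emotion_prompt` are hypothetical.

```python
# Hypothetical mapping from emotional state to the requested instruction style.
TONE_BY_STATE = {
    "anxious": "gentle instructions to help them calm down",
    "calm": "concise instructions",
}

DEFAULT_TONE = "clear instructions"


def build_emotion_prompt(event: str, emotional_state: str) -> str:
    """Combine the detected event and emotional state into prompt text."""
    tone = TONE_BY_STATE.get(emotional_state, DEFAULT_TONE)
    return (
        f"{event}. The user is feeling {emotional_state}. "
        f"Please generate a sign language video with {tone}."
    )


anxious_prompt = build_emotion_prompt("Emergency vehicle approaching", "anxious")
```

For the "anxious" state this reproduces the example prompt quoted in the paragraph above.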

[0363] User roles:

[0364] The user views sign language video displayed inside the vehicle via a terminal and selects driving actions based on it. This video includes information such as the direction of approaching emergency vehicles and specific instructions that take the user's emotions into consideration. This allows the user to respond quickly and with confidence.

[0365] The implementation of this system will reduce the burden on drivers and support safe and smooth driving.

[0366] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0367] Step 1:

[0368] The terminal acquires ambient acoustic data using microphones installed in the vehicle. Specifically, it collects acoustic data in real time and temporarily stores it in a buffer. The input for this step is the ambient acoustic environment, and the output is the acoustic data prepared for use.

[0369] Step 2:

[0370] The terminal acquires the user's facial expressions and biometric data using cameras and heart rate sensors installed inside the vehicle. Specifically, it captures facial expressions with the camera and measures heart rate with the heart rate sensor. The input for this step is the user's facial expressions and biometric signals, and the output is facial expression data and biometric data prepared for analysis.

[0371] Step 3:

[0372] The terminal transmits the collected acoustic data, user facial expression data, and biometric data to the server. Specifically, it encrypts the data using a secure communication protocol and sends it to the server. The input for this step is the acoustic data and the user's facial expression and biometric data, and the output is the data that has reached the server.

[0373] Step 4:

[0374] The server uses FFT (Fast Fourier Transform) to analyze acoustic data and identify the acoustic patterns of emergency vehicles. Specifically, it analyzes the frequency components of the acoustic data to identify siren sounds. The input to this step is the acoustic data sent to the server, and the output is the identified acoustic pattern of the emergency vehicle.
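One possible form of acoustic pattern identification in this step is to look for a siren that alternates between two tones. The sketch below takes the dominant frequency of each short frame with an FFT and counts switches between a high and a low tone. The 960 Hz and 770 Hz tone values, the tolerance, the frame size, and the function names are illustrative assumptions; the text does not specify which pattern features are used.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def dominant_frequency(frame, sample_rate_hz):
    """Return the frequency of the strongest non-DC bin in the frame."""
    mags = [abs(c) for c in fft(frame)[: len(frame) // 2]]
    k = max(range(1, len(mags)), key=mags.__getitem__)
    return k * sample_rate_hz / len(frame)


def looks_like_two_tone_siren(signal, sample_rate_hz, frame_size=256,
                              high_hz=960.0, low_hz=770.0,
                              tolerance_hz=40.0, min_switches=2):
    """Return True if every frame matches one of two tones and they alternate."""
    labels = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        f = dominant_frequency(signal[start:start + frame_size], sample_rate_hz)
        if abs(f - high_hz) <= tolerance_hz:
            labels.append("high")
        elif abs(f - low_hz) <= tolerance_hz:
            labels.append("low")
        else:
            return False  # a frame dominated by some other sound
    switches = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return switches >= min_switches


RATE_HZ = 8192


def tone(freq_hz, n):
    """Generate n samples of a unit-amplitude sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / RATE_HZ) for i in range(n)]


siren = tone(960, 512) + tone(768, 512) + tone(960, 512) + tone(768, 512)
steady = tone(440, 2048)
siren_detected = looks_like_two_tone_siren(siren, RATE_HZ)
steady_detected = looks_like_two_tone_siren(steady, RATE_HZ)
```

Rejecting the whole signal as soon as one frame matches neither tone is deliberately strict; with real road noise a tolerance for a few unmatched frames would probably be needed.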

[0375] Step 5:

[0376] The server uses an emotion engine to analyze the user's facial expression data and biometric data to evaluate the user's emotional state. Specifically, it uses a deep learning-based AI model to determine the emotional state. The input for this step is the user's facial expression data and biometric data, and the output is the determined emotional state of the user.

[0377] Step 6:

[0378] The server uses a generative AI model to generate sign language videos based on acoustic patterns and emotional states. Specifically, prompts are used to instruct the AI model, generating sign language videos that meet the specified conditions. The input for this step is the identified acoustic patterns and evaluated emotional states, and the output is the generated sign language video.

[0379] Step 7:

[0380] The terminal displays sign language video received from the server on the in-car display. Specifically, it uses the display's video playback function to provide the video to the user. The input for this step is the sign language video sent from the server, and the output is the visualized sign language video.

[0381] Step 8:

[0382] The user reviews the displayed sign language video and selects appropriate driving actions based on it. Specifically, they operate the car according to the information provided. The input for this step is the sign language video displayed on the terminal, and the output is the user's specific driving actions.

[0383] (Application Example 2)

[0384] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0385] There is a problem in that drivers with hearing impairments have difficulty quickly recognizing the approach of emergency vehicles and taking appropriate action based on their emotional state. Furthermore, conventional systems provide standardized warnings based only on audio information, which means they cannot respond in a way that is tailored to the emotional state of individual drivers.

[0386] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means. In this invention, the server includes means for acquiring ambient acoustic information, means for analyzing the acoustic information to identify the acoustic pattern of a specific emergency vehicle, and means for generating visual display data including instruction information based on the identified acoustic pattern and the user's emotional state. This allows drivers with hearing impairments to visually confirm the approach of an emergency vehicle, and allows the system to provide a personalized response according to their emotional state at that time.

[0387] "Ambient acoustic information" refers to all sound data acquired from the vehicle's surrounding environment, including emergency vehicle sirens and road noise.

[0388] "Acquisition means" refers to the function of capturing ambient acoustic information using sensors and microphones inside the vehicle.

[0389] The "analysis means" has the function of analyzing acoustic information obtained through the acquisition means and processing it to identify specific sound patterns of emergency vehicles.

[0390] "Generation means" refers to a function that creates visual display data and provides appropriate instructions based on information identified by the analysis means and the user's emotional state.

[0391] "Display means" refers to a function that displays the generated visual display data on a display inside the vehicle or on a similar device to visually communicate information to the driver.

[0392] "Emotional state" refers to the driver's psychological or physiological state, including internal conditions such as stress and tension.

[0393] "Visual display data" refers to digital displays that include instructions and warnings presented to the driver as visual information.

[0394] The system implementing this invention primarily consists of a server and a terminal. The terminal includes a microphone and sensors mounted on the vehicle, used to acquire ambient acoustic information and the driver's emotional state. This information is transmitted to the server via data communication. On the server, an audio analysis algorithm is first used to identify the siren sound of an emergency vehicle from the ambient acoustic information. This can be done using a speech recognition API such as Google Cloud Speech-to-Text.

[0395] Furthermore, to assess the driver's emotional state, biometric information acquired from the device is analyzed using an emotion recognition API such as Affectiva. This makes it possible to generate appropriate instructions based on the driver's stress level and attention level. The generated instruction information is provided as visual display data on the in-vehicle display. The system informs the driver of approaching emergency vehicles and provides action instructions tailored to their current psychological state through visual feedback.

[0396] For example, suppose a driver is driving on a highway. This system constantly monitors the surrounding sounds in the background, and when it detects an emergency vehicle siren, it takes into account the driver's emotional state and visually displays instructions such as "Please remain calm and drive slowly."

[0397] An example of a prompt to the generative AI model is, "Suggest what kind of AR notification would be appropriate to alert the user to the approach of an emergency vehicle, based on the user's emotional state." In this way, this invention provides a system that compensates for the driver's hearing impairment and improves safe driving assistance.

[0398] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0399] Step 1:

[0400] The terminal acquires ambient acoustic information using a microphone mounted on the vehicle. This input data is sent to the server as a raw audio signal.

[0401] Step 2:

[0402] The server uses FFT (Fast Fourier Transform) to decompose the received audio signal data into frequency components and identifies the siren sound of an emergency vehicle. The output provides information on the presence and approach direction of the emergency vehicle.

[0403] Step 3:

[0404] The terminal uses facial recognition and biometric sensors to acquire the driver's current emotional information. This information is sent to the server as data indicating the driver's stress level and concentration level.

[0405] Step 4:

[0406] The server utilizes an emotion recognition API, such as Affectiva, to analyze the received emotional state data. Based on the emotion evaluation, it determines the optimal driving instructions. The output of this process is the specific instructions.

[0407] Step 5:

[0408] The server generates visual display data, creating information that includes instructions best suited to the driver's current situation and emotional state. The generated visual display data is sent to the terminal.

[0409] Step 6:

[0410] The terminal displays the received visual display data on the vehicle's display. By visually confirming it, the driver can recognize the approach of an emergency vehicle and take appropriate driving actions.

[0411] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.

[0412] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0413] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0414] [Third Embodiment]

[0415] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0416] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0417] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0418] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0419] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0420] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0421] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0422] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0423] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0424] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0425] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0426] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0427] This invention is built as an assistance system to help drivers with hearing impairments recognize the approach of emergency vehicles and drive safely. The system is designed to function smoothly in actual anticipated use cases by having the terminal, server, and user each play their respective roles.

[0428] Terminal role:

[0429] The terminal first acquires ambient audio data using a microphone mounted in the vehicle. This audio data is collected in real time, buffered at regular intervals, and then sent to the server. The terminal also shows the sign language videos containing instructional information that it receives from the server on the vehicle's display. This display gives the user quick visual feedback, helping them respond immediately in emergencies.

[0430] Server role:

[0431] The server receives audio data transmitted from the terminal and analyzes the audio waveform using signal processing techniques such as FFT. As a result of the analysis, it identifies audio patterns associated with a specific emergency vehicle and evaluates its approach direction. Based on this information, the server generates an appropriate sign language video. The generated sign language video includes instructional information based on the siren audio pattern and approach direction, prompting the driver to take specific action. The server transmits this sign language video to the terminal, enabling real-time display.

[0432] User roles:

[0433] The driver, as the user, checks the sign language video visually displayed on the terminal and decides on the appropriate driving action according to the current traffic situation. For example, if the video displays the instruction to "turn left," the driver can immediately change direction to avoid obstructing the passage of emergency vehicles. This system makes it possible to continue safe driving without relying on hearing.

[0434] In this way, through a series of processes including the acquisition of audio data on the terminal, audio analysis and sign language video generation on the server, and visual feedback to the user, the present invention provides a safe and smooth vehicle driving experience. For example, in a situation where an ambulance is approaching, a video instruction such as "Emergency vehicle approaching: Stop and yield the right of way" is displayed, allowing the user to respond to the situation safely and quickly.

[0435] The following describes the processing flow.

[0436] Step 1:

[0437] The terminal uses a microphone installed in the vehicle to acquire ambient audio data. The audio data is collected in real time and temporarily stored in a buffer within the terminal. At regular intervals, this audio data is converted into packet format and sent to the server.
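As an illustration of Step 1, the following is a minimal sketch of buffering captured audio and cutting it into packets of a fixed duration. The packet header layout (sequence number and payload length), the class name AudioPacketizer, and the assumed audio format (16 kHz, mono, 16-bit PCM) are hypothetical and are not specified in this disclosure.

```python
import struct
from typing import List, Tuple

# Hypothetical header: sequence number (uint32) and payload length (uint16), big-endian.
HEADER = struct.Struct(">IH")


class AudioPacketizer:
    """Buffers raw PCM bytes and emits fixed-size packets."""

    def __init__(self, frame_bytes: int = 3200) -> None:
        # 3200 bytes = 100 ms of 16 kHz mono 16-bit audio (assumed format).
        if not 0 < frame_bytes <= 0xFFFF:
            raise ValueError("frame_bytes must fit in the uint16 length field")
        self.frame_bytes = frame_bytes
        self._buffer = bytearray()
        self._seq = 0

    def push(self, pcm: bytes) -> List[bytes]:
        """Append captured audio and return every packet that is now complete."""
        self._buffer.extend(pcm)
        packets = []
        while len(self._buffer) >= self.frame_bytes:
            payload = bytes(self._buffer[: self.frame_bytes])
            del self._buffer[: self.frame_bytes]
            packets.append(HEADER.pack(self._seq, len(payload)) + payload)
            self._seq += 1
        return packets


def unpack(packet: bytes) -> Tuple[int, bytes]:
    """Server-side counterpart: recover the sequence number and payload."""
    seq, length = HEADER.unpack_from(packet)
    return seq, packet[HEADER.size : HEADER.size + length]
```

Because each packet holds a fixed number of samples, emitting a packet whenever the buffer fills corresponds to sending at regular intervals. The sequence number would let the server detect lost or reordered packets.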

[0438] Step 2:

[0439] The server receives audio data transmitted from the terminal. The received data is analyzed by a signal processing algorithm. Specifically, the frequency components of the audio signal are extracted using FFT (Fast Fourier Transform), and characteristic frequency patterns corresponding to emergency vehicle sirens are detected.
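The frequency analysis in Step 2 could be sketched as follows. This is a minimal example using a textbook radix-2 FFT and an energy-ratio test; the frequency band (500 Hz to 1800 Hz), the threshold of 0.6, and the function names are assumptions chosen for illustration and do not come from this disclosure.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def fft(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def siren_band_ratio(samples: Sequence[float], sample_rate: float,
                     band: Tuple[float, float] = (500.0, 1800.0)) -> float:
    """Fraction of spectral energy (DC excluded) inside the assumed siren band."""
    spectrum = fft(samples)
    n = len(samples)
    total = in_band = 0.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        power = abs(spectrum[k]) ** 2
        total += power
        if band[0] <= freq <= band[1]:
            in_band += power
    return in_band / total if total > 0 else 0.0


def looks_like_siren(samples: Sequence[float], sample_rate: float,
                     threshold: float = 0.6) -> bool:
    """Crude detector: most of the energy lies in the assumed siren band."""
    return siren_band_ratio(samples, sample_rate) >= threshold
```

A band-energy test of this kind only indicates that a strong tone is present in the band. Distinguishing a siren from other tonal sounds would need the temporal pattern of the tone as well.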

[0440] Step 3:

[0441] The server identifies the approach of an emergency vehicle based on the detected frequency pattern. It also evaluates the direction of the audio signal to determine the direction of the approaching emergency vehicle (e.g., left, right, rear). Based on this information, it determines the instructions to provide to the user.

[0442] Step 4:

[0443] The server generates a sign language video corresponding to the selected instruction. This video is formatted using machine learning algorithms and existing video libraries to include appropriate instruction information (e.g., turn left, stop). The sign language video is then encoded and converted into a transmittable data format.

[0444] Step 5:

[0445] The terminal receives sign language video data transmitted from the server. It verifies the integrity of the video data and prepares it for display on the vehicle's display. When displayed, it is positioned appropriately so as not to obstruct the driver's view.
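The disclosure does not state how the terminal verifies the integrity of the video data. One common approach, shown here purely as an assumption, is for the server to send a SHA-256 digest alongside the video and for the terminal to recompute and compare it. The function names are hypothetical.

```python
import hashlib
import hmac


def digest(video_bytes: bytes) -> str:
    """Hex-encoded SHA-256 digest of the encoded video."""
    return hashlib.sha256(video_bytes).hexdigest()


def verify_video(video_bytes: bytes, expected_digest: str) -> bool:
    """True if the received bytes match the digest sent by the server."""
    # compare_digest performs a constant-time comparison.
    return hmac.compare_digest(digest(video_bytes), expected_digest)
```

A plain digest detects accidental corruption only. Protection against deliberate tampering would need a keyed MAC or a signature, which is outside this sketch.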

[0446] Step 6:

[0447] The driver, as the user, visually confirms the sign language video displayed on the terminal's screen. Based on the instructions presented and the current traffic conditions, the driver decides on appropriate driving actions. This enables safe and swift driving.

[0448] (Example 1)

[0449] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0450] Drivers with hearing impairments may have difficulty auditorily detecting the approach of emergency vehicles, which can hinder them from taking appropriate driving actions. Therefore, there is a need to recognize the approach of emergency vehicles in a way that does not rely on hearing, and to provide appropriate driving assistance.

[0451] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0452] In this invention, the server includes acquisition means for acquiring ambient sounds, analysis means for analyzing the sounds to identify the audio features of a specific emergency vehicle, and generation means for generating sign language expressions including instruction information based on the identified audio features. This enables drivers with hearing impairments to visually recognize the approach of an emergency vehicle and take appropriate driving actions.

[0453] "Acquisition means" refers to devices and methods for collecting ambient sounds, such as microphones, to obtain audio data.

[0454] "Analysis means" refers to a device or method for processing collected audio data and extracting information according to a specific purpose, and includes the step of identifying audio features.

[0455] "Generation means" refers to devices and methods for creating visual information based on the analyzed results, and specifically includes the process of generating sign language expressions.

[0456] "Display means" refers to devices or methods for outputting generated visual information and presenting it to the user for confirmation, and primarily involves the use of displays.

[0457] "Direction evaluation means" refers to devices or methods for determining the position and direction of an approaching emergency vehicle, and may utilize differences in sound arrival time or changes in volume.

[0458] "Audio features" refer to the characteristics of patterns and signals contained within audio data, and are used as information to identify a specific sound source.

[0459] This invention provides a system installed in a vehicle to visually notify the user of the approach of an emergency vehicle, in order to assist drivers with hearing impairments. The following describes the configuration for implementing the system.

[0460] Device configuration:

[0461] The terminal is installed inside the vehicle and includes a series of hardware devices and software programs. Specifically, it uses a microphone to collect ambient sounds. The acquired audio is stored in an internal buffer. This audio data is transmitted to a server using wireless communication technology. In addition, the terminal is equipped with a high-resolution display that shows sign language expressions sent from the server.

[0462] Server functions:

[0463] The server analyzes the audio data received from the terminal. Here, it analyzes the audio waveform using signal processing techniques such as FFT (Fast Fourier Transform) to identify audio features related to emergency vehicles. Furthermore, it evaluates the time difference in sound arrival to estimate the direction of approach. Based on these analysis results, the server generates sign language expressions using a generative AI model. In the generation process, the prompt message used is "Generate sign language videos corresponding to siren patterns."
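The estimation of the approach direction from the difference in sound arrival time could be sketched as follows, assuming two microphones (left and right) sampled synchronously. The cross-correlation search, the microphone spacing parameter, and all function names are assumptions for illustration; the disclosure does not specify the microphone arrangement.

```python
import math
from typing import Sequence

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C


def best_lag(left: Sequence[float], right: Sequence[float], max_lag: int) -> int:
    """Lag in samples that best aligns the channels.

    A positive result means the right channel is delayed, i.e. the sound
    reached the left microphone first.
    """
    n = min(len(left), len(right))
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def bearing_degrees(lag: int, sample_rate: float, mic_spacing_m: float) -> float:
    """Angle from straight ahead: negative is left, positive is right."""
    tau = lag / sample_rate  # seconds by which the right channel lags
    s = max(-1.0, min(1.0, SPEED_OF_SOUND * tau / mic_spacing_m))
    return -math.degrees(math.asin(s))
```

Two microphones cannot distinguish front from rear, so determining a "rear" approach as mentioned in this disclosure would presumably need additional microphones or other cues.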

[0464] User actions:

[0465] The driver, as the user, visually confirms the sign language expression displayed on the terminal's screen and takes appropriate driving actions according to the traffic situation. For example, if the displayed instruction is "pull over to the right and stop," the user will move the vehicle to the right to give priority to emergency vehicles.

[0466] This system will enable drivers with hearing impairments to properly recognize approaching emergency vehicles and continue driving safely.

[0467] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0468] Step 1:

[0469] The terminal acquires ambient sound data using a microphone mounted on the vehicle. This microphone senses various sounds received from outside the vehicle and stores them as audio signals in the terminal's buffer. The input is ambient sound, and the output is the stored digital audio data.

[0470] Step 2:

[0471] The terminal processes the stored audio data and transmits it to the server via wireless communication. In this transmission process, the audio data is packetized and efficiently delivered to the server over the network. The input is the audio data stored in the terminal, and the output is the packets of audio data sent to the server.

[0472] Step 3:

[0473] The server analyzes the received audio data and identifies specific audio features using FFT and peak detection algorithms. The data analysis performed here aims to convert the audio signal from the time domain to the frequency domain and extract feature patterns associated with emergency vehicle sirens. The input is the audio data sent to the server, and the output is the identified audio features and their associated information.
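The peak detection mentioned in Step 3 could be sketched as follows, assuming that a magnitude spectrum has already been computed by FFT. The relative threshold, the tolerance, and the function names are hypothetical, and the tone frequencies to match would have to be supplied by the implementer.

```python
from typing import List, Sequence, Tuple


def find_peaks(magnitudes: Sequence[float], sample_rate: float, fft_size: int,
               min_ratio: float = 0.5) -> List[Tuple[float, float]]:
    """Return (frequency_hz, magnitude) for local maxima at or above
    min_ratio times the largest magnitude. Index k maps to k * sample_rate / fft_size."""
    if len(magnitudes) < 3:
        return []
    floor = min_ratio * max(magnitudes)
    peaks = []
    for k in range(1, len(magnitudes) - 1):
        m = magnitudes[k]
        if m >= floor and m > magnitudes[k - 1] and m >= magnitudes[k + 1]:
            peaks.append((k * sample_rate / fft_size, m))
    return peaks


def matches_tones(peaks: Sequence[Tuple[float, float]], expected_hz: Sequence[float],
                  tolerance_hz: float = 20.0) -> bool:
    """True if every expected tone has a detected peak within the tolerance."""
    return all(any(abs(freq - e) <= tolerance_hz for freq, _ in peaks)
               for e in expected_hz)
```

For example, with a 16-point FFT at 8000 Hz the bin width is 500 Hz, so a spectrum with maxima at indices 2 and 6 yields peaks at 1000 Hz and 3000 Hz.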

[0474] Step 4:

[0475] The server generates sign language expressions using a generative AI model based on the analysis results of the audio features. In this process, the analysis results are supplied to the AI model as a prompt, and the model outputs a sign language video containing appropriate action instructions for the driver. The content of this video corresponds to the siren pattern and the direction of approach. The input is a prompt based on the audio features, and the output is video data of the generated sign language expression.

[0476] Step 5:

[0477] The server sends the generated sign language video to the terminal. The terminal receives this video data and displays it on the in-vehicle display. This allows the user to visually recognize the approach of an emergency vehicle and act according to the instructions. The input is the sign language video data sent from the server, and the output is the sign language expression displayed on the terminal's display.

[0478] Step 6:

[0479] The user sees the sign language instruction displayed on the terminal's screen and decides on the appropriate driving action based on the current situation. For example, if "move to the right" is displayed, the user follows the instruction and maneuvers the vehicle appropriately to allow emergency vehicles to pass. The input is the information displayed on the terminal, and the output is the user's driving action.

[0480] (Application Example 1)

[0481] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0482] There is a problem in that hearing-impaired drivers have difficulty recognizing approaching emergency vehicles and responding appropriately. In particular, in autonomous vehicles, little visual information is provided to the occupants, which makes it difficult for hearing-impaired drivers to respond to emergencies.

[0483] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0484] In this invention, the server includes information gathering means for acquiring ambient acoustic data, analysis means for analyzing the acoustic data to identify the acoustic pattern of a specific emergency mobility device, and information generation means for generating visual information including operation instruction information based on the identified acoustic pattern. This enables the provision of visual information that allows hearing-impaired individuals to safely respond to emergencies inside an autonomous vehicle.

[0485] "Surroundings" refers to the environment and circumstances that exist outside the moving object in question.

[0486] "Acoustic data" refers to information acquired as sound or acoustic signals.

[0487] "Information gathering means" refers to devices and methods used to acquire necessary data.

[0488] "Specific emergency mobility device" refers to a vehicle used in emergencies, such as an ambulance or a police vehicle.

[0489] An "acoustic pattern" refers to the characteristic waveform or signal arrangement of a particular sound or audio.

[0490] "Analysis means" refers to devices and methods used to analyze acquired data and extract necessary information.

[0491] "Operation instruction information" refers to information that indicates what action should be taken towards the target.

[0492] "Visual information" refers to information provided in a format that humans can see and recognize.

[0493] "Information generation means" refers to devices or methods that create necessary information based on analysis results.

[0494] "Information output means" refers to devices or methods for presenting generated information to users.

[0495] A system implementing this invention includes a program for analyzing ambient acoustic data and providing visual information to the user based on that data.

[0496] Server Role

[0497] The server begins by acquiring acoustic data. To fulfill this role, the server uses analytical tools to analyze the acoustic data. This analysis employs signal processing techniques such as FFT (Fast Fourier Transform). By analyzing the acoustic data, the server identifies acoustic patterns associated with specific emergency mobility devices and evaluates their approach direction. The data that forms the basis of the acoustic data is collected using speech recognition libraries such as the Google Cloud Speech-to-Text API.

[0498] Terminal role

[0499] The terminal displays visual information received from the server to the user. This visual information is presented in an easily recognizable format and provides the user with action instructions. Information is conveyed visually through the terminal's display, showing specific instructions such as "turn left" or "stop and yield." Models built with machine learning frameworks such as TensorFlow are used to generate this information.

[0500] User roles

[0501] Users review the visual information displayed on the device and take appropriate action according to the provided instructions. This plays a crucial role in helping them make sound judgments while driving. Visual information enables safe driving without relying on hearing.

[0502] Specific example

[0503] For example, when an ambulance is approaching, a message such as "Emergency vehicle approaching: Stop and yield the right of way" will be displayed. This allows the user to respond safely and quickly.

[0504] Example of a Generative AI Model Prompt

[0505] "Please describe the process of generating a sign language video and displaying it on the screen when an ambulance siren is detected."

[0506] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0507] Step 1:

[0508] The server receives ambient acoustic data from the terminal. This acoustic data is acquired by microphones mounted on the vehicle and transmitted to the server. The server receives raw audio waveform data as input and passes this data as output to the next analysis step.

[0509] Step 2:

[0510] The server uses FFT (Fast Fourier Transform) to analyze the received acoustic data. The input is the acoustic data obtained in step 1, and the frequency components are analyzed by applying FFT to it. The output is the frequency spectrum, which serves as the basis for identifying the acoustic patterns of specific emergency vehicles. This analysis allows for the detection of specific acoustic patterns of emergency vehicles.

[0511] Step 3:

[0512] Based on the analysis results, the server determines the operational instructions that correspond to the identified acoustic pattern of the emergency vehicle and its approach direction. The input for this step is the acoustic pattern and approach direction identified in step 2, and the output is information containing appropriate operational instructions. A machine learning model based on historical data is used to generate this information.
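To make the inputs and outputs of Step 3 concrete, the following sketch maps a detected acoustic pattern and approach direction to an instruction. The disclosure describes a machine learning model trained on historical data for this step; the fixed lookup table below is a deliberately simple stand-in, and its entries are illustrative placeholders rather than recommended driving actions.

```python
from typing import Optional

# Hypothetical rule table standing in for the learned model described in the text.
RULES = {
    "left": "Move to the right",
    "right": "Turn left",
    "rear": "Pull over and stop",
}
DEFAULT = "Emergency vehicle approaching: Stop and yield the right of way"


def select_instruction(pattern: Optional[str], direction: Optional[str]) -> Optional[str]:
    """Return the instruction text, or None when no siren pattern was identified."""
    if pattern is None:
        return None  # nothing detected, so nothing is displayed
    # Fall back to a generic instruction when the direction is unknown.
    return RULES.get(direction, DEFAULT)
```

Returning a generic instruction when the direction is unknown keeps the driver informed even if direction estimation fails.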

[0513] Step 4:

[0514] The terminal receives visual information transmitted from the server. The input is visual information, including operational instructions, generated by the server, and the output is the visual presentation of that information on the display. The terminal notifies the user by displaying this information on the screen.

[0515] Step 5:

[0516] The user confirms the visual information displayed on the terminal's screen and takes appropriate driving actions based on the presented instructions. The input is the visual instructions displayed by the terminal, and the output is the user's driving actions based on that information. This allows the user to continue driving safely and attentively.

[0517] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0518] This invention provides a system that enables drivers with hearing impairments to recognize the approach of emergency vehicles and that provides support tailored to the user's emotional state. The system is primarily built around a terminal, server, and user, and incorporates an emotion engine to achieve more personalized and effective support.

[0519] Terminal role:

[0520] The terminal acquires ambient audio data through microphones installed in the vehicle. It can also use facial recognition and biometric sensors to understand the user's emotional state. This information is used to adjust the system to ensure safe and comfortable driving. The collected data is sent to the server for analysis.

[0521] Server role:

[0522] The server analyzes audio data transmitted from the terminal to identify the audio patterns of emergency vehicles. It uses FFT to analyze the frequency components of the audio signal and detect specific audio patterns. This information is also useful in determining the direction from which the emergency vehicle is approaching. Furthermore, the server uses an emotion engine to assess the user's emotional state. This emotional state is used to adjust the content and tone of the generated sign language video, providing instructions appropriate to the user's mental state.

[0523] User roles:

[0524] The driver, as the user, checks sign language videos displayed on the vehicle's screen via a terminal. For example, if the system detects that the user is experiencing stress, a sign language video with more detailed instructions in a gentle and positive tone is provided. Based on this feedback, the user can choose appropriate driving actions in response to the approach of an emergency vehicle.

[0525] In this way, a system that incorporates an emotion engine provides more flexible and effective driving assistance according to the driver's individual state. In this embodiment, by comprehensively analyzing the user's emotions and the surrounding sound environment, it is possible to reduce the driver's psychological burden and create a more comfortable and safe driving environment. For example, if the emotion engine detects that the user is tense, it can provide appropriate instructions such as, "Calm down and turn left."

[0526] The following describes the processing flow.

[0527] Step 1:

[0528] The terminal uses microphones and cameras mounted on the vehicle, or biometric sensors, to acquire ambient audio data and user emotion data. Audio data is used to detect sirens, and emotion data is obtained from the user's facial expressions and heart rate. This data is temporarily stored in a buffer and prepared to be sent to the server.

[0529] Step 2:

[0530] The terminal transmits the collected audio and emotion data to the server. The data is transmitted in real time and packetized so that the server can analyze it in detail.

[0531] Step 3:

[0532] The server analyzes the audio data transmitted from the terminal and extracts frequency components using FFT. This allows it to identify the siren of an emergency vehicle and determine its approaching direction. The direction is determined using the intensity and phase difference of the audio.
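The intensity-based part of the direction determination in Step 3 could be sketched as a level comparison between a left and a right microphone channel. The two-channel arrangement, the 3 dB margin, and the function names are assumptions for illustration only.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one channel."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_difference_db(left: Sequence[float], right: Sequence[float]) -> float:
    """Level of the right channel relative to the left, in decibels."""
    l, r = rms(left), rms(right)
    if l == 0.0 or r == 0.0:
        raise ValueError("cannot compare against a silent channel")
    return 20.0 * math.log10(r / l)


def side_from_level(left: Sequence[float], right: Sequence[float],
                    margin_db: float = 3.0) -> str:
    """Classify the louder side; differences within the margin count as 'center'."""
    diff = level_difference_db(left, right)
    if diff > margin_db:
        return "right"
    if diff < -margin_db:
        return "left"
    return "center"
```

Level differences are easily distorted by the vehicle body and reflections, which may be why the text pairs intensity with phase difference rather than relying on either alone.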

[0533] Step 4:

[0534] The server uses an emotion engine to evaluate the user's emotional state. Based on the supplied facial expression data and biometric information, it determines whether the user is relaxed or stressed. This allows it to identify the optimal instructions for the user's emotions.
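The relaxed-or-stressed decision in Step 4 could be illustrated as follows. The emotion engine in this disclosure is a model-based component; the threshold rule below is only a stand-in to show the inputs and output, and the field names and threshold values are hypothetical and not validated in any way.

```python
from dataclasses import dataclass


@dataclass
class DriverReading:
    heart_rate_bpm: float
    resting_rate_bpm: float
    # Hypothetical score in [0.0, 1.0] from a facial expression model.
    tense_expression_score: float


def estimate_state(reading: DriverReading) -> str:
    """Return 'stressed' or 'relaxed' using simple illustrative thresholds."""
    elevated = reading.heart_rate_bpm >= 1.2 * reading.resting_rate_bpm
    tense = reading.tense_expression_score >= 0.6
    return "stressed" if (elevated or tense) else "relaxed"
```

In a real system the two signals would more likely be combined by a trained model than by fixed thresholds.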

[0535] Step 5:

[0536] The server generates appropriate sign language videos based on the results of the audio analysis and the emotion assessment. For example, if an emergency vehicle is approaching and the user is feeling anxious, the server will prepare a video that includes gentle instructions such as "Stay calm and change direction."

[0537] Step 6:

[0538] The terminal displays sign language videos received from the server on a screen inside the vehicle. This display is positioned in a location easily accessible to the user and configured to ensure visibility. Because it is updated in real time, timely information can be provided.

[0539] Step 7:

[0540] The user checks the sign language video displayed on the screen. Following the instructions, they decide on driving actions and respond to emergency vehicles. Emotionally sensitive instructions allow the user to continue driving with greater confidence.

[0541] (Example 2)

[0542] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0543] For drivers with hearing impairments, it is crucial to quickly and accurately recognize approaching emergency vehicles and respond safely. However, conventional technologies are limited to simple analysis of acoustic signals and do not provide individualized support based on the user's emotional state. As a result, even when users are experiencing stress or anxiety, only uniform information can be provided, which hinders effective driving assistance.

[0544] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0545] In this invention, the server includes an analysis means for analyzing acoustic data to identify the acoustic pattern of an emergency vehicle, a state determination means for evaluating the user's emotional state, and a generation means for generating sign language video based on the identified acoustic pattern and the evaluated emotional state. This makes it possible to provide flexible and optimal driving assistance information according to the user's emotional state.

[0546] "Acquisition means" refers to a function that uses equipment installed in the vehicle to collect acoustic data from the surrounding environment.

[0547] The "analysis means" refers to a function that analyzes acquired acoustic data and identifies the acoustic pattern of a specific emergency vehicle.

[0548] A "state determination means" is a function that evaluates the user's current emotional state based on the user's facial expressions and biometric information.

[0549] The "generation means" is a function for creating sign language video containing appropriate instructional information based on identified sound patterns and the user's emotional state.

[0550] "Display means" refers to a function for displaying the generated sign language video on a visual display device inside the vehicle.

[0551] The "direction estimation means" is a function for estimating the direction of approach of an emergency vehicle from the analyzed acoustic pattern.

[0552] This invention is a system designed to enable drivers with hearing impairments to recognize the approach of emergency vehicles and receive appropriate driving assistance. Its specific configuration and implementation method are described below.

[0553] Terminal role:

[0554] The terminal includes microphones and cameras installed in the vehicle, and through this hardware, it collects ambient acoustic data, user facial expressions, and biometric data. For example, the terminal can acquire external siren sounds in real time using the microphone, and capture data on the user's emotional state using the camera and heart rate sensor. The collected data is transmitted to the server using a secure communication method.

[0555] Server role:

[0556] The server analyzes the acoustic data transmitted from the terminal using algorithms such as FFT (Fast Fourier Transform) to identify the acoustic patterns of emergency vehicles. Furthermore, the server uses an emotion engine to evaluate the user's emotional state from the received facial expression data and biosignals. This requires acoustic analysis software and an emotion analysis module. For example, FFT is used to extract specific frequency components and detect siren sounds. Deep learning techniques using AI models can be used for emotion analysis.

[0557] Generation means:

[0558] The server then uses a generative AI model to generate appropriate sign language videos in response to the approaching emergency vehicle. The generated videos include instructions tailored to the user's emotional state. Specifically, a user who is feeling anxious will receive instructions in a gentle tone to help them calm down. An example of a prompt might be, "Emergency vehicle approaching. The user is feeling anxious. Please generate a sign language video with gentle instructions to help them calm down."
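Assembling such a prompt from the analysis results could be sketched as follows. The function name, the direction sentence, and the wording used when the user is not anxious are assumptions; only the anxious-case wording follows the example prompt given in the text.

```python
from typing import Optional


def build_prompt(direction: Optional[str], emotion: Optional[str]) -> str:
    """Compose the text prompt passed to the generative AI model."""
    parts = ["Emergency vehicle approaching."]
    if direction:
        # Hypothetical extra sentence; the example prompt in the text omits direction.
        parts.append(f"It is approaching from the {direction}.")
    if emotion == "anxious":
        parts.append("The user is feeling anxious.")
        parts.append("Please generate a sign language video with gentle "
                     "instructions to help them calm down.")
    else:
        parts.append("Please generate a sign language video with clear "
                     "driving instructions.")
    return " ".join(parts)
```

Calling build_prompt(None, "anxious") reproduces the example prompt quoted above.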

[0559] User roles:

[0560] The user views sign language video displayed inside the vehicle via a terminal and selects driving actions based on it. This video includes information such as the direction of approaching emergency vehicles and specific instructions that take the user's emotions into consideration. This allows the user to respond quickly and with confidence.

[0561] The implementation of this system will reduce the burden on drivers and support safe and smooth driving.

[0562] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0563] Step 1:

[0564] The terminal acquires ambient acoustic data using microphones installed in the vehicle. Specifically, it collects acoustic data in real time and temporarily stores it in a buffer. The input for this step is the ambient acoustic environment, and the output is the acoustic data prepared for use.

[0565] Step 2:

[0566] The terminal acquires the user's facial expressions and biometric data using cameras and heart rate sensors installed inside the vehicle. Specifically, it captures facial expressions with the camera and measures heart rate with the heart rate sensor. The input for this step is the user's facial expressions and biometric signals, and the output is facial expression data and biometric data prepared for analysis.

[0567] Step 3:

[0568] The terminal transmits the collected acoustic data, user facial expression data, and biometric data to the server. Specifically, it encrypts the data using a secure communication protocol and sends it to the server. The input for this step is the acoustic data and the user's facial expression and biometric data, and the output is the data that has reached the server.

[0569] Step 4:

[0570] The server uses FFT (Fast Fourier Transform) to analyze acoustic data and identify the acoustic patterns of emergency vehicles. Specifically, it analyzes the frequency components of the acoustic data to identify siren sounds. The input to this step is the acoustic data sent to the server, and the output is the identified acoustic pattern of the emergency vehicle.

[0571] Step 5:

[0572] The server uses an emotion engine to analyze the user's facial expression data and biometric data to evaluate the user's emotional state. Specifically, it uses a deep learning-based AI model to determine the emotional state. The input for this step is the user's facial expression data and biometric data, and the output is the determined emotional state of the user.

[0573] Step 6:

[0574] The server uses a generative AI model to generate sign language videos based on acoustic patterns and emotional states. Specifically, prompts are used to instruct the AI model, generating sign language videos that meet the specified conditions. The input for this step is the identified acoustic patterns and evaluated emotional states, and the output is the generated sign language video.

[0575] Step 7:

[0576] The terminal displays sign language video received from the server on the in-car display. Specifically, it uses the display's video playback function to provide the video to the user. The input for this step is the sign language video sent from the server, and the output is the visualized sign language video.

[0577] Step 8:

[0578] The user reviews the displayed sign language video and selects appropriate driving actions based on it. Specifically, they operate the car according to the information provided. The input for this step is the sign language video displayed on the terminal, and the output is the user's specific driving actions.

[0579] (Application Example 2)

[0580] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0581] There is a problem in that drivers with hearing impairments have difficulty quickly recognizing the approach of emergency vehicles and taking appropriate action based on their emotional state. Furthermore, conventional systems provide standardized warnings based only on audio information, which means they cannot respond in a way that is tailored to the emotional state of individual drivers.

[0582] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means. In this invention, the server includes means for acquiring ambient acoustic information, means for analyzing the acoustic information to identify the sound pattern of a specific emergency vehicle, and means for generating visual display data including instruction information based on the identified sound pattern and the user's emotional state. This allows drivers with hearing impairments to visually confirm the approach of an emergency vehicle and to receive a personalized response according to their emotional state at that time.

[0583] "Ambient acoustic information" refers to all sound data acquired from the vehicle's surrounding environment, including emergency vehicle sirens and road noise.

[0584] "Acquisition means" refers to the function of capturing ambient acoustic information using sensors and microphones inside the vehicle.

[0585] The "analysis means" has the function of analyzing acoustic information obtained through the acquisition means and processing it to identify specific sound patterns of emergency vehicles.

[0586] "Generation means" refers to a function that creates visual display data and provides appropriate instructions based on information identified by the analysis means and the user's emotional state.

[0587] "Display means" refers to a function that displays the generated visual display data on a display inside the vehicle or on a similar device to visually communicate information to the driver.

[0588] "Emotional state" refers to the driver's psychological or physiological state, including internal conditions such as stress and tension.

[0589] "Visual display data" refers to digital displays that include instructions and warnings presented to the driver as visual information.

[0590] The system implementing this invention primarily consists of a server and a terminal. The terminal includes a microphone and sensors mounted on the vehicle, used to acquire ambient acoustic information and the driver's emotional state. This information is transmitted to the server via data communication. On the server, an audio analysis algorithm is first used to identify the siren sound of an emergency vehicle from the ambient acoustic information. This can be done using a speech recognition API such as Google Cloud Speech-to-Text.

[0591] Furthermore, to assess the driver's emotional state, biometric information acquired from the device is analyzed using an emotion recognition API such as Affectiva. This makes it possible to generate appropriate instructions based on the driver's stress level and attention level. The generated instruction information is provided as visual display data on the in-vehicle display. The system informs the driver of approaching emergency vehicles and provides action instructions tailored to their current psychological state through visual feedback.

[0592] For example, suppose a driver is driving on a highway. This system constantly monitors the surrounding sounds in the background, and when it detects an emergency vehicle siren, it takes into account the driver's emotional state and visually displays instructions such as "Please remain calm and drive slowly."

[0593] An example of a prompt to the generative AI model is, "Suggest what kind of AR notification would be appropriate to alert the user to the approach of an emergency vehicle, based on the user's emotional state." In this way, this invention provides a system that compensates for the driver's hearing impairment and improves safe driving assistance.

[0594] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0595] Step 1:

[0596] The terminal acquires ambient acoustic information using a microphone mounted on the vehicle. This input data is sent to the server as a raw audio signal.

[0597] Step 2:

[0598] The server uses FFT (Fast Fourier Transform) to decompose the received audio signal data into frequency components and identifies the siren sound of an emergency vehicle. The output provides information on the presence and approach direction of the emergency vehicle.
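
As a minimal illustrative sketch of the frequency decomposition in Step 2, the following pure-Python example applies a radix-2 FFT to one audio frame and checks whether the strongest frequency component falls inside a siren band. The function names, the band limits in `SIREN_BAND_HZ`, and the single-peak criterion are assumptions introduced for illustration and are not specified in this disclosure.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def dominant_frequency(samples, sample_rate):
    """Return the frequency in Hz of the strongest non-DC bin below Nyquist."""
    spectrum = fft(samples)
    half = len(samples) // 2
    magnitudes = [abs(c) for c in spectrum[1:half]]
    peak_bin = magnitudes.index(max(magnitudes)) + 1
    return peak_bin * sample_rate / len(samples)

# Assumed band for illustration; actual sirens differ by region and vehicle type.
SIREN_BAND_HZ = (500.0, 1800.0)

def looks_like_siren(samples, sample_rate, band=SIREN_BAND_HZ):
    """Hypothetical criterion: the dominant component lies inside the band."""
    frequency = dominant_frequency(samples, sample_rate)
    return band[0] <= frequency <= band[1]
```

A practical detector would likely examine several consecutive frames rather than a single one, since a siren is characterized by how its frequency changes over time.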

[0599] Step 3:

[0600] The terminal uses facial recognition of the driver and biometric sensors to acquire the driver's current emotional information. This information is sent to the server as data indicating the driver's stress level and concentration level.

[0601] Step 4:

[0602] The server utilizes an emotion recognition API, such as Affectiva, to analyze the received emotional state data. Based on the emotion evaluation, it determines the optimal driving instructions. The output of this process is the specific instructions.

[0603] Step 5:

[0604] The server generates visual display data, creating information that includes instructions best suited to the driver's current situation and emotional state. The generated visual display data is sent to the terminal.
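
The generation of visual display data in Step 5 could, for example, be sketched as follows. The payload fields, the stress threshold of 0.6, and the mapping from stress level to wording are hypothetical; only the example phrases themselves are taken from this description.

```python
import json

# Example phrases from the description; the selection rule below is an assumption.
CALM_INSTRUCTION = "Please remain calm and drive slowly."
DEFAULT_INSTRUCTION = "Stop and yield the right of way."

def build_display_data(direction, stress_level, stress_threshold=0.6):
    """Return a JSON string describing what the in-vehicle display should show.

    stress_level is assumed to be a score in [0, 1] produced by the
    emotion analysis of Step 4.
    """
    stressed = stress_level >= stress_threshold
    payload = {
        "headline": f"Emergency vehicle approaching from the {direction}",
        "instruction": CALM_INSTRUCTION if stressed else DEFAULT_INSTRUCTION,
        "tone": "gentle" if stressed else "standard",
    }
    return json.dumps(payload)
```

Serializing to JSON is one possible choice that keeps the data sent to the terminal independent of how the terminal renders it.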

[0605] Step 6:

[0606] The terminal displays the received visual display data on the vehicle's display. By visually confirming this, the driver can understand that an emergency vehicle is approaching and take appropriate driving actions.

[0607] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0608] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, and inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0609] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0610] [Fourth Embodiment]

[0611] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0612] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0613] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0614] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0615] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0616] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0617] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0618] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0619] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0620] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0621] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0622] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0623] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0624] This invention is built as an assistance system to help drivers with hearing impairments recognize the approach of emergency vehicles and drive safely. The system is designed to function smoothly in actual anticipated use cases by having the terminal, server, and user each play their respective roles.

[0625] Terminal role:

[0626] The terminal first acquires ambient audio data using a microphone mounted in the vehicle. This audio data is collected in real time, buffered at regular timeframes, and then sent to the server. The terminal also displays sign language videos containing instructional information received from the server on the vehicle's display. This display provides the user with quick visual feedback, helping them to respond immediately in emergencies.

[0627] Server role:

[0628] The server receives audio data transmitted from the terminal and analyzes the audio waveform using signal processing techniques such as FFT. As a result of the analysis, it identifies audio patterns associated with a specific emergency vehicle and evaluates its approach direction. Based on this information, the server generates an appropriate sign language video. The generated sign language video includes instructional information based on the siren audio pattern and approach direction, prompting the driver to take specific action. The server transmits this sign language video to the terminal, enabling real-time display.

[0629] User roles:

[0630] The driver, as the user, checks the sign language video visually displayed on the terminal and decides on the appropriate driving action according to the current traffic situation. For example, if the video displays the instruction to "turn left," the driver can immediately change direction to avoid obstructing the passage of emergency vehicles. This system makes it possible to continue safe driving without relying on hearing.

[0631] In this way, through a series of processes including the acquisition of audio data on the terminal, audio analysis and sign language video generation on the server, and visual feedback to the user, the present invention provides a safe and smooth vehicle driving experience. For example, in a situation where an ambulance is approaching, a video instruction such as "Emergency vehicle approaching: Stop and yield the right of way" is displayed, allowing the user to respond to the situation safely and quickly.

[0632] The following describes the processing flow.

[0633] Step 1:

[0634] The terminal uses a microphone installed in the vehicle to acquire ambient audio data. The audio data is collected in real time and temporarily stored in a buffer within the terminal. At regular intervals, this audio data is converted into packet format and sent to the server.
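
The buffering and packetization in Step 1 can be illustrated by the sketch below. It assumes that "regular intervals" corresponds to a fixed number of samples per packet at a constant sampling rate; the class name `AudioBuffer` and the packet layout are hypothetical.

```python
class AudioBuffer:
    """Accumulates samples and emits fixed-size packets with sequence numbers."""

    def __init__(self, packet_samples):
        self.packet_samples = packet_samples
        self._pending = []
        self._sequence = 0

    def push(self, samples):
        """Append new samples and return the list of packets that are now complete."""
        self._pending.extend(samples)
        packets = []
        while len(self._pending) >= self.packet_samples:
            chunk = self._pending[:self.packet_samples]
            del self._pending[:self.packet_samples]
            packets.append({"seq": self._sequence, "samples": chunk})
            self._sequence += 1
        return packets

    def pending_count(self):
        """Number of buffered samples not yet placed in a packet."""
        return len(self._pending)
```

The sequence number is included so that the server could detect lost or reordered packets, which is a design choice of this sketch rather than a requirement stated above.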

[0635] Step 2:

[0636] The server receives audio data transmitted from the terminal. The received data is analyzed by a signal processing algorithm. Specifically, the frequency components of the audio signal are extracted using FFT (Fast Fourier Transform), and characteristic frequency patterns corresponding to emergency vehicle sirens are detected.

[0637] Step 3:

[0638] The server identifies the approach of an emergency vehicle based on the detected frequency pattern. It also evaluates the direction of the audio signal to determine the direction of the approaching emergency vehicle (e.g., left, right, rear). Based on this information, it determines the instructions to provide to the user.

[0639] Step 4:

[0640] The server generates a sign language video corresponding to the selected instruction. This video is formatted using machine learning algorithms and existing video libraries to include appropriate instruction information (e.g., turn left, stop). The sign language video is then encoded and converted into a transmittable data format.

[0641] Step 5:

[0642] The terminal receives sign language video data transmitted from the server. It verifies the integrity of the video data and prepares it for display on the vehicle's display. When displayed, it is positioned appropriately so as not to obstruct the driver's view.
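
One possible way to perform the integrity verification in Step 5 is to compare a digest sent with the video data against a digest recomputed on the terminal. The use of SHA-256 and the message layout below are assumptions made for illustration; this disclosure does not specify the verification method.

```python
import hashlib
import hmac

def attach_digest(video_bytes):
    """Server side (hypothetical): bundle the video data with its SHA-256 digest."""
    return {"payload": video_bytes, "sha256": hashlib.sha256(video_bytes).hexdigest()}

def verify_integrity(message):
    """Terminal side: recompute the digest and compare it with the received one."""
    expected = hashlib.sha256(message["payload"]).hexdigest()
    return hmac.compare_digest(expected, message["sha256"])
```

A plain digest detects accidental corruption in transit; protecting against deliberate tampering would additionally require a keyed signature, which is outside the scope of this sketch.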

[0643] Step 6:

[0644] The driver, as the user, visually confirms the sign language video displayed on the terminal's screen. Based on the information presented, and considering the current traffic conditions and instructions on the display, they decide on appropriate driving actions. This enables safe and swift driving.

[0645] (Example 1)

[0646] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0647] Drivers with hearing impairments may have difficulty auditorily detecting the approach of emergency vehicles, which can hinder them from taking appropriate driving actions. Therefore, there is a need to recognize the approach of emergency vehicles in a way that does not rely on hearing, and to provide appropriate driving assistance.

[0648] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0649] In this invention, the server includes acquisition means for acquiring ambient sounds, analysis means for analyzing the sounds to identify the audio features of a specific emergency vehicle, and generation means for generating sign language expressions including instruction information based on the identified audio features. This enables drivers with hearing impairments to visually recognize the approach of an emergency vehicle and take appropriate driving actions.

[0650] "Acquisition means" refers to devices and methods for collecting ambient sounds, such as microphones, to obtain audio data.

[0651] "Analysis means" refers to a device or method for processing collected audio data and extracting information according to a specific purpose, and includes the step of identifying audio features.

[0652] "Generation means" refers to devices and methods for creating visual information based on the analyzed results, and specifically includes the process of generating sign language expressions.

[0653] "Display means" refers to devices or methods for outputting generated visual information and presenting it to the user for confirmation, and primarily involves the use of displays.

[0654] "Direction evaluation means" refers to devices or methods for determining the position and direction of an approaching emergency vehicle, and may utilize differences in sound arrival time or changes in volume.

[0655] "Audio features" refer to the characteristics of patterns and signals contained within audio data, and are used as information to identify a specific sound source.

[0656] This invention provides a system installed in a vehicle to visually notify the user of the approach of an emergency vehicle, in order to assist drivers with hearing impairments. The following describes the configuration for implementing the system.

[0657] Device configuration:

[0658] The terminal is installed inside the vehicle and includes a series of hardware devices and software programs. Specifically, it uses a microphone to collect ambient sounds. The acquired audio is stored in an internal buffer. This audio data is transmitted to a server using wireless communication technology. In addition, the terminal is equipped with a high-resolution display that shows sign language expressions sent from the server.

[0659] Server functions:

[0660] The server analyzes the audio data received from the terminal. Here, it analyzes the audio waveform using signal processing techniques such as FFT (Fast Fourier Transform) to identify audio features related to emergency vehicles. Furthermore, it evaluates the time difference in sound arrival to estimate the direction of approach. Based on these analysis results, the server generates sign language expressions using a generative AI model. In the generation process, the prompt message used is "Generate sign language videos corresponding to siren patterns."
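
The evaluation of the time difference in sound arrival can be sketched, under the assumption of two microphones mounted on the left and right of the vehicle, as a search for the lag that maximizes the cross-correlation of the two channels. The two-microphone layout, the function names, and the coarse left/right classification are hypothetical.

```python
def estimate_delay(left, right, max_lag):
    """Return the lag (in samples) maximizing sum(left[i] * right[i + lag]).

    A positive result means the right channel lags the left channel.
    Lags are tried in order of increasing magnitude so ties prefer lag 0.
    """
    n = len(left)
    best_lag, best_score = 0, float("-inf")
    for lag in sorted(range(-max_lag, max_lag + 1), key=abs):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def direction_from_delay(lag):
    """Map the sign of the delay to a coarse direction (assumed mic layout)."""
    if lag > 0:
        return "left"   # sound reached the left microphone first
    if lag < 0:
        return "right"  # sound reached the right microphone first
    return "front_or_rear"
```

With only two microphones the front and rear cannot be distinguished from the delay alone, so distinguishing a rear approach would require additional microphones or other cues.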

[0661] User actions:

[0662] The driver, as the user, visually confirms the sign language expression displayed on the terminal's screen and takes appropriate driving actions according to the traffic situation. For example, if the displayed instruction is "pull over to the right and stop," the user will move the vehicle to the right to give priority to emergency vehicles.

[0663] This system will enable drivers with hearing impairments to properly recognize approaching emergency vehicles and continue driving safely.

[0664] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0665] Step 1:

[0666] The terminal acquires ambient sound data using a microphone mounted on the vehicle. This microphone senses various sounds received from outside the vehicle and stores them as audio signals in the terminal's buffer. The input is ambient sound, and the output is the stored digital audio data.

[0667] Step 2:

[0668] The terminal processes the stored audio data and transmits it to the server via wireless communication. In this transmission process, the audio data is packetized and efficiently delivered to the server over the network. The input is the audio data stored in the terminal, and the output is the packets of audio data sent to the server.

[0669] Step 3:

[0670] The server analyzes the received audio data and identifies specific audio features using FFT and peak detection algorithms. The data analysis performed here aims to convert the audio signal from the time domain to the frequency domain and extract feature patterns associated with emergency vehicle sirens. The input is the audio data sent to the server, and the output is the identified audio features and their associated information.
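
As an illustrative sketch of the peak detection in Step 3, the example below finds local maxima in a magnitude spectrum that has already been obtained by FFT and checks whether two expected siren tones are present. The tone frequencies passed by the caller, the tolerance, and the two-tone criterion are assumptions for illustration only.

```python
def find_peaks(magnitudes, bin_hz, min_height):
    """Return [(frequency_hz, magnitude)] for local maxima of at least min_height."""
    peaks = []
    for k in range(1, len(magnitudes) - 1):
        m = magnitudes[k]
        # Strictly above the left neighbor so a flat plateau yields one peak.
        if m >= min_height and m > magnitudes[k - 1] and m >= magnitudes[k + 1]:
            peaks.append((k * bin_hz, m))
    return peaks

def matches_two_tone(peaks, tone_a_hz, tone_b_hz, tolerance_hz):
    """Hypothetical feature test: both tones appear among the detected peaks."""
    freqs = [f for f, _ in peaks]
    has_a = any(abs(f - tone_a_hz) <= tolerance_hz for f in freqs)
    has_b = any(abs(f - tone_b_hz) <= tolerance_hz for f in freqs)
    return has_a and has_b
```

For example, `matches_two_tone(find_peaks(spectrum, 10.0, 1.0), 770.0, 960.0, 15.0)` would test a spectrum with 10 Hz bins for a pair of tones near 770 Hz and 960 Hz, where those values are placeholders chosen for this sketch.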

[0671] Step 4:

[0672] The server generates sign language expressions using a generative AI model based on the analysis results of the audio features. In this process, the analysis results are supplied as a prompt to the AI model, which then outputs a sign language video containing appropriate action instructions for the driver. This video will have content corresponding to the siren pattern and the direction of approach. The input is a prompt based on the audio features, and the output is video data of the generated sign language expression.
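
How the analysis results might be turned into a prompt in Step 4 can be sketched as follows. The first line of the prompt is the example prompt given in this description; the additional fields and the function name `build_prompt` are hypothetical, and the call to the generative AI model itself is omitted because its interface is not specified here.

```python
def build_prompt(siren_pattern, direction, instruction):
    """Compose a text prompt from the analysis results (illustrative layout)."""
    return (
        "Generate sign language videos corresponding to siren patterns.\n"
        f"Siren pattern: {siren_pattern}\n"
        f"Approach direction: {direction}\n"
        f"Instruction to sign: {instruction}"
    )
```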

[0673] Step 5:

[0674] The server sends the generated sign language video to the terminal. The terminal receives this video data and displays it on the in-vehicle display. This allows the user to visually recognize the approach of an emergency vehicle and act according to the instructions. The input is the sign language video data sent from the server, and the output is the sign language expression displayed on the terminal's display.

[0675] Step 6:

[0676] The user sees the sign language indication displayed on the device's screen and decides on the appropriate driving action based on the current situation. For example, if "move to the right" is displayed, the user follows the direction signal and maneuvers the vehicle appropriately to allow emergency vehicles to pass. The input is the information displayed on the device, and the output is the user's driving action.

[0677] (Application Example 1)

[0678] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0679] There is a problem in that hearing-impaired drivers have difficulty visually recognizing approaching emergency vehicles and responding appropriately. In particular, with autonomous vehicles, the lack of visual information provided to the occupants makes it difficult for hearing-impaired drivers to respond to emergencies.

[0680] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0681] In this invention, the server includes information gathering means for acquiring ambient acoustic data, analysis means for analyzing the acoustic data to identify the acoustic pattern of a specific emergency mobility device, and information generation means for generating visual information including operation instruction information based on the identified acoustic pattern. This enables the provision of visual information that allows hearing-impaired individuals to safely respond to emergencies inside an autonomous vehicle.

[0682] "Surroundings" refers to the environment and circumstances that exist outside the moving object in question.

[0683] "Acoustic data" refers to information acquired as sound or acoustic signals.

[0684] "Information gathering means" refers to devices and methods used to acquire necessary data.

[0685] "Specific emergency mobility device" refers to vehicles used in emergencies, such as ambulances and police vehicles.

[0686] An "acoustic pattern" refers to the characteristic waveform or signal arrangement of a particular sound or audio.

[0687] "Analysis means" refers to devices and methods used to analyze acquired data and extract necessary information.

[0688] "Operation instruction information" refers to information that indicates what action should be taken towards the target.

[0689] "Visual information" refers to information provided in a format that humans can see and recognize.

[0690] "Information generation means" refers to devices or methods that create necessary information based on analysis results.

[0691] "Information output means" refers to devices or methods for presenting generated information to users.

[0692] A system implementing this invention includes a program for analyzing ambient acoustic data and providing visual information to the user based on that data.

[0693] Server Role

[0694] The server begins by acquiring acoustic data. To fulfill this role, the server uses analytical tools to analyze the acoustic data. This analysis employs signal processing techniques such as FFT (Fast Fourier Transform). By analyzing the acoustic data, the server identifies acoustic patterns associated with specific emergency mobility devices and evaluates their approach direction. The data that forms the basis of the acoustic data is collected using speech recognition libraries such as the Google Cloud Speech-to-Text API.

[0695] Terminal role

[0696] The terminal displays visual information received from the server to the user. This visual information is in an easily recognizable format and provides the user with action instructions. Information is conveyed visually through the terminal's display, showing specific instructions such as "turn left" or "stop and yield." Machine learning frameworks such as TensorFlow are used to generate this information.

[0697] User roles

[0698] Users review the visual information displayed on the device and take appropriate action according to the provided instructions. This plays a crucial role in helping them make sound judgments while driving. Visual information enables safe driving without relying on hearing.

[0699] Specific example

[0700] For example, when an ambulance is approaching, a message such as "Emergency vehicle approaching: Stop and yield the right of way" will be displayed. This allows the user to respond safely and quickly.

[0701] Example of a prompt for the generative AI model

[0702] "Please describe the process of generating a sign language video and displaying it on the screen when an ambulance siren is detected."

[0703] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0704] Step 1:

[0705] The server receives ambient acoustic data from the terminal. This acoustic data is acquired by microphones mounted on the vehicle and transmitted to the server. The server receives raw audio waveform data as input and passes this data on as the output for the next analysis step.

[0706] Step 2:

[0707] The server uses FFT (Fast Fourier Transform) to analyze the received acoustic data. The input is the acoustic data obtained in step 1, and the frequency components are analyzed by applying FFT to it. The output is the frequency spectrum, which serves as the basis for identifying the acoustic patterns of specific emergency vehicles. This analysis allows for the detection of specific acoustic patterns of emergency vehicles.
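
One simple way the frequency spectrum output by Step 2 could serve as a basis for pattern identification is to measure what fraction of the spectral energy lies in a band of interest. The following sketch assumes the spectrum is given as a list of bin magnitudes with uniform bin spacing; the band limits are chosen by the caller and are not specified in this disclosure.

```python
def band_energy_ratio(magnitudes, bin_hz, low_hz, high_hz):
    """Fraction of total spectral energy in bins between low_hz and high_hz."""
    total = sum(m * m for m in magnitudes)
    if total == 0.0:
        return 0.0  # silent frame: no energy in any band
    in_band = sum(
        m * m
        for k, m in enumerate(magnitudes)
        if low_hz <= k * bin_hz <= high_hz
    )
    return in_band / total
```

A ratio close to 1 indicates that the frame is dominated by components in the chosen band, which could then be compared with a threshold set for the deployment environment.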

[0708] Step 3:

[0709] Based on the analysis results, the server determines operational instructions that correspond to the acoustic pattern of the identified emergency vehicle and its approach direction. The input for this step is the acoustic pattern and approach direction identified in step 2, and the output is information containing appropriate operational instructions. A machine learning model based on historical data is used to generate this information.

[0710] Step 4:

[0711] The terminal receives visual information transmitted from the server. The input is visual information, including operational instructions, generated by the server, and the output is the visual presentation of that information on the display. The terminal notifies the user by displaying this information on the screen.

[0712] Step 5:

[0713] The user confirms the visual information displayed on the terminal's screen and takes appropriate driving actions based on the presented instructions. The input is the visual instructions displayed by the terminal, and the output is the user's driving actions based on that information. This allows the user to continue driving safely and consciously.

[0714] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0715] This invention provides a system that enables drivers with hearing impairments to recognize the approach of emergency vehicles and that provides support tailored to the user's emotional state. The system is primarily built around a terminal, server, and user, and incorporates an emotion engine to achieve more personalized and effective support.

[0716] Terminal role:

[0717] The device acquires ambient audio data through microphones installed in the vehicle. It can also utilize facial recognition and biometric sensors to understand the user's emotional state. This information is used to adjust the system to ensure safe and comfortable driving. The collected data is sent to a server for analysis.

[0718] Server role:

[0719] The server analyzes audio data transmitted from the terminal to identify the audio patterns of emergency vehicles. It uses FFT to analyze the frequency components of the audio signal and detect specific audio patterns. This information is also useful in determining the direction from which the emergency vehicle is approaching. Furthermore, the server uses an emotion engine to assess the user's emotional state. This emotional state is used to adjust the content and tone of the generated sign language video, providing instructions appropriate to the user's mental state.

[0720] User roles:

[0721] The driver, as the user, checks sign language videos displayed on the vehicle's screen via a terminal. For example, if the system detects that the user is experiencing stress, a sign language video with more detailed instructions in a gentle and positive tone is provided. Based on this feedback, the user can choose appropriate driving actions in response to the approach of an emergency vehicle.

[0722] In this way, a system that incorporates an emotion engine provides more flexible and effective driving assistance according to the driver's individual state. In this embodiment, by comprehensively analyzing the user's emotions and the surrounding sound environment, it is possible to reduce the driver's psychological burden and create a more comfortable and safe driving environment. For example, if the emotion engine detects that the user is tense, it can provide appropriate instructions such as, "Calm down and turn left."

[0723] The following describes the processing flow.

[0724] Step 1:

[0725] The terminal uses microphones and cameras mounted on the vehicle, or biometric sensors, to acquire ambient audio data and user emotion data. Audio data is used to detect sirens, and emotion data is obtained from the user's facial expressions and heart rate. This data is temporarily stored in a buffer and prepared to be sent to the server.

[0726] Step 2:

[0727] The device transmits collected voice and emotion data to the server. Data transmission is performed in real time and packetized with high precision for detailed analysis.

[0728] Step 3:

[0729] The server analyzes the audio data transmitted from the terminal and extracts frequency components using FFT. This allows it to identify the siren of an emergency vehicle and determine its approaching direction. The direction is determined using the intensity and phase difference of the audio.
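
The use of audio intensity for direction determination in Step 3 can be illustrated by comparing the RMS levels of two channels. This sketch covers only the intensity cue, not the phase difference, and assumes left and right microphones; the 3 dB threshold and the function names are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of one channel."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_difference_db(left, right, floor=1e-12):
    """Left-minus-right level in decibels; floor avoids division by zero."""
    return 20.0 * math.log10(max(rms(left), floor) / max(rms(right), floor))

def direction_from_levels(left, right, threshold_db=3.0):
    """Coarse direction from the inter-channel level difference (assumed layout)."""
    diff = level_difference_db(left, right)
    if diff > threshold_db:
        return "left"
    if diff < -threshold_db:
        return "right"
    return "front_or_rear"
```

In practice the intensity cue would likely be combined with the phase or arrival-time cue, since reflections from buildings and other vehicles can distort level differences.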

[0730] Step 4:

[0731] The server uses an emotion engine to evaluate the user's emotional state. Based on the supplied facial expression data and biometric information, it determines whether the user is relaxed or stressed. This allows it to identify the optimal instructions for the user's emotions.
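
The relaxed-or-stressed determination in Step 4 is performed by the emotion engine, whose internals are not described here. Purely as a rule-based stand-in for illustration, the sketch below combines heart-rate elevation with a facial tension score; the inputs, the equal weighting, and the 0.5 decision threshold are all assumptions and do not represent the emotion identification model 59.

```python
def assess_emotional_state(heart_rate_bpm, resting_bpm, facial_tension):
    """Return (label, score) with label 'relaxed' or 'stressed'.

    facial_tension is assumed to be a value in [0, 1] derived from
    facial expression analysis. A heart rate 50% above resting is
    treated as the maximum contribution from the heart-rate cue.
    """
    elevation = max(0.0, (heart_rate_bpm - resting_bpm) / resting_bpm)
    heart_component = min(1.0, elevation / 0.5)
    score = min(1.0, 0.5 * heart_component + 0.5 * facial_tension)
    label = "stressed" if score >= 0.5 else "relaxed"
    return label, score
```

The resulting label could then be used to choose between standard and gentler instruction wording when the sign language video is generated.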

[0732] Step 5:

[0733] The server generates an appropriate sign language video based on the results of the audio analysis and the emotion assessment. For example, if an emergency vehicle is approaching and the user is feeling anxious, the server prepares a video that includes gentle instructions such as "Stay calm and change direction."
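One way to picture how the two analysis results in Step 5 combine is a small selection function that turns an approach direction and an emotional state into the instruction text a sign language video would convey. The `Emotion` enum, the direction strings, and all instruction wording below are hypothetical placeholders, not text from the source and not traffic guidance.

```python
from enum import Enum


class Emotion(Enum):
    RELAXED = "relaxed"
    ANXIOUS = "anxious"


def choose_instruction(direction: str, emotion: Emotion) -> str:
    """Compose instruction text from direction and emotional state (placeholder wording)."""
    base = f"Emergency vehicle approaching from the {direction}. Make way."
    if emotion is Emotion.ANXIOUS:
        # An anxious user gets a calming opener and a reassuring closing sentence.
        return "Stay calm. " + base + " You have time to move over safely."
    return base


relaxed_text = choose_instruction("rear", Emotion.RELAXED)
anxious_text = choose_instruction("rear", Emotion.ANXIOUS)
```

In the described system the resulting text would be handed to the video generation step rather than shown directly.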

[0734] Step 6:

[0735] The terminal displays sign language videos received from the server on a screen inside the vehicle. This display is positioned in a location easily accessible to the user and configured to ensure visibility. Because it is updated in real time, timely information can be provided.

[0736] Step 7:

[0737] The user checks the sign language video displayed on the screen. Following the instructions, they decide on driving actions and respond to emergency vehicles. Emotionally sensitive instructions allow the user to continue driving with greater confidence.

[0738] (Example 2)

[0739] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0740] For drivers with hearing impairments, it is crucial to quickly and accurately recognize approaching emergency vehicles and respond safely. However, conventional technologies are limited to simple analysis of acoustic signals and do not provide individualized support based on the user's emotional state. As a result, even when users are experiencing stress or anxiety, only uniform information can be provided, which hinders effective driving assistance.

[0741] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0742] In this invention, the server includes an analysis means for analyzing acoustic data to identify the acoustic pattern of an emergency vehicle, a state determination means for evaluating the user's emotional state, and a generation means for generating sign language video based on the identified acoustic pattern and the evaluated emotional state. This makes it possible to provide flexible and optimal driving assistance information according to the user's emotional state.

[0743] "Acquisition means" refers to a function that uses equipment installed in the vehicle to collect acoustic data from the surrounding environment.

[0744] The "analysis means" refers to a function that analyzes acquired acoustic data and identifies the acoustic pattern of a specific emergency vehicle.

[0745] A "state determination means" is a function that evaluates the user's current emotional state based on the user's facial expressions and biometric information.

[0746] The "generation means" is a function for creating sign language video containing appropriate instructional information based on identified sound patterns and the user's emotional state.

[0747] "Display means" refers to a function for displaying the generated sign language video on a visual display device inside the vehicle.

[0748] The "direction estimation means" is a function for estimating the direction of approach of an emergency vehicle from the analyzed acoustic pattern.

[0749] This invention is a system designed to enable drivers with hearing impairments to recognize the approach of emergency vehicles and receive appropriate driving assistance. Its specific configuration and implementation method are described below.

[0750] Terminal role:

[0751] The terminal includes microphones and cameras installed in the vehicle, and through this hardware it collects ambient acoustic data, the user's facial expressions, and biometric data. For example, the terminal can acquire external siren sounds in real time using the microphone, and analyze the user's emotional state using the camera and heart rate sensor. The collected data is transmitted to the server using a secure communication method.

[0752] Server role:

[0753] The server analyzes the acoustic data transmitted from the terminal using algorithms such as FFT (Fast Fourier Transform) to identify the acoustic patterns of emergency vehicles. For example, FFT is used to extract specific frequency components and detect siren sounds. The server also uses an emotion engine to evaluate the user's emotional state from the received facial expression data and biosignals; deep learning techniques using AI models can be used for this emotion analysis. The software required for these steps is acoustic analysis software and an emotion analysis module.

[0754] Generation means:

[0755] The server then uses a generative AI model to generate an appropriate sign language video in response to the approaching emergency vehicle. The generated video includes instructions tailored to the user's emotional state. Specifically, a user who is feeling anxious receives instructions in a gentle tone to help them calm down. An example prompt is: "Emergency vehicle approaching. The user is feeling anxious. Please generate a sign language video with gentle instructions to help them calm down."
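The example prompt above follows a simple pattern (event, emotional state, requested tone, goal), so it could be assembled from the analysis results with a small template. The sketch below is an assumption about how such a prompt might be built; `PromptContext` and `build_prompt` are hypothetical names, and the call to the generative AI model itself is deliberately left out because the source does not name an API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptContext:
    event: str     # what was detected, e.g. "Emergency vehicle approaching"
    emotion: str   # evaluated emotional state, e.g. "anxious"
    tone: str      # requested tone of the instructions, e.g. "gentle"
    goal: str      # intended effect, e.g. "help them calm down"


def build_prompt(ctx: PromptContext) -> str:
    """Fill the template with the detection result and the evaluated emotion."""
    return (
        f"{ctx.event}. The user is feeling {ctx.emotion}. "
        f"Please generate a sign language video with {ctx.tone} "
        f"instructions to {ctx.goal}."
    )


prompt = build_prompt(PromptContext(
    event="Emergency vehicle approaching",
    emotion="anxious",
    tone="gentle",
    goal="help them calm down",
))
```

With these inputs the template reproduces the example prompt given in the text.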

[0756] User roles:

[0757] The user views sign language video displayed inside the vehicle via a terminal and selects driving actions based on it. This video includes information such as the direction of approaching emergency vehicles and specific instructions that take the user's emotions into consideration. This allows the user to respond quickly and with confidence.

[0758] The implementation of this system will reduce the burden on drivers and support safe and smooth driving.

[0759] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0760] Step 1:

[0761] The terminal acquires ambient acoustic data using microphones installed in the vehicle. Specifically, it collects acoustic data in real time and temporarily stores it in a buffer. The input for this step is the ambient acoustic environment, and the output is the acoustic data prepared for use.

[0762] Step 2:

[0763] The terminal acquires the user's facial expressions and biometric data using cameras and heart rate sensors installed inside the vehicle. Specifically, it captures facial expressions with the camera and measures heart rate with the heart rate sensor. The input for this step is the user's facial expressions and biometric signals, and the output is facial expression data and biometric data prepared for analysis.

[0764] Step 3:

[0765] The terminal transmits the collected acoustic data, user facial expression data, and biometric data to the server. Specifically, it encrypts the data using a secure communication protocol and sends it to the server. The input for this step is the acoustic data and the user's facial expression and biometric data, and the output is the data that has reached the server.
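A minimal sketch of how the terminal might package the three kinds of data before transmission is shown below. It covers serialization only. The JSON layout, the field names, and the 16-bit PCM encoding are assumptions for illustration, and encryption is assumed to be handled by the transport layer (for example TLS), since the source says only that a secure communication protocol is used.

```python
import base64
import json
import struct
from typing import Dict, List


def pack_payload(audio: List[int], heart_rate_bpm: float, expression: str) -> bytes:
    """Serialize 16-bit PCM samples and emotion-related readings into JSON bytes."""
    pcm = struct.pack(f"<{len(audio)}h", *audio)  # little-endian signed 16-bit
    body = {
        "audio_pcm16_b64": base64.b64encode(pcm).decode("ascii"),
        "heart_rate_bpm": heart_rate_bpm,
        "expression": expression,
    }
    return json.dumps(body).encode("utf-8")


def unpack_payload(raw: bytes) -> Dict[str, object]:
    """Server-side inverse of pack_payload."""
    body = json.loads(raw.decode("utf-8"))
    pcm = base64.b64decode(body["audio_pcm16_b64"])
    samples = list(struct.unpack(f"<{len(pcm) // 2}h", pcm))
    return {
        "audio": samples,
        "heart_rate_bpm": body["heart_rate_bpm"],
        "expression": body["expression"],
    }


# Round trip with a few hypothetical sample values.
raw = pack_payload([0, 1200, -1200, 32767], 88.0, "tense")
restored = unpack_payload(raw)
```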

[0766] Step 4:

[0767] The server uses FFT (Fast Fourier Transform) to analyze acoustic data and identify the acoustic patterns of emergency vehicles. Specifically, it analyzes the frequency components of the acoustic data to identify siren sounds. The input to this step is the acoustic data sent to the server, and the output is the identified acoustic pattern of the emergency vehicle.
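An "acoustic pattern" can involve more than a single frequency. Many sirens alternate between two tones, and that alternation over time is one pattern a server could look for. The sketch below is an illustration, not the method of the source: instead of a full FFT it uses the Goertzel algorithm, which evaluates a single DFT bin, to compare the energy at two tones frame by frame. The tone frequencies (960 Hz and 770 Hz), the frame length, and the thresholds are assumed values chosen for the example.

```python
import math
from typing import List, Optional

HI_HZ = 960.0  # assumed illustrative tone, not taken from the source
LO_HZ = 770.0  # assumed illustrative tone, not taken from the source


def goertzel_power(frame: List[float], sample_rate: float, target_hz: float) -> float:
    """Squared magnitude of the DFT bin nearest target_hz (Goertzel algorithm)."""
    n = len(frame)
    k = round(n * target_hz / sample_rate)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in frame:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def label_frame(frame: List[float], sample_rate: float,
                ratio: float = 4.0, min_power: float = 1.0) -> Optional[str]:
    """Label a frame "hi" or "lo" if one tone clearly dominates, else None."""
    p_hi = goertzel_power(frame, sample_rate, HI_HZ)
    p_lo = goertzel_power(frame, sample_rate, LO_HZ)
    if max(p_hi, p_lo) < min_power:
        return None  # too quiet to decide
    if p_hi > ratio * p_lo:
        return "hi"
    if p_lo > ratio * p_hi:
        return "lo"
    return None


def is_two_tone_siren(samples: List[float], sample_rate: float,
                      frame_len: int, min_alternations: int = 2) -> bool:
    """True if the labeled frames switch between the two tones often enough."""
    changes, previous = 0, None
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        label = label_frame(samples[start:start + frame_len], sample_rate)
        if label is None:
            continue
        if previous is not None and label != previous:
            changes += 1
        previous = label
    return changes >= min_alternations


def tone(freq_hz: float, n: int, sample_rate: float) -> List[float]:
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


# Synthetic signals: an alternating two-tone pattern and a steady single tone.
RATE, FRAME = 8000.0, 800
siren: List[float] = []
for _ in range(2):
    siren += tone(HI_HZ, 2 * FRAME, RATE) + tone(LO_HZ, 2 * FRAME, RATE)
steady = tone(HI_HZ, 8 * FRAME, RATE)
```

Requiring alternation rather than a single peak is one way to avoid flagging steady tones, such as a horn, that happen to fall in the same frequency range.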

[0768] Step 5:

[0769] The server uses an emotion engine to analyze the user's facial expression data and biometric data to evaluate the user's emotional state. Specifically, it uses a deep learning-based AI model to determine the emotional state. The input for this step is the user's facial expression data and biometric data, and the output is the determined emotional state of the user.

[0770] Step 6:

[0771] The server uses a generative AI model to generate sign language videos based on acoustic patterns and emotional states. Specifically, prompts are used to instruct the AI model, generating sign language videos that meet the specified conditions. The input for this step is the identified acoustic patterns and evaluated emotional states, and the output is the generated sign language video.

[0772] Step 7:

[0773] The terminal displays sign language video received from the server on the in-car display. Specifically, it uses the display's video playback function to provide the video to the user. The input for this step is the sign language video sent from the server, and the output is the visualized sign language video.

[0774] Step 8:

[0775] The user reviews the displayed sign language video and selects appropriate driving actions based on it. Specifically, they operate the car according to the information provided. The input for this step is the sign language video displayed on the terminal, and the output is the user's specific driving actions.

[0776] (Application Example 2)

[0777] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0778] There is a problem in that drivers with hearing impairments have difficulty quickly recognizing the approach of emergency vehicles and taking appropriate action based on their emotional state. Furthermore, conventional systems provide standardized warnings based only on audio information, which means they cannot respond in a way that is tailored to the emotional state of individual drivers.

[0779] The identification processing by the identification processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means. In this invention, the server includes means for acquiring ambient acoustic information, means for analyzing the acoustic information to identify the sound pattern of a specific emergency vehicle, and means for generating visual display data including instruction information based on the identified sound pattern and the user's emotional state. This allows drivers with hearing impairments to visually confirm the approach of an emergency vehicle and to receive a response personalized to their emotional state at that time.

[0780] "Ambient acoustic information" refers to all sound data acquired from the vehicle's surrounding environment, including emergency vehicle sirens and road noise.

[0781] "Acquisition means" refers to the function of capturing ambient acoustic information using sensors and microphones inside the vehicle.

[0782] The "analysis means" has the function of analyzing acoustic information obtained through the acquisition means and processing it to identify specific sound patterns of emergency vehicles.

[0783] "Generation means" refers to a function that creates visual display data and provides appropriate instructions based on information identified by the analysis means and the user's emotional state.

[0784] "Display means" refers to a function that displays the generated visual display data on a display inside the vehicle or on a similar device to visually communicate information to the driver.

[0785] "Emotional state" refers to the driver's psychological or physiological state, including internal conditions such as stress and tension.

[0786] "Visual display data" refers to digital displays that include instructions and warnings presented to the driver as visual information.

[0787] The system implementing this invention primarily consists of a server and a terminal. The terminal includes a microphone and sensors mounted on the vehicle, used to acquire ambient acoustic information and the driver's emotional state. This information is transmitted to the server via data communication. On the server, an audio analysis algorithm is first used to identify the siren sound of an emergency vehicle from the ambient acoustic information. This can be done using a speech recognition API such as Google Cloud Speech-to-Text.

[0788] Furthermore, to assess the driver's emotional state, biometric information acquired from the device is analyzed using an emotion recognition API such as Affectiva. This makes it possible to generate appropriate instructions based on the driver's stress level and attention level. The generated instruction information is provided as visual display data on the in-vehicle display. The system informs the driver of approaching emergency vehicles and provides action instructions tailored to their current psychological state through visual feedback.

[0789] For example, suppose a driver is driving on a highway. This system constantly monitors the surrounding sounds in the background, and when it detects an emergency vehicle siren, it takes into account the driver's emotional state and visually displays instructions such as "Please remain calm and drive slowly."

[0790] An example of a prompt to the generative AI model is, "Suggest what kind of AR notification would be appropriate to alert the user to the approach of an emergency vehicle, based on the user's emotional state." In this way, this invention provides a system that compensates for the driver's hearing impairment and improves safe driving assistance.

[0791] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0792] Step 1:

[0793] The terminal acquires ambient acoustic information using a microphone mounted on the vehicle. This input data is sent to the server as a raw audio signal.

[0794] Step 2:

[0795] The server uses FFT (Fast Fourier Transform) to decompose the received audio signal data into frequency components and identifies the siren sound of an emergency vehicle. The output provides information on the presence and approach direction of the emergency vehicle.
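Paragraph [0729] states that direction is determined from the intensity and phase difference of the audio. A closely related approach, sketched here as an assumption rather than as the method of the source, is to estimate the arrival-time difference between a left and a right microphone by cross-correlation: the channel that hears the sound later is farther from the source. The two-microphone layout, the function names, and the lag window are hypothetical.

```python
import math
from typing import List


def best_lag(left: List[float], right: List[float], max_lag: int) -> int:
    """Lag in samples at which `right` best matches `left`; positive means right lags."""
    def score(lag: int) -> float:
        return sum(left[i] * right[i + lag]
                   for i in range(len(left))
                   if 0 <= i + lag < len(right))
    return max(range(-max_lag, max_lag + 1), key=score)


def approach_side(left: List[float], right: List[float], max_lag: int = 6) -> str:
    lag = best_lag(left, right, max_lag)
    if lag > 0:
        return "left"    # the right microphone hears the sound later
    if lag < 0:
        return "right"   # the left microphone hears the sound later
    return "center"      # no measurable delay; front and rear are not distinguishable here


def burst(n: int, start: int, length: int, freq_hz: float, sample_rate: float) -> List[float]:
    """A Hann-windowed tone burst embedded in silence (synthetic test signal)."""
    out = [0.0] * n
    for i in range(length):
        window = 0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
        out[start + i] = window * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
    return out


def delayed(signal: List[float], d: int) -> List[float]:
    """Delay a signal by d samples, padding the front with silence."""
    return [0.0] * d + signal[:len(signal) - d]


RATE = 8000.0
source = burst(256, 100, 64, 960.0, RATE)
```

The lag window has to stay shorter than half the period of the tone. A steady periodic siren correlates equally well at lags one period apart, which is why the test signal here is a short windowed burst.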

[0796] Step 3:

[0797] The terminal uses facial recognition and biometric sensors to acquire information on the driver's current emotional state. This information is sent to the server as data indicating the driver's stress level and concentration level.

[0798] Step 4:

[0799] The server utilizes an emotion recognition API, such as Affectiva, to analyze the received emotional state data. Based on the emotion evaluation, it determines the optimal driving instructions. The output of this process is the specific instructions.

[0800] Step 5:

[0801] The server generates visual display data, creating information that includes instructions best suited to the driver's current situation and emotional state. The generated visual display data is sent to the terminal.

[0802] Step 6:

[0803] The terminal displays the received visual display data on the vehicle's display. By checking it visually, the driver can recognize the approach of an emergency vehicle and take appropriate driving actions in response.

[0804] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating the user's input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0805] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0806] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0807] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping. Specifically, the emotion identification model 59 may determine the user's emotion according to a specific mapping, which is an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the identification processing unit 290 may perform identification processing using the robot's emotion.

[0808] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0809] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0810] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).

[0811] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.

[0812] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0813] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
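The last stage of this description, turning per-emotion values into a single determined emotion, can be sketched as follows. The neural network itself is not shown. The emotion values, the choice of the highest value as the determined emotion, and the squared-difference penalty used to express "neighboring emotions have similar values" are all assumptions for illustration; the source does not say how the determination or the training constraint is implemented.

```python
from typing import Dict, List, Tuple


def dominant_emotion(values: Dict[str, float]) -> str:
    """Pick the emotion with the highest emotion value (assumed decision rule)."""
    return max(values, key=lambda name: values[name])


def neighbor_penalty(values: Dict[str, float],
                     neighbors: List[Tuple[str, str]]) -> float:
    """Sum of squared differences between emotions that sit close together on the map.

    A term like this could be added to a training loss to encourage similar
    values for neighboring emotions; it is a hypothetical formulation.
    """
    return sum((values[a] - values[b]) ** 2 for a, b in neighbors)


# Hypothetical network output for one user input.
values = {"reassured": 0.82, "calm": 0.80, "confident": 0.78, "anxious": 0.10}
# Hypothetical adjacency taken from positions on the emotion map.
neighbors = [("reassured", "calm"), ("calm", "confident")]

emotion = dominant_emotion(values)
penalty = neighbor_penalty(values, neighbors)
```

In this example the three neighboring emotions have nearly equal values, so the penalty is small, which matches the situation described for Figure 10.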

[0814] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0815] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0816] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0817] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0818] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0819] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0820] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0821] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0822] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0823] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0824] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted as being incorporated by reference.

[0825] The following is further disclosed regarding the embodiments described above.

[0826] (Claim 1)

[0827] A means for acquiring surrounding audio data,

[0828] Analysis means for analyzing the aforementioned audio data to identify the audio pattern of a specific emergency vehicle,

[0829] A generation means for generating a sign language video containing instruction information based on the identified audio pattern,

[0830] A display means for displaying the aforementioned sign language video on the vehicle's display,

[0831] A system that includes this.

[0832] (Claim 2)

[0833] The system according to claim 1, wherein the analysis means further includes direction recognition means for evaluating the direction of approach of an emergency vehicle.

[0834] (Claim 3)

[0835] The system according to claim 1, wherein the generation means generates a plurality of sign language videos containing different instruction information corresponding to the approaching direction of the emergency vehicle.

[0836] "Example 1"

[0837] (Claim 1)

[0838] A means of acquiring ambient sounds,

[0839] Analysis means for analyzing the aforementioned sound to identify the sound characteristics of a specific emergency vehicle,

[0840] A generation means for generating a sign language expression including instruction information based on the identified sound characteristics,

[0841] A display means for displaying the aforementioned sign language expression on a vehicle's display device,

[0842] A system that includes this.

[0843] (Claim 2)

[0844] The system according to claim 1, wherein the analysis means further includes a direction evaluation means for evaluating the direction of approach of an emergency vehicle.

[0845] (Claim 3)

[0846] The system according to claim 1, wherein the generation means generates a plurality of sign language expressions that include different instruction information corresponding to the approaching direction of the emergency vehicle.

[0847] "Application Example 1"

[0848] (Claim 1)

[0849] Information gathering means for acquiring ambient acoustic data,

[0850] An analysis means for analyzing the aforementioned acoustic data to identify the acoustic pattern of a specific emergency transportation means,

[0851] Information generation means for generating visual information including operation instruction information based on the identified acoustic pattern,

[0852] Output means for visually presenting the aforementioned visual information to the information output means of a mobile body,

[0853] A system that includes this.

[0854] (Claim 2)

[0855] The system according to claim 1, wherein the analysis means further includes direction identification means for determining the approach direction of the emergency transportation means.

[0856] (Claim 3)

[0857] The system according to claim 1, wherein the information generation means generates a plurality of pieces of visual information, including different operation instruction information corresponding to the approach direction of the emergency transportation means.

[0858] "Example 2 of combining an emotion engine"

[0859] (Claim 1)

[0860] A means for acquiring ambient acoustic data,

[0861] An analysis means for analyzing the aforementioned acoustic data to identify the acoustic pattern of a specific emergency vehicle,

[0862] A state determination means for evaluating the user's emotional state,

[0863] A generation means for generating sign language video containing instructional information based on the identified acoustic pattern and emotional state,

[0864] A display means for displaying the aforementioned sign language video,

[0865] A system that includes this.

[0866] (Claim 2)

[0867] The system according to claim 1, wherein the analysis means further includes a direction estimation means for evaluating the direction of approach of an emergency vehicle.

[0868] (Claim 3)

[0869] The system according to claim 1, wherein the generation means generates a plurality of sign language videos that include different instruction information corresponding to the approaching direction of the emergency vehicle and the emotional state of the user.

[0870] "Application example 2 when combining with an emotional engine"

[0871] (Claim 1)

[0872] A means for acquiring ambient acoustic information,

[0873] Analysis means for analyzing the aforementioned acoustic information to identify the sound pattern of a specific emergency vehicle,

[0874] A generation means for generating visual display data including instruction information based on the identified sound pattern and the user's emotional state,

[0875] A display means for displaying the aforementioned visual display data on an in-vehicle display device,

[0876] A system that includes this.

[0877] (Claim 2)

[0878] The system according to claim 1, wherein the analysis means further includes a direction detection means for evaluating the direction of approach of an emergency vehicle, and further includes a means for evaluating the emotional state of the user.

[0879] (Claim 3)

[0880] The system according to claim 1, wherein the generation means generates a plurality of visual display data including different instruction information corresponding to the approaching direction of the emergency vehicle and the emotional state of the user.

[Explanation of symbols]

[0881]
10, 210, 310, 410 Data processing systems
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: an acquisition means for acquiring surrounding audio data; an analysis means for analyzing the audio data to identify the audio pattern of a specific emergency vehicle; a generation means for generating a sign language video containing instruction information based on the identified audio pattern; and a display means for displaying the sign language video on the vehicle's display.

2. The system according to claim 1, wherein the analysis means further includes a direction recognition means for evaluating the direction of approach of an emergency vehicle.

3. The system according to claim 1, wherein the generation means generates a plurality of sign language videos containing different instruction information corresponding to the approaching direction of the emergency vehicle.