system
A system analyzes ambient sounds to generate sign language videos for visually impaired drivers, addressing the challenge of recognizing emergency vehicles and improving driving safety by providing real-time visual instructions.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-10
- Publication Date
- 2026-06-22
AI Technical Summary
Visually impaired drivers face challenges in quickly and accurately recognizing emergency vehicles, with conventional information identification and presentation technologies prone to delays and inaccuracies, impairing safe driving.
A system that utilizes a server to analyze ambient sounds for emergency vehicle sirens, estimate direction and distance, and generate sign language videos displayed on a vehicle's head-up display to provide quick and accurate visual instructions.
Enables visually impaired drivers to respond quickly and safely to emergencies by providing real-time, accurate visual information, enhancing driving safety.
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of the chatbot's character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
[0006] "Acoustic detection means" refers to a device or method that receives sound from an external source and converts that data into a format that can be processed as an electrical signal.
[0007] "Acoustic analysis means" refers to a device or algorithm for analyzing audio data collected by acoustic detection means to detect specific sounds or patterns.
[0008] "Video generation means" refers to a device or process that creates visual instructions based on analysis results and outputs them as video in a specific format.
[0009] "Display control means" refers to a device or method for projecting the generated image onto a display device at an appropriate timing to convey visual information to the driver.
[0010] "Estimation means" refers to a device or process for estimating the direction and distance of a sound source based on acoustic analysis results.
Brief Description of the Drawings
[0011]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.
Modes for Carrying Out the Invention
[0012] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.
[0013] First, the terminology used in the following description is explained.
[0014] In the following embodiments, the numbered processor (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.
[0015] In the following embodiments, the numbered RAM (Random Access Memory) is a memory in which information is temporarily stored and is used as a work memory by the processor.
[0016] In the following embodiments, the numbered storage is one or more non-volatile storage devices that store various programs and various parameters, etc. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), or magnetic tapes, and the like.
[0017] In the following embodiments, the numbered communication I/F (Interface) is an interface including a communication processor and an antenna, etc. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0018] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."
[0019] [First Embodiment]
[0020] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0021] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0022] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0023] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0024] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0025] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0026] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0027] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0028] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0029] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0030] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0031] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0032] This invention is a system for providing emergency information to visually impaired drivers using sound, and involves a server, a terminal, and a user.
[0033] First, the terminal continuously collects ambient sounds via microphones installed in the vehicle. The collected audio data is transmitted to a server. The server analyzes the received audio data and performs digital signal processing to identify specific sounds, such as the siren of an emergency vehicle. If this acoustic analysis detects the approach of an emergency vehicle, the system proceeds to the next stage.
[0034] From the detected acoustic data, the server estimates the direction and distance. Based on this, a video generation program within the server generates sign language video instructions for the driver. These videos include specific instructions such as "turn right," "turn left," and "stop."
[0035] The generated sign language video is transmitted directly to the terminal. The terminal immediately displays this video on the vehicle's head-up display, providing the user with visual information. Based on the displayed sign language video, the user can quickly perform driving maneuvers and take appropriate actions to safely avoid emergency vehicles.
[0036] For example, if the terminal detects the sound of an emergency vehicle's siren, the server analyzes the direction of the sound and confirms that it is approaching from the right rear. Based on this, the server generates a sign language video instructing the user to "turn right" and sends it to the terminal. The user then confirms this, appropriately moves their vehicle to the right, and yields to the emergency vehicle.
[0037] This system enables drivers with hearing impairments to respond quickly and safely to emergency vehicles, providing driving assistance.
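The selection of an instruction from the estimated direction and distance can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the right-hand rule mirrors the single example in the text (a siren from the right rear leads to "turn right"), while the names SirenEstimate and choose_instruction, the bearing convention, the 30 m stop distance, and every other rule are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold; the text gives no numeric values.
STOP_DISTANCE_M = 30.0


@dataclass(frozen=True)
class SirenEstimate:
    bearing_deg: float  # assumed convention: 0 = straight ahead, clockwise positive
    distance_m: float


def choose_instruction(est: SirenEstimate) -> str:
    """Pick one of the three instructions named in the text."""
    if est.distance_m <= STOP_DISTANCE_M:
        return "stop"
    bearing = est.bearing_deg % 360.0
    # Right-hand side (0 to 180 degrees): follows the text's example,
    # where a siren from the right rear leads to "turn right".
    if 0.0 < bearing < 180.0:
        return "turn right"
    # Left-hand side: hypothetical mirror of the rule above.
    if 180.0 < bearing < 360.0:
        return "turn left"
    # Directly ahead or directly behind: hypothetical fallback.
    return "stop"
```

For instance, under these assumptions choose_instruction(SirenEstimate(135.0, 200.0)) selects "turn right", matching the right-rear example in the text.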
[0038] The following describes the processing flow.
[0039] Step 1:
[0040] The terminal uses a microphone mounted on the vehicle to continuously collect ambient sounds. To improve the accuracy of sound detection, the collected audio data undergoes noise reduction processing before being sent to the server.
[0041] Step 2:
[0042] The server receives audio data transmitted from the terminal. Using a digital signal processing algorithm, the server analyzes the audio data and detects specific sound patterns, such as emergency vehicle sirens.
[0043] Step 3:
[0044] When the sound of an emergency vehicle's siren is detected, the server estimates its direction and distance. This estimation utilizes changes in the siren's intensity and frequency.
[0045] Step 4:
[0046] The server assesses the urgency of the situation and determines the necessary driving instructions based on that assessment. These instructions include options such as "turn right," "turn left," and "stop," and the most appropriate one is selected depending on the situation.
[0047] Step 5:
[0048] The AI on the server generates sign language videos based on the determined driving instructions. These sign language videos use actions and symbols to express the instructions and are designed to be easily understood visually.
[0049] Step 6:
[0050] The generated sign language video is transmitted to the terminal in real time. A communication protocol is used to minimize delays, enabling drivers to respond quickly.
[0051] Step 7:
[0052] The terminal receives sign language video sent from the server and immediately displays it on the head-up display. This allows the user to visually receive driving instructions.
[0053] Step 8:
[0054] The user checks the displayed sign language video and performs driving operations accordingly. For example, if the instruction is "turn right," the user turns the steering wheel to the right and takes appropriate action to avoid an emergency vehicle.
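Step 3 above relies on changes in the siren's frequency. One way such a change could be used is the classical Doppler relation, sketched below. The sketch assumes a stationary listener and a known emitted frequency, neither of which is stated in the text; the function name and the speed-of-sound constant are illustrative choices.

```python
SPEED_OF_SOUND_M_S = 343.0  # dry air at roughly 20 degrees C (assumed)


def radial_speed_from_doppler(f_observed_hz: float,
                              f_source_hz: float,
                              c: float = SPEED_OF_SOUND_M_S) -> float:
    """Speed of the source along the line of sight, in m/s.

    Positive means the source is approaching. Assumes a stationary
    listener, so f_obs = f_src * c / (c - v), solved here for v.
    """
    if f_observed_hz <= 0 or f_source_hz <= 0:
        raise ValueError("frequencies must be positive")
    return c * (1.0 - f_source_hz / f_observed_hz)
```

A received pitch above the emitted pitch yields a positive value (approaching), and a pitch below it yields a negative value (receding). In a moving vehicle the listener's own speed would also have to be taken into account.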
[0055] (Example 1)
[0056] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0057] It is difficult for visually impaired drivers to recognize surrounding emergencies appropriately and quickly. Therefore, there is a need for effective means of providing information to warn of approaching emergency vehicles. Furthermore, conventional information identification and presentation technologies are prone to delays and inaccuracies in information, which can impair safe driving.
[0058] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0059] In this invention, the server includes an acoustic collection means for continuously collecting ambient sounds, an acoustic analysis means for analyzing the transmitted sound data and identifying specific sounds, a position estimation means for estimating the direction and distance of sounds from the analysis results, and a video generation means for generating instructions as video. This makes it possible to quickly and accurately notify visually impaired drivers of emergencies through visual information.
[0060] "Acoustic acquisition means" refers to a device or system for continuously acquiring ambient sounds.
[0061] "Data transmission means" refers to a method or technique for transmitting collected acoustic information to another device or system.
[0062] "Acoustic analysis means" refers to a technology or device for analyzing sound data and identifying specific sounds from it.
[0063] "Position estimation means" refers to a method or system for estimating the direction and distance of a sound source.
[0064] "Image generation means" refers to a technology or device for creating specific instruction information as a visual image.
[0065] "Display means" refers to a device or system for visually showing the generated image to the user.
[0066] This invention is a support system for quickly notifying visually impaired drivers of approaching emergency vehicles through audio and visual information.
[0067] Hardware and software for implementation
[0068] First, the terminal functions as an audio collection device installed in the vehicle. This device incorporates a high-sensitivity microphone to collect ambient sounds in real time. The collected audio information is transmitted to a server via a network module. To optimize network transmission, the device often utilizes Wi-Fi or mobile communication technologies.
[0069] The server is equipped with acoustic analysis software to analyze the received audio data. This analysis utilizes acoustic signal processing libraries, with Librosa being a specific example. Using this library, the server performs frequency analysis to identify the siren sound of a particular emergency vehicle.
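The frequency analysis described above can be illustrated without any external library. The sketch below does not use Librosa; it swaps in the Goertzel algorithm from the standard library only, and measures how much of the probed signal power falls inside an assumed siren band. The band limits (600 to 1500 Hz), the probe grid, and the function names are assumptions made for illustration.

```python
import math


def goertzel_power(samples, sample_rate, freq_hz):
    """Squared DFT magnitude at the bin nearest to freq_hz (Goertzel)."""
    n = len(samples)
    k = round(n * freq_hz / sample_rate)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def siren_band_ratio(samples, sample_rate, band=(600.0, 1500.0)):
    """Share of probed power inside `band` (hypothetical siren band)."""
    # Probe every 100 Hz from 100 Hz to 3 kHz, staying below Nyquist.
    probes = [f for f in range(100, 3100, 100) if f < sample_rate / 2]
    in_band = total = 0.0
    for f in probes:
        p = goertzel_power(samples, sample_rate, f)
        total += p
        if band[0] <= f <= band[1]:
            in_band += p
    return in_band / total if total > 0 else 0.0
```

A frame could then be flagged as siren-like when the ratio exceeds a tuned threshold. A practical detector would also track how the dominant frequency sweeps over time, which this sketch omits.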
[0070] Based on the audio analysis results, the server also runs a position estimation algorithm to estimate the direction and distance of the sound source. This algorithm employs a technique that uses the time difference of the acoustic signals.
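A technique based on the time difference of acoustic signals can be sketched for the simplest case of two microphones. The example assumes a far-field source, a known microphone spacing, and a brute-force cross-correlation; none of these details come from the text, and the function names are hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air


def best_lag(left, right, max_lag):
    """Lag in samples that maximizes the cross-correlation.

    A positive result means the right channel is delayed relative to
    the left channel, i.e. the sound reached the left microphone first.
    """
    n = len(left)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def arrival_angle_deg(lag_samples, sample_rate, mic_spacing_m,
                      c=SPEED_OF_SOUND_M_S):
    """Angle from broadside, positive toward the left microphone."""
    tau = lag_samples / sample_rate
    # Clamp so that measurement noise cannot push asin out of range.
    s = max(-1.0, min(1.0, c * tau / mic_spacing_m))
    return math.degrees(math.asin(s))
```

Two microphones resolve only one angle and cannot distinguish front from rear; an array of three or more microphones would be needed to resolve a direction such as "right rear".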
[0071] To visualize the instructions, the server generates driving instructions for the user as sign language videos, based on a generation AI model. By utilizing AI generation technology in this process, video generation is performed quickly.
[0072] The generated video is sent back to the terminal and displayed on the vehicle's head-up display. This allows the user to take intuitive driving actions.
[0073] Specific example
[0074] For example, if the microphone picks up the sound of an emergency vehicle's siren while the user is driving, the server's analysis indicates that the sound is approaching from the right rear. Based on this, a sign language video instructing the user to "turn right" is generated and displayed on the terminal. The user can then confirm this and quickly move their vehicle to the right to safely yield to the emergency vehicle.
[0075] Example of a prompt
[0076] "It analyzes surrounding audio data to detect the approach of emergency vehicles. Based on the direction and distance of the sound, it generates driving instructions in sign language and notifies the driver."
[0077] Based on the above, this invention can provide important support for visually impaired drivers to drive safely.
[0078] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0079] Step 1:
[0080] The terminal continuously collects ambient sounds using a high-sensitivity microphone installed in the vehicle. The input for this process is audio signals from the external environment, and the output is digital audio data. This digital audio data is temporarily stored within the terminal before being compressed and encrypted.
[0081] Step 2:
[0082] The terminal applies a compression algorithm to the collected audio data to enable efficient network transmission. Next, the compressed data is encrypted to protect its confidentiality. The output of this process is encrypted, compressed audio data in a transferable format. This data is then sent to the server using Wi-Fi or 4G/5G communication.
[0083] Step 3:
[0084] The server decrypts and decompresses the audio data received over the network. The input is encrypted, compressed audio data, and the output is uncompressed audio data in a format that can be processed. Acoustic analysis is then performed using this data.
[0085] Step 4:
[0086] The server performs acoustic analysis on uncompressed audio data. Here, the Librosa library is used to analyze the frequency spectrum and detect emergency vehicle siren sounds. The input is uncompressed audio data, and the output is a determination of whether or not it contains sounds of a specific emergency. Further processing is then carried out based on this result.
[0087] Step 5:
[0088] When a specific emergency sound is detected, the server estimates the direction and distance of the sound source. In this step, sound localization techniques are used to calculate the direction, followed by further data processing. The input is time-delay information of the audio signal, and the output is the estimated direction and distance of the sound source.
[0089] Step 6:
[0090] The server generates instructions for the user based on estimated sound source information. Here, a generation AI model is used to generate sign language videos to visually instruct the driver. The input is the direction and distance of the sound source, and the output is video data corresponding to the instruction content.
[0091] Step 7:
[0092] The generated sign language video is transmitted to the terminal. The terminal receives the video data and displays it on the head-up display inside the vehicle. The input is video data, and the output is a visual notification to the user.
[0093] Step 8:
[0094] The user views sign language images displayed on a head-up display and performs appropriate driving actions based on those images. This allows the user to safely respond to emergency vehicles. The input is sign language images, and the output is driving actions performed in accordance with the instructions.
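The compression on the terminal (Step 2) and the restoration on the server (Step 3) can be sketched as a round trip. The text names neither a compression algorithm nor a cipher, so the sketch assumes 16-bit PCM samples and zlib, and it leaves the encryption step as a comment only. The function names are hypothetical.

```python
import struct
import zlib


def pack_audio(samples_int16):
    """Terminal side: serialize 16-bit PCM samples and compress them."""
    raw = struct.pack(f"<{len(samples_int16)}h", *samples_int16)
    compressed = zlib.compress(raw, 6)
    # Encryption would be applied here; the cipher is not specified
    # in the text, so it is intentionally left out of this sketch.
    return compressed


def unpack_audio(payload):
    """Server side: decompress and restore the list of 16-bit samples."""
    # Decryption would come first, mirroring the step omitted above.
    raw = zlib.decompress(payload)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))
```

Because zlib is lossless, the server recovers exactly the samples the terminal captured; a lossy audio codec would reduce bandwidth further at the cost of altering the spectrum that the later analysis depends on.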
[0095] (Application Example 1)
[0096] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0097] Drivers with visual or hearing impairments face challenges in accurately recognizing and responding quickly and safely to emergencies occurring in their surroundings. Therefore, there is a need for systems that accurately notify drivers of important situations, such as the approach of emergency vehicles, and support safe driving.
[0098] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0099] In this invention, the server includes detection means for detecting sound, analysis means for analyzing the characteristics of the sound, and generation means for visualizing the instruction content based on the analysis results. This enables drivers with visual or auditory limitations to accurately perceive the situation and perform appropriate driving operations.
[0100] A "sound detection means" refers to a device or system that senses ambient sounds and acquires them as digital data.
[0101] "Analysis means for analyzing the characteristics of the sound" refers to a device or system that is responsible for the process of analyzing the characteristics of acquired audio data and identifying specific sounds or patterns.
[0102] "Generating means for visualizing instruction content based on analysis results" refers to a device or system that generates information in a format that visually conveys the instructions derived from the analyzed voice data to the driver.
[0103] "Estimation means for estimating direction and distance" refers to a device or system that measures the direction and distance of a sound source from audio data and presents that positional relationship to the driver.
[0104] "Generating visual information using sign language instructions" means expressing specific instructions as images using sign language as a visual method.
[0105] This invention is a system in which a terminal installed in a vehicle senses ambient sounds while the vehicle is in motion, uses audio analysis to assess situations such as the approach of an emergency vehicle, and provides visual information to the driver.
[0106] First, to detect sound, the terminal collects ambient sounds around the vehicle. Specifically, it acquires surrounding audio data using the microphone of a smartphone or a dedicated device.
[0107] Next, the server receives the audio data and analyzes its characteristics. Using speech recognition technologies such as Google® Cloud Speech-to-Text API and Amazon Transcribe, it analyzes the audio signal in real time. This allows it to recognize and identify the siren sounds of emergency vehicles.
[0108] Based on the analysis results, the instructions are visualized using a generative AI model. A deep learning framework such as TensorFlow® is used to generate instructional videos, including sign language. These videos are transmitted to smartphones and other display devices. The driver then uses this visual information to perform appropriate driving operations.
[0109] For example, if the sound of an emergency vehicle's siren is approaching from the right rear, the server analyzes the data and generates a specific sign language video indicating "turn right," which is then displayed on the terminal. This information allows the driver to quickly move their vehicle to the right and yield to the emergency vehicle.
[0110] An example of a prompt message would be: "Use your smartphone's microphone to analyze ambient sounds, identify emergency vehicle sirens, and generate sign language video based on the direction of the sound."
[0111] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0112] Step 1:
[0113] The device uses the microphone of a smartphone or other device to collect ambient sounds. The input is an audio signal, and the output is generated as audio data. This data is transmitted to a server in real time. Specifically, it captures noise inside the vehicle and external sounds and converts them into a digital format.
[0114] Step 2:
[0115] The server analyzes the received audio data. The input is the audio data generated in step 1, and the output is the identification of specific patterns or sounds within the audio. Here, the data is analyzed using tools such as the Google Cloud Speech-to-Text API to identify sounds like emergency vehicle sirens. Specifically, this involves analyzing the frequency characteristics and volume changes within the data.
[0116] Step 3:
[0117] The server generates instructional videos based on the analysis results. The input is the sound identification result obtained in step 2, and the output is instructional videos using sign language. Generative AI models built with a framework such as TensorFlow are used to visualize specific instructions and generate videos. Specifically, the system renders sign language frames based on the analysis results.
[0118] Step 4:
[0119] The server sends the generated instruction video to the terminal. The input is the visual information created in step 3, and the output is the video that can be viewed on the terminal. The terminal then performs the specific action of projecting this video onto a display device. This provides the user with real-time visual instructions.
[0120] Step 5:
[0121] The user performs driving operations based on sign language video displayed on the terminal. The input is the sign language video displayed in step 4, and the output is the change in the user's driving behavior. Specifically, actions are taken to ensure safety, such as appropriately changing lanes in response to the approach of an emergency vehicle.
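Step 2 above mentions analyzing volume changes in the audio data. One simple reading is to track the level of successive frames and check whether it is rising. The sketch below is an assumption-laden illustration: the frame length, the use of RMS level in decibels, and the least-squares slope are choices made here, not details given in the text.

```python
import math


def frame_levels_db(samples, frame_len):
    """RMS level of each full frame, in dB relative to amplitude 1.0."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Floor the value so that silent frames do not hit log10(0).
        levels.append(20.0 * math.log10(max(rms, 1e-12)))
    return levels


def level_slope_db_per_frame(levels):
    """Least-squares slope of level against frame index."""
    n = len(levels)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(levels) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(levels))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

A persistently positive slope suggests that the source is getting louder, which is consistent with an approaching vehicle, though it can also result from other causes such as an opened window.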
[0122] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0123] This invention combines a system that uses sound to provide emergency information to the hearing impaired with an emotion engine that recognizes the user's emotions and provides appropriate feedback. This enables driving assistance that takes the user's mental state into consideration.
[0124] First, the terminal collects ambient sounds through a microphone installed in the vehicle. The collected audio data is sent to a server via the network. The server analyzes this audio data in detail and detects sound patterns such as the sirens of emergency vehicles. When an emergency is detected through this analysis, the server estimates the direction and distance of the sound source and determines the necessary driving instructions.
[0125] In addition to acoustic analysis, the emotion engine receives input from the user's facial expressions and body sensors to understand the user's current emotional state. Based on this data, the server analyzes the user's emotions and provides appropriate feedback if stress or anxiety is elevated.
[0126] For example, if the server recognizes that an emergency vehicle is approaching from behind and the emotion engine determines that the user is feeling anxious, the video generation program will convert the instruction to "turn right" into sign language in a calm and soothing tone and display it on the terminal. In this case, easier-to-understand expressions or voice guidance may also be added.
[0127] The terminal immediately displays the generated sign language video and audio feedback on the head-up display, allowing the user to perform driving maneuvers to avoid emergency vehicles based on the provided instructions.
[0128] Thus, the present invention makes it possible to provide a safe and comfortable driving environment through advanced responses that reflect the user's emotional state.
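How an estimated emotional state might change the presentation of an instruction can be sketched as below. The stress score in the range 0 to 1, the 0.7 threshold, the slower playback speed, and the names Presentation and plan_presentation are all hypothetical; the text states only that calmer expressions or voice guidance may be added when anxiety is detected.

```python
from dataclasses import dataclass

# Hypothetical threshold; the text gives no numeric value.
STRESS_THRESHOLD = 0.7


@dataclass(frozen=True)
class Presentation:
    instruction: str
    playback_speed: float      # 1.0 = normal signing speed (assumed scale)
    add_voice_guidance: bool


def plan_presentation(instruction: str, stress_score: float) -> Presentation:
    """Choose a calmer presentation when the stress score is high."""
    if not 0.0 <= stress_score <= 1.0:
        raise ValueError("stress_score must be within [0, 1]")
    if stress_score >= STRESS_THRESHOLD:
        # Slower signing plus voice guidance as the calming variant.
        return Presentation(instruction, 0.8, True)
    return Presentation(instruction, 1.0, False)
```

Keeping the instruction itself unchanged and varying only its presentation means the emotion engine cannot alter which maneuver is recommended.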
[0129] The following describes the processing flow.
[0130] Step 1:
[0131] The terminal continuously collects ambient sounds through microphones placed in the vehicle. This collected audio data is transmitted to a server in real time.
[0132] Step 2:
[0133] The server receives audio data sent from the terminal and analyzes the audio using digital signal processing technology. It identifies characteristic sounds, such as the sirens of emergency vehicles, and detects emergencies.
[0134] Step 3:
[0135] The server estimates the direction and distance based on the detected emergency sound. It analyzes the sound intensity and direction to determine the optimal driving instructions for the user.
[0136] Step 4:
[0137] The server uses an emotion engine to collect data to analyze the user's current emotional state. This data is obtained from in-car cameras and sensors, and the system estimates emotions by analyzing the user's facial expressions and physiological responses.
[0138] Step 5:
[0139] The server takes the user's emotional state into consideration and customizes driving instructions for emergencies. If the emotion engine detects the user's stress or anxiety, it generates calming sign language videos or audio guidance.
[0140] Step 6:
[0141] The generated sign language video and audio feedback are immediately transmitted from the server to the terminal. The terminal displays this on a head-up display, providing the user with driving instructions.
[0142] Step 7:
[0143] The user checks the displayed sign language video and audio feedback and operates the vehicle according to the instructions. For example, if an evasive maneuver to the right is required, the user carefully turns the steering wheel to the right.
[0144] This series of processes enables safe and accurate driving assistance while taking the user's emotions into consideration.
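Step 3 of this flow uses sound intensity when estimating distance. Under a free-field, inverse-square assumption, a received level can be converted to a rough distance if a reference level at a known distance is available. The reference values are inputs that the text does not provide, and the function name is hypothetical.

```python
def distance_from_level(level_db: float,
                        ref_level_db: float,
                        ref_distance_m: float = 1.0) -> float:
    """Rough source distance in meters from a received sound level.

    Assumes free-field spreading, where the level drops by
    20 * log10(d / d_ref) dB relative to the reference distance.
    """
    return ref_distance_m * 10.0 ** ((ref_level_db - level_db) / 20.0)
```

Each 20 dB drop below the reference level corresponds to a tenfold increase in distance. Reflections from buildings and attenuation by the vehicle body break the free-field assumption, so the result is best treated as a coarse indication.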
[0145] (Example 2)
[0146] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0147] Auditory assistance technologies often fail to adequately communicate information to the hearing impaired or those with hearing limitations in emergency situations, lacking means to provide safety and a sense of security. This invention enhances visual guidance and enables safe movement by analyzing ambient acoustic information in real time and providing appropriate information visually, taking into account emotional states.
[0148] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0149] In this invention, the server includes acoustic acquisition means for acquiring and transmitting acoustic information, acoustic analysis means for performing analysis, and emotion monitoring means for estimating emotional states. This enables users to move and respond safely while reducing stress even in emergency situations.
[0150] An "acoustic acquisition means" is a device for acquiring ambient acoustic information and transmitting it to a data processing device as needed.
[0151] "Acoustic analysis means" refers to a processing device that analyzes received acoustic information and identifies acoustic patterns that indicate a specific event.
[0152] "Emotion monitoring means" refers to a detection device and analysis system for determining a user's emotional state based on their biometric information.
[0153] A "guidance generation means" is a function that generates guidance and instructions suitable for the user based on the results of acoustic and emotional analysis.
[0154] "Display control means" refers to a control device for effectively presenting generated visual information on a display device.
[0155] "Position estimation means" refers to a technique for estimating the spatial location of a sound source using acoustic information.
[0156] A "sign language generation method" is a technology that converts audiovisual information into sign language and provides users with visual instructions.
[0157] This invention is an in-vehicle system for providing emergency information to users with hearing impairments. It analyzes acoustic information and the user's emotional state to provide appropriate instructions visually. The specific configuration for carrying out this invention is described below.
[0158] The system primarily consists of terminals and servers. First, the terminals are installed inside the vehicle and are equipped with microphones and cameras. The terminals collect ambient acoustic information and transmit it to the server in real time. Standard wireless communication technology is used for this communication.
[0159] The server is an advanced data processing device that analyzes the received acoustic data. Machine learning algorithms are used as the acoustic analysis means to identify specific acoustic patterns, such as the sound of an emergency vehicle siren. Common cloud analysis services can be used for this process. If an emergency is recognized as a result of the analysis, the server uses position estimation means to estimate the direction and distance of the sound source.
[0160] Furthermore, the terminal collects the user's facial expression data and biometric information, such as heart rate, and transmits them to the server. The server uses the emotion monitoring means to analyze the user's emotional state and assess the user's stress and anxiety. Based on this emotional data, the server uses the guidance generation means to create calm and appropriate instructions for the user. A generative AI model is used here, with prompt sentences adjusting the tone and content of the instructions.
[0161] Instructions are generated as sign language or visual information and projected as images onto the terminal's display device. Based on this visual information, users can safely avoid emergency vehicles. This allows users with hearing impairments to continue driving with peace of mind.
[0162] As a concrete example, suppose the device detects an ambulance siren and senses anxiety from the user's facial expression. In this case, the server displays the sign language message "Move to the right" in calm colors on the head-up display. At this time, the prompt message is set to "Generate a gentle instruction to move aside for emergency vehicles." This allows the user to respond appropriately.
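The concrete example above can be sketched as a small prompt-building routine. This is only an illustration: the `Detection` type, the `build_prompt` function, the anxiety score in the range 0 to 1, and the threshold of 0.5 are hypothetical names and values that do not appear in this disclosure. Only the wording "Generate a gentle instruction to move aside for emergency vehicles." is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Result of the acoustic analysis (hypothetical structure)."""
    sound_type: str  # e.g. "ambulance siren"
    direction: str   # e.g. "right rear"


def build_prompt(detection: Detection, anxiety: float, threshold: float = 0.5) -> str:
    """Compose a prompt sentence for the generative AI model.

    `anxiety` is an assumed score in [0, 1] from the emotion monitoring
    means; at or above `threshold` a gentle tone is requested.
    """
    tone = "gentle" if anxiety >= threshold else "concise"
    return (
        f"Generate a {tone} instruction to move aside for emergency vehicles. "
        f"Detected sound: {detection.sound_type}, "
        f"approaching from the {detection.direction}."
    )
```

For an ambulance siren from the right rear and an anxiety score of 0.8, the resulting prompt begins with the sentence quoted in the example above.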
[0163] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0164] Step 1:
[0165] The terminal uses the acoustic acquisition means to acquire acoustic information from the microphone as input. The collected acoustic data is subjected to noise reduction processing and output as clear audio data. This data is transmitted to the server using a communication device.
[0166] Step 2:
[0167] The server uses the acoustic analysis means to identify specific acoustic patterns, such as emergency vehicle sirens, in the received acoustic data. Here, machine learning algorithms are applied and the data is compared with existing acoustic libraries to obtain a highly accurate analysis result. The analysis result is passed to the next step.
[0168] Step 3:
[0169] The server estimates the spatial location of a sound source based on acoustic information. Using the analysis results as input, it calculates the direction and distance of the sound source using triangulation and outputs it as estimated location data. This data is used as the basic data for generating driving instructions.
[0170] Step 4:
[0171] The terminal uses the emotion monitoring means to collect the user's facial expression data and biometric information, such as heart rate, as input. This data is transmitted to the server in real time.
[0172] Step 5:
[0173] The server takes the collected biometric information as input and analyzes the user's emotional state using the emotion monitoring means. The output is a numerical representation of emotional states such as stress and anxiety. These analysis results are used for guidance generation.
[0174] Step 6:
[0175] The server uses a generative AI model to generate appropriate driving instructions based on the acoustic and emotional data. It sets prompt sentences and creates instructions such as "Swerve to the right," phrased in a calm tone, which are then output as guidance content.
[0176] Step 7:
[0177] The terminal receives the generated instructions and outputs them as visual information to the head-up display using the display control means. The user confirms this visual information and performs the appropriate driving operations. This enables safe and secure travel.
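Step 3 above computes the direction and distance of the sound source by triangulation. A minimal sketch of one way to do this follows, assuming two microphone arrays at known positions that each report a bearing to the source. The two-array layout, the coordinate convention (angles in radians, counterclockwise from the x-axis), and the function names are assumptions for illustration and are not specified in this disclosure.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]


def triangulate(p1: Point, bearing1: float,
                p2: Point, bearing2: float) -> Optional[Point]:
    """Intersect two bearing lines and return the estimated source position.

    Returns None when the bearings are (nearly) parallel, because no
    unique intersection exists in that case.
    """
    d1 = (math.cos(bearing1), math.sin(bearing1))
    d2 = (math.cos(bearing2), math.sin(bearing2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]  # 2-D cross product d1 x d2
    if abs(denom) < 1e-9:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom  # distance along the first bearing
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])


def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two points in metres."""
    return math.hypot(b[0] - a[0], b[1] - a[1])
```

The estimated position and the distance from the vehicle can then serve as the basic data for generating driving instructions, as described in Step 3.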
[0178] (Application Example 2)
[0179] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0180] Conventional driver assistance systems for the hearing impaired are specialized in detecting voice information and are unable to provide real-time driving instructions that take into account the user's emotional state. Therefore, the challenge is to provide an environment in which users can operate the vehicle safely and smoothly without feeling anxiety or stress in emergency situations.
[0181] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0182] In this invention, the server includes an acoustic acquisition means for detecting sounds indicating an emergency, an emotion analysis means for evaluating the user's emotions, and a video creation means for visualizing the instructions. This makes it possible to detect an emergency and simultaneously provide appropriate instructions in real time that correspond to the user's emotions.
[0183] "Acoustic acquisition means" refers to devices or mechanisms for collecting sounds from around a vehicle.
[0184] "Sound wave processing means" refers to a technology or process that analyzes collected sound data and identifies sound characteristics that indicate an emergency.
[0185] "Emotion analysis means" refers to a system or algorithm that evaluates and analyzes a user's emotional state based on their facial expressions and biometric information.
[0186] "Video creation means" refers to technologies and devices that visually represent instructions provided to the user based on analyzed information.
[0187] "Display control means" refers to technologies and mechanisms for appropriately presenting generated visual instructions on a display device inside a vehicle.
[0188] A "driver assistance system" is a general term for technical devices and programs that assist drivers in safe driving.
[0189] "Sound source estimation means" refers to technologies and devices that estimate the direction and distance of detected sound and provide that information.
[0190] "Instructional display using sign language" refers to a technology or method for generating and communicating visual instructions based on sign language to users.
[0191] "Vibration feedback" is a feedback method that uses vibration to transmit information to the user through touch.
[0192] This driver assistance system is implemented using hardware such as microphones and cameras installed in the vehicle. Specifically, the system uses acoustic acquisition means to collect ambient sounds. The collected sound data is transmitted to a server and analyzed by sound wave processing means. Through this analysis, the server detects characteristic sounds such as emergency vehicle sirens. In addition, emotion analysis means are used to analyze the user's emotional state based on data from cameras and biosensors. For example, a model utilizing TensorFlow analyzes facial expressions.
[0193] Based on the analysis results and emotional information, the video creation means generates appropriate visual instructions. The generated instructions are expressed as sign language videos or vibration feedback and provided to the driver. The display control means presents this information on the in-vehicle display device. For example, an instruction such as "Turn right to avoid the obstacle" is displayed on the screen in a gentle tone using sign language, and vibration feedback is sent to the user's smartwatch.
[0194] In this way, drivers receive real-time safe driving assistance, allowing them to drive with peace of mind while reducing anxiety and tension. An example of a prompt message given to the generative AI model is: "An emergency vehicle is approaching. Turn right gently. Generate instructions in sign language and vibration that take the user's emotions into consideration."
[0195] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0196] Step 1:
[0197] The terminal acquires ambient sounds through a microphone mounted on the vehicle. The acquired audio data is transmitted from the terminal to a server via the network. This data is primarily used as analysis material for identifying emergency sounds.
[0198] Step 2:
[0199] The server analyzes the received audio data using the sound wave processing means. Using software such as Librosa, it extracts features from the audio data and detects specific acoustic patterns, such as emergency vehicle sirens. This analysis determines whether an emergency situation exists. The output is the direction of the sound source and the presence or absence of an emergency.
[0200] Step 3:
[0201] The server acquires the user's biometric data from a camera or smartwatch. Using the emotion analysis means, it analyzes the user's emotional state from facial expression data with a deep learning model built on a framework such as TensorFlow. The output is the user's emotional state (e.g., stress level and anxiety level).
[0202] Step 4:
[0203] The server visualizes the instructions using the video creation means, based on the results of the acoustic analysis and the emotional state data. The video creation process generates gentle sign language videos appropriate to the user's emotional state. The output is video data containing the instructions.
[0204] Step 5:
[0205] The terminal displays the generated video data on the vehicle's head-up display. Using the display control means, it presents the user with instructions in a gentle tone, rendered in sign language, together with vibration feedback. As a result, the user can safely perform driving operations based on the displayed information.
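Step 2 above extracts features from the audio data and detects siren-like patterns. The sketch below illustrates the idea with the Goertzel algorithm, which measures the signal power at a single frequency. It is a standard-library stand-in and not the Librosa-based processing named in the text. The candidate frequencies of 770 Hz and 960 Hz, the threshold of 0.5, and the function names are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def goertzel_power(samples: Sequence[float], sample_rate: float, freq: float) -> float:
    """Return |X(k)|^2 for the DFT bin nearest to `freq` (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * freq / sample_rate)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def tone_ratio(samples: Sequence[float], sample_rate: float, freq: float) -> float:
    """Fraction of the frame energy at `freq` (about 1.0 for a pure tone on a bin)."""
    energy = sum(x * x for x in samples)
    if energy == 0.0:
        return 0.0
    return 2.0 * goertzel_power(samples, sample_rate, freq) / (len(samples) * energy)


def is_siren_like(samples: Sequence[float], sample_rate: float,
                  freqs: Sequence[float] = (770.0, 960.0),  # assumed siren tones
                  threshold: float = 0.5) -> bool:
    """Flag a frame when one candidate tone dominates its energy."""
    return max(tone_ratio(samples, sample_rate, f) for f in freqs) >= threshold
```

A practical detector would also track how the dominant tone changes over successive frames, since a single-frame check like this one cannot distinguish a siren from any other steady tone at the same frequency.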
[0206] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0207] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0208] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0209] [Second Embodiment]
[0210] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0211] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0212] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0213] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0214] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0215] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0216] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0217] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0218] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0219] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0220] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0221] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0222] This invention is a system for providing emergency information to visually impaired drivers using sound, and involves a server, a terminal, and a user.
[0223] First, the terminal continuously collects ambient sounds via microphones installed in the vehicle. The collected audio data is transmitted to a server. The server analyzes the received audio data and performs digital signal processing to identify specific sounds, such as the siren of an emergency vehicle. If this acoustic analysis detects the approach of an emergency vehicle, the system proceeds to the next stage.
[0224] From the detected acoustic data, the server estimates the direction and distance. Based on this, a video generation program within the server generates sign language video instructions for the driver. These videos include specific instructions such as "turn right," "turn left," and "stop."
[0225] The generated sign language video is transmitted directly to the terminal. The terminal immediately displays this video on the vehicle's head-up display, providing the user with visual information. Based on the displayed sign language video, the user can quickly perform driving maneuvers and take appropriate actions to safely avoid emergency vehicles.
[0226] For example, if the terminal detects the sound of an emergency vehicle's siren, the server analyzes the direction of the sound and confirms that it is approaching from the right rear. Based on this, the server generates a sign language video instructing the user to "turn right" and sends it to the terminal. The user then confirms this, appropriately moves their vehicle to the right, and yields to the emergency vehicle.
[0227] This system enables drivers with hearing impairments to respond quickly and safely to emergency vehicles, providing driving assistance.
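The right-rear example above can be sketched as a lookup from the estimated bearing to a driving instruction. In this sketch the bearing convention (degrees clockwise from the vehicle's heading, 0 = straight ahead), the eight 45-degree sectors, and the function names are assumptions. Only the mapping from "right rear" to "turn right" comes from the text. Every other sector falls back to a placeholder default that a real system would have to define according to local traffic rules.

```python
SECTORS = [
    "front", "right front", "right", "right rear",
    "rear", "left rear", "left", "left front",
]

# Only this entry is taken from the example in the text.
INSTRUCTIONS = {"right rear": "turn right"}


def sector_of(bearing_deg: float) -> str:
    """Name the 45-degree sector that contains the bearing."""
    index = int(((bearing_deg % 360) + 22.5) // 45) % 8
    return SECTORS[index]


def choose_instruction(bearing_deg: float, default: str = "stop") -> str:
    """Pick an instruction for the sector; `default` is a placeholder."""
    return INSTRUCTIONS.get(sector_of(bearing_deg), default)
```

With these assumptions, a siren at a bearing of 135 degrees falls in the "right rear" sector and yields the instruction "turn right", matching the example.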
[0228] The following describes the processing flow.
[0229] Step 1:
[0230] The terminal uses a microphone mounted on the vehicle to continuously collect ambient sounds. The collected audio data undergoes noise reduction processing before being sent to a server to improve the accuracy of sound detection.
[0231] Step 2:
[0232] The server receives audio data transmitted from the terminal. Using a digital signal processing algorithm, the server analyzes the audio data and detects specific sound patterns, such as emergency vehicle sirens.
[0233] Step 3:
[0234] When the sound of an emergency vehicle's siren is detected, the server estimates its direction and distance. This estimation utilizes changes in the siren's intensity and frequency.
[0235] Step 4:
[0236] The server assesses the urgency of the situation and determines the necessary driving instructions based on that assessment. These instructions include options such as "turn right," "turn left," and "stop," and the most appropriate one is selected depending on the situation.
[0237] Step 5:
[0238] The AI on the server generates sign language videos based on the determined driving instructions. These sign language videos use actions and symbols to express the instructions and are designed to be easily understood visually.
[0239] Step 6:
[0240] The generated sign language video is transmitted to the terminal in real time. A communication protocol is used to minimize delays, enabling drivers to respond quickly.
[0241] Step 7:
[0242] The terminal receives sign language video sent from the server and immediately displays it on the head-up display. This allows the user to visually receive driving instructions.
[0243] Step 8:
[0244] The user checks the displayed sign language video and performs driving operations accordingly. For example, if the instruction is "turn right," the user turns the steering wheel to the right and takes appropriate action to avoid an emergency vehicle.
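Step 3 above estimates direction and distance from changes in the siren's intensity and frequency. The sketch below shows two textbook relations that such an estimate could draw on: the Doppler shift for a source approaching a stationary listener, and the inverse-square law for sound level in a free field. The assumptions (a known nominal siren frequency, a stationary listener, no reflections or obstructions, a known reference level) and all function names are illustrative and are not specified in this disclosure.

```python
SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at 20 degrees C


def closing_speed(f_observed: float, f_nominal: float,
                  c: float = SPEED_OF_SOUND) -> float:
    """Speed of the source toward a stationary listener, in m/s.

    Derived from f_observed = f_nominal * c / (c - v).
    A positive result means the source is approaching.
    """
    return c * (1.0 - f_nominal / f_observed)


def distance_from_level(level_db: float, ref_level_db: float,
                        ref_distance_m: float) -> float:
    """Distance at which `level_db` would be measured, in metres.

    Uses the free-field rule that the level drops by 20*log10(r2/r1) dB
    when the distance grows from r1 to r2.
    """
    return ref_distance_m * 10.0 ** ((ref_level_db - level_db) / 20.0)
```

For example, a level 20 dB below a reference measured at 1 m corresponds to a distance of about 10 m under these idealized assumptions. Real road environments with reflections and moving listeners would need a more careful model.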
[0245] (Example 1)
[0246] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0247] It is difficult for visually impaired drivers to recognize surrounding emergencies appropriately and quickly. Therefore, there is a need for effective means of providing information to warn of approaching emergency vehicles. Furthermore, conventional information identification and presentation technologies are prone to delays and inaccuracies in information, which can impair safe driving.
[0248] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0249] In this invention, the server includes an acoustic acquisition means for continuously acquiring ambient sounds, an acoustic analysis means for analyzing the transmitted sound data and identifying specific sounds, a position estimation means for estimating the direction and distance of sounds from the analysis results, and a video generation means for generating instructions as video. This makes it possible to quickly and accurately notify visually impaired drivers of emergencies through visual information.
[0250] "Acoustic acquisition means" refers to a device or system for continuously acquiring ambient sounds.
[0251] "Data transmission means" refers to a method or technique for transmitting collected acoustic information to another device or system.
[0252] "Acoustic analysis means" refers to a technology or device for analyzing sound data and identifying specific sounds from it.
[0253] "Position estimation means" refers to a method or system for estimating the direction and distance of a sound source.
[0254] "Image generation means" refers to a technology or device for creating specific instruction information as a visual image.
[0255] "Display means" refers to a device or system for visually showing the generated image to the user.
[0256] This invention is a support system for quickly notifying visually impaired drivers of approaching emergency vehicles through audio and visual information.
[0257] Hardware and software for implementation
[0258] First, the terminal functions as an audio collection device installed in the vehicle. This device incorporates a high-sensitivity microphone to collect ambient sounds in real time. The collected audio information is transmitted to a server via a network module. To optimize network transmission, the device often utilizes Wi-Fi or mobile communication technologies.
[0259] The server is equipped with acoustic analysis software to analyze the received audio data. This analysis utilizes acoustic signal processing libraries, with Librosa being a specific example. Using this library, the server performs frequency analysis to identify the siren sound of a particular emergency vehicle.
[0260] Based on the audio analysis results, the server also runs a position estimation algorithm to estimate the direction and distance of the sound source. This algorithm employs a technique that uses the time difference of the acoustic signals.
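The time-difference technique mentioned above can be illustrated for the simplest case of two microphones. Under a far-field assumption, the arrival-time difference dt between microphones spaced d apart gives the bearing as asin(c * dt / d), where c is the speed of sound. The two-microphone layout, the sign convention (positive delay means the source is to the right), and the function name are assumptions for illustration only.

```python
import math


def bearing_from_tdoa(delay_s: float, mic_spacing_m: float,
                      c: float = 343.0) -> float:
    """Bearing of a far-field source in degrees from the array's broadside.

    `delay_s` is the arrival time at the left microphone minus the
    arrival time at the right microphone (assumed convention), so a
    positive value places the source on the right.
    """
    x = c * delay_s / mic_spacing_m
    x = max(-1.0, min(1.0, x))  # clamp against measurement noise
    return math.degrees(math.asin(x))
```

A single microphone pair cannot tell front from rear, so a direction such as "right rear" would require additional microphones or other information.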
[0261] To visualize the instructions, the server generates driving instructions for the user as sign language videos using a generative AI model. Using generative AI in this process allows the video to be generated quickly.
[0262] The generated video is sent back to the terminal and displayed on the vehicle's head-up display. This allows the user to take intuitive driving actions.
[0263] Specific example
[0264] For example, if the microphone picks up the sound of an emergency vehicle's siren while the user is driving, the server will analyze the results to indicate that the sound is approaching from the right rear. Based on this, a sign language video instructing the user to "turn right" is generated and displayed on the device. The user can then confirm this and quickly move their vehicle to the right to safely yield to the emergency vehicle.
[0265] Example of a prompt
[0266] "Analyze the surrounding audio data to detect the approach of emergency vehicles. Based on the direction and distance of the sound, generate driving instructions in sign language and notify the driver."
[0267] Based on the above, this invention can provide important support for visually impaired drivers to drive safely.
[0268] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0269] Step 1:
[0270] The terminal continuously collects ambient sounds using a high-sensitivity microphone installed in the vehicle. The input for this process is audio signals from the external environment, and the output is digital audio data. This digital audio data is temporarily stored within the terminal before being compressed and encrypted.
[0271] Step 2:
[0272] The terminal applies a compression algorithm to the collected audio data to enable efficient network transmission. Next, the compressed data is encrypted to protect its confidentiality. The output of this process is compressed and encrypted audio data in a transferable format. This data is then sent to the server using Wi-Fi or 4G/5G communication.
[0273] Step 3:
[0274] The server decrypts and decompresses the audio data received over the network. The input is encrypted, compressed audio data, and the output is uncompressed audio data in a format that can be processed. Acoustic analysis is then performed using this data.
[0275] Step 4:
[0276] The server performs acoustic analysis on uncompressed audio data. Here, the Librosa library is used to analyze the frequency spectrum and detect emergency vehicle siren sounds. The input is uncompressed audio data, and the output is a determination of whether or not it contains sounds of a specific emergency. Further processing is then carried out based on this result.
[0277] Step 5:
[0278] When a specific emergency sound is detected, the server estimates the direction and distance of the sound source. In this step, sound localization techniques are used to calculate the direction, followed by further data processing. The input is time-delay information of the audio signal, and the output is the estimated direction and distance of the sound source.
[0279] Step 6:
[0280] The server generates instructions for the user based on the estimated sound source information. Here, a generative AI model is used to generate sign language videos to visually instruct the driver. The input is the direction and distance of the sound source, and the output is video data corresponding to the instruction content.
[0281] Step 7:
[0282] The generated sign language video is transmitted to the terminal. The terminal receives the video data and displays it on the head-up display inside the vehicle. The input is video data, and the output is a visual notification to the user.
[0283] Step 8:
[0284] The user checks the sign language video displayed on the head-up display and performs appropriate driving operations based on it. As a result, the user can safely respond to emergency vehicles. The input is the sign language video, and the output is the driving action following the instructions.
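The compression in Step 2 and the decompression in Step 3 can be sketched as a lossless round trip. The disclosure does not name a compression algorithm, so the use of zlib (DEFLATE), 16-bit little-endian PCM samples, and the function names below are assumptions. The encryption step is deliberately omitted, because it would require a vetted cryptographic library rather than hand-written code.

```python
import struct
import zlib
from typing import List, Sequence


def pack_pcm16(samples: Sequence[int]) -> bytes:
    """Serialize signed 16-bit samples as little-endian PCM bytes."""
    return struct.pack(f"<{len(samples)}h", *samples)


def unpack_pcm16(pcm_bytes: bytes) -> List[int]:
    """Inverse of pack_pcm16."""
    count = len(pcm_bytes) // 2
    return list(struct.unpack(f"<{count}h", pcm_bytes))


def compress_audio(pcm_bytes: bytes, level: int = 6) -> bytes:
    """Terminal side: losslessly compress PCM data before transmission."""
    return zlib.compress(pcm_bytes, level)


def decompress_audio(payload: bytes) -> bytes:
    """Server side: restore the original PCM bytes."""
    return zlib.decompress(payload)
```

Because the round trip is lossless, the server's acoustic analysis in Step 4 sees exactly the samples the terminal recorded. A lossy audio codec would reduce the data further but could alter the frequency content that the siren detection relies on.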
[0285] (Application Example 1)
[0286] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".
[0287] There is a problem that it is difficult for drivers with visual or auditory constraints to accurately recognize emergency situations occurring around them and respond quickly and safely. Therefore, there is a need for a system that accurately notifies drivers of important situations such as the approach of emergency vehicles and supports safe driving.
[0288] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0289] In this invention, the server includes a detection means for detecting sound, an analysis means for analyzing the characteristics of the sound, and a generation means for visualizing the instruction content based on the analysis result. As a result, even drivers with visual or auditory constraints can achieve accurate situation recognition and appropriate driving operations.
[0290] The "detection means for detecting sound" is a device or system that senses ambient sound and acquires it as digital data.
[0291] The "analysis means for analyzing the characteristics of the sound" is a device or system responsible for analyzing the characteristics of the acquired audio data and identifying specific sounds or patterns.
[0292] The "generation means for visualizing the instruction content based on the analysis result" is a device or system that generates information in a form that visually conveys the instructions derived from the analyzed audio data to the driver.
[0293] "Estimation means for estimating direction and distance" refers to a device or system that measures the direction and distance of a sound source from audio data and presents that positional relationship to the driver.
[0294] "Generating visual information using sign language instructions" means expressing specific instructions as images using sign language as a visual method.
[0295] This invention is a system in which a terminal installed in a vehicle senses ambient sounds while the vehicle is in motion, the situation, such as the approach of an emergency vehicle, is analyzed by audio analysis, and visual information is provided to the driver.
[0296] First, the terminal collects ambient sounds around the vehicle in order to detect sound. Specifically, it acquires surrounding audio data using the microphone of a smartphone or a dedicated device.
[0297] Next, the server receives the audio data and analyzes its characteristics. Using speech recognition technologies such as Google Cloud Speech-to-Text API and Amazon Transcribe, it analyzes the audio signal in real time. This allows it to recognize and identify the sound of emergency vehicle sirens.
[0298] Based on the analysis results, the instructions are visualized using a generative AI model. A deep learning framework such as TensorFlow is used to generate instructional videos, including sign language. These videos are transmitted to smartphones and other display devices. The driver then uses this visual information to perform appropriate driving maneuvers.
[0299] For example, if the sound of an emergency vehicle's siren is approaching from the right rear, the server analyzes the data and generates a specific sign language video indicating "turn right," which is then displayed on the terminal. This information allows the driver to quickly move their vehicle to the right and yield to the emergency vehicle.
[0300] Examples of prompt texts include instructions such as "Use the microphone of a smartphone to analyze the surrounding sounds, identify the siren sound of an emergency vehicle, and generate sign language videos based on the direction of the sound."
[0301] The flow of the specific processing in Application Example 1 will be described using Figure 12.
[0302] Step 1:
[0303] The terminal uses the microphone of a smartphone or device to collect the surrounding environmental sounds. The input is an audio signal, and audio data is generated as the output. This data is transmitted to the server in real time. Specifically, it performs operations such as capturing the noise inside the vehicle and external sounds and converting them into a digital format.
[0304] Step 2:
[0305] The server analyzes the received audio data. The input is the audio data generated in Step 1, and the output is the specific patterns in the audio and the identification result of the sound. Here, Google Cloud Speech-to-Text API or the like is used to perform text analysis on the data and identify the siren sound of an emergency vehicle, etc. Specifically, it includes operations such as analyzing the frequency characteristics and volume changes in the data.
[0306] Step 3:
[0307] The server generates an instruction video based on the analysis result. The input is the sound identification result obtained in Step 2, and the output is an instruction video using sign language. A generative AI model built with a deep learning framework such as TensorFlow is used to visualize specific instructions and generate the video. As a specific operation, sign language frames are rendered based on the analysis result.
[0308] Step 4:
[0309] The server sends the generated instruction video to the terminal. The input is the visual information created in step 3, and the output is the video that can be viewed on the terminal. The terminal then performs the specific action of projecting this video onto a display device. This provides the user with real-time visual instructions.
[0310] Step 5:
[0311] The user performs driving operations based on sign language video displayed on the terminal. The input is the sign language video displayed in step 4, and the output is the change in the user's driving behavior. Specifically, actions are taken to ensure safety, such as appropriately changing lanes in response to the approach of an emergency vehicle.
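The "volume changes" analysis mentioned in Step 2 can be illustrated with a minimal sketch. It assumes the audio arrives as a list of float samples, and the frame length and slope threshold are hypothetical values chosen for illustration only; the source does not specify them. The sketch computes a per-frame RMS level in decibels and fits a straight line to it, treating a steadily rising level as a hint that a sound source is approaching.

```python
import math

def frame_rms_db(samples, frame_len):
    """Return the RMS level in dB for each full frame of `samples`."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        # A small offset avoids log10(0) on silent frames.
        levels.append(10.0 * math.log10(mean_sq + 1e-12))
    return levels

def level_slope(levels):
    """Least-squares slope of the level sequence, in dB per frame."""
    n = len(levels)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(levels) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(levels))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def is_getting_louder(samples, frame_len=800, min_slope_db=0.5):
    """Hypothetical rule: a rising level trend suggests an approaching source."""
    return level_slope(frame_rms_db(samples, frame_len)) >= min_slope_db
```

A rising level alone does not identify a siren; in practice it would be combined with the frequency analysis also named in Step 2.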
[0312] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0313] This invention combines a system that uses sound to provide emergency information to the hearing impaired with an emotion engine that recognizes the user's emotions and provides appropriate feedback. This enables driving assistance that takes the user's mental state into consideration.
[0314] First, the terminal collects ambient sounds through a microphone installed in the vehicle. The collected audio data is sent to a server via the network. The server analyzes this audio data in detail and detects sound patterns such as the sirens of emergency vehicles. When an emergency is detected through this analysis, the server estimates the direction and distance of the sound source and determines the necessary driving instructions.
[0315] In addition to acoustic analysis, the emotion engine receives input from the user's facial expressions and body sensors to understand the user's current emotional state. Based on this data, the server analyzes the user's emotions and provides appropriate feedback if stress or anxiety is elevated.
[0316] For example, if the server recognizes that an emergency vehicle is approaching from behind and the emotion engine determines that the user is feeling anxious, the video generation program converts the "turn right" instruction into sign language with a calm and soothing tone and displays it on the terminal. In this case, easier-to-understand expressions or voice guidance may also be added.
[0317] The terminal immediately displays the generated sign language video and audio feedback on the head-up display, allowing the user to perform driving maneuvers to avoid emergency vehicles based on the provided instructions.
[0318] Thus, the present invention makes it possible to provide a safe and comfortable driving environment through advanced responses that reflect the user's emotional state.
[0319] The following describes the processing flow.
[0320] Step 1:
[0321] The terminal continuously collects ambient sounds through microphones placed in the vehicle. This collected audio data is transmitted to a server in real time.
[0322] Step 2:
[0323] The server receives audio data sent from the terminal and analyzes the audio using digital signal processing technology. It identifies characteristic sounds, such as the sirens of emergency vehicles, and detects emergencies.
[0324] Step 3:
[0325] The server estimates the direction and distance based on the detected emergency sound. It analyzes the sound intensity and direction to determine the optimal driving instructions for the user.
[0326] Step 4:
[0327] The server uses an emotion engine to collect data to analyze the user's current emotional state. This data is obtained from in-car cameras and sensors, and the system estimates emotions by analyzing the user's facial expressions and physiological responses.
[0328] Step 5:
[0329] The server takes the user's emotional state into consideration and customizes driving instructions for emergencies. If the emotion engine detects the user's stress or anxiety, it generates calming sign language videos or audio guidance.
[0330] Step 6:
[0331] The generated sign language video and audio feedback are immediately transmitted from the server to the terminal. The terminal displays this on a head-up display, providing the user with driving instructions.
[0332] Step 7:
[0333] The user checks the displayed sign language video and audio feedback and operates the vehicle according to the instructions. For example, if an evasive maneuver to the right is required, the user carefully turns the steering wheel to the right.
[0334] This series of processes enables safe and accurate driving assistance while taking the user's emotions into consideration.
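The customization in Step 5 can be sketched as a small decision rule. This is an illustration under assumptions, not the disclosed implementation: the stress and anxiety scores are assumed to be normalized to the range 0 to 1, and the `Presentation` fields, the 0.6 threshold, and the 0.8 signing speed are hypothetical names and values introduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Presentation:
    instruction: str          # driving instruction to convey, e.g. "turn right"
    signing_speed: float      # 1.0 = normal playback speed (hypothetical scale)
    tone: str                 # "calm" or "neutral"
    add_audio_guidance: bool  # whether supplementary audio guidance is added

def customize_instruction(instruction, stress, anxiety, threshold=0.6):
    """Pick a calmer presentation when either emotion score is elevated."""
    elevated = max(stress, anxiety) >= threshold
    if elevated:
        # Slower, calmer signing plus supplementary guidance.
        return Presentation(instruction, 0.8, "calm", True)
    return Presentation(instruction, 1.0, "neutral", False)
```

For instance, `customize_instruction("turn right", 0.8, 0.2)` selects the calm presentation, while low scores on both measures leave the instruction in its neutral form.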
[0335] (Example 2)
[0336] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0337] Auditory assistance technologies often fail to adequately communicate information to the hearing impaired or those with hearing limitations in emergency situations, lacking means to provide safety and a sense of security. This invention enhances visual guidance and enables safe movement by analyzing ambient acoustic information in real time and providing appropriate information visually, taking into account emotional states.
[0338] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0339] In this invention, the server includes acoustic acquisition means for collecting and transmitting acoustic information, acoustic analysis means for performing analysis, and emotion monitoring means for estimating emotional states. This enables users to move and respond safely while reducing stress even in emergency situations.
[0340] An "acoustic acquisition means" is a device for acquiring ambient acoustic information and transmitting it to a data processing device as needed.
[0341] "Acoustic analysis means" refers to a processing device that analyzes received acoustic information and identifies acoustic patterns that indicate a specific event.
[0342] "Emotion monitoring means" refers to a detection device and analysis system for determining a user's emotional state based on their biometric information.
[0343] A "guidance generation means" is a function that generates guidance and instructions suitable for the user based on the results of acoustic and emotional analysis.
[0344] "Display control means" refers to a control device for effectively presenting generated visual information on a display device.
[0345] "Position estimation means" refers to a technique for estimating the spatial location of a sound source using acoustic information.
[0346] A "sign language generation method" is a technology that converts audiovisual information into sign language and provides users with visual instructions.
[0347] This invention is an in-vehicle system for providing emergency information to users with hearing impairments. It analyzes acoustic information and the user's emotional state to provide appropriate instructions visually. The specific configuration for carrying out this invention is described below.
[0348] The system primarily consists of terminals and servers. First, the terminals are installed inside the vehicle and are equipped with microphones and cameras. The terminals collect ambient acoustic information and transmit it to the server in real time. Standard wireless communication technology is used for this communication.
[0349] The server is an advanced data processing device that analyzes the received acoustic data. Machine learning algorithms are used as the acoustic analysis means to identify specific acoustic patterns, such as the sound of an emergency vehicle siren. Common cloud analysis services can be used for this process. If an emergency is recognized as a result of the analysis, the server uses position estimation means to estimate the direction and distance of the sound source.
[0350] Furthermore, the device collects user facial expression data and biometric information, such as heart rate, and transmits it to the server. The server uses emotion monitoring to analyze the user's emotional state and assesses the user's stress and anxiety. Based on this emotional data, the server uses guidance generation to create calm and appropriate instructions for the user. It utilizes a generative AI model to adjust the tone and content of the instructions using prompt sentences.
[0351] Instructions are generated as sign language or visual information and projected as images onto the terminal's display device. Based on this visual information, users can safely avoid emergency vehicles. This allows users with hearing impairments to continue driving with peace of mind.
[0352] As a concrete example, suppose the device detects an ambulance siren and senses anxiety from the user's facial expression. In this case, the server displays the sign language message "Move to the right" in calm colors on the head-up display. At this time, the prompt message is set to "Generate a gentle instruction to move aside for emergency vehicles." This allows the user to respond appropriately.
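How such a prompt sentence might be assembled from the analysis results can be sketched as follows. The function name, its parameters, and the wording other than the quoted prompt sentence are assumptions made for illustration; the source only gives the example prompt itself.

```python
def build_guidance_prompt(event, direction, instruction, anxious):
    """Assemble a prompt sentence for the generative AI model (hypothetical format)."""
    # A gentler tone is requested when the emotion analysis reports anxiety.
    tone = "gentle" if anxious else "clear"
    return (
        f"Generate a {tone} instruction to move aside for emergency vehicles. "
        f"Detected event: {event}. Sound source direction: {direction}. "
        f"Instruction to convey in sign language: {instruction}."
    )

prompt = build_guidance_prompt(
    event="ambulance siren",
    direction="right rear",
    instruction="Move to the right",
    anxious=True,
)
```

Keeping the tone as a single substituted word makes it straightforward to vary the prompt with the estimated emotional state while leaving the rest of the sentence fixed.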
[0353] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0354] Step 1:
[0355] The terminal uses an acoustic acquisition device to acquire acoustic information from the microphone as input. The collected acoustic data is subjected to noise reduction processing and output as clear audio data. This data is transmitted to the server using a communication device.
[0356] Step 2:
[0357] The server uses the acoustic analysis means to identify specific acoustic patterns, such as emergency vehicle sirens, in the received acoustic data. Here, machine learning algorithms are applied and the data is compared with existing acoustic libraries to obtain highly accurate analysis output. The analysis results are then passed to the next step.
[0358] Step 3:
[0359] The server estimates the spatial location of a sound source based on acoustic information. Using the analysis results as input, it calculates the direction and distance of the sound source using triangulation and outputs it as estimated location data. This data is used as the basic data for generating driving instructions.
[0360] Step 4:
[0361] The device uses emotion monitoring to collect user facial expression data and biometric information such as heart rate as input. This data is transmitted to the server in real time.
[0362] Step 5:
[0363] The server takes the collected biometric information as input and analyzes the user's emotional state using the emotion monitoring means. The output is a numerical representation of emotional states such as stress and anxiety. These analysis results are used for guidance generation.
[0364] Step 6:
[0365] The server uses a generative AI model to generate appropriate driving instructions based on acoustic and emotional data. It sets prompt sentences and creates instructions such as "Swerve to the right in a calm tone," which are then output as guidance content.
[0366] Step 7:
[0367] The terminal receives the generated instructions and outputs them as visual information to the head-up display using a display control mechanism. The user confirms this visual information and performs the driving operations appropriately. This enables safe and secure travel.
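The triangulation named in Step 3 can be illustrated with a minimal two-bearing sketch. It assumes, beyond what the source states, that two sensing points at known positions each report a bearing to the sound source, with angles in radians measured counterclockwise from the x-axis; the function name and coordinate convention are hypothetical.

```python
import math

def triangulate(p1, theta1, p2, theta2):
    """Intersect two bearing rays; return the (x, y) source position or None.

    p1, p2: (x, y) positions of the two sensing points.
    theta1, theta2: bearings from each point, radians from the +x axis.
    """
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    # 2-D cross product of the two directions; zero means parallel bearings.
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    # Distance along the first ray to the intersection point.
    t1 = (dx * d2[1] - dy * d2[0]) / denom
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])

# Example geometry: a source at (3, 4) observed from (0, 0) and (2, 0).
position = triangulate((0.0, 0.0), math.atan2(4, 3), (2.0, 0.0), math.atan2(4, 1))
distance_from_p1 = math.hypot(position[0], position[1])
```

With sensing points mounted on a single vehicle the baseline is short, so small bearing errors translate into large distance errors; the sketch ignores this and is meant only to show the geometry.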
[0368] (Application Example 2)
[0369] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the smart glasses 214 as the "terminal".
[0370] Conventional driver assistance systems for the hearing impaired are specialized in detecting voice information and are unable to provide real-time driving instructions that take into account the user's emotional state. Therefore, the challenge is to provide an environment in which users can operate the vehicle safely and smoothly without feeling anxiety or stress in emergency situations.
[0371] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0372] In this invention, the server includes an acoustic acquisition means for detecting sounds indicating an emergency, an emotion analysis means for evaluating the user's emotions, and a video creation means for visualizing the instructions. This makes it possible to detect an emergency and simultaneously provide appropriate instructions in real time that correspond to the user's emotions.
[0373] "Acoustic acquisition means" refers to devices or mechanisms for collecting sounds from around a vehicle.
[0374] "Sound wave processing means" refers to a technology or process that analyzes collected sound data and identifies sound characteristics that indicate an emergency.
[0375] "Emotion analysis means" refers to a system or algorithm that evaluates and analyzes a user's emotional state based on their facial expressions and biometric information.
[0376] "Video creation means" refers to technologies and devices that visually represent instructions provided to the user based on analyzed information.
[0377] "Display control means" refers to technologies and mechanisms for appropriately presenting generated visual instructions on a display device inside a vehicle.
[0378] A "driver assistance system" is a general term for technical devices and programs that assist drivers in safe driving.
[0379] "Sound source estimation means" refers to technologies and devices that estimate the direction and distance of detected sound and provide that information.
[0380] "Instructional display using sign language" refers to a technology or method for generating and communicating visual instructions based on sign language to users.
[0381] "Vibration feedback" is a feedback method that uses vibration to transmit information to the user through touch.
[0382] This driver assistance system is implemented using hardware such as microphones and cameras installed in the vehicle. Specifically, the system uses acoustic acquisition means to collect ambient sounds. The collected sound data is transmitted to a server and analyzed by sound wave processing means. Through this analysis, the server detects characteristic sounds such as emergency vehicle sirens. In addition, emotion analysis means are used to analyze the user's emotional state based on data from cameras and biosensors. For example, a model utilizing TensorFlow analyzes facial expressions.
[0383] Based on the analysis results and emotional information, the video creation system generates appropriate visual instructions. The generated instructions are expressed in the form of sign language videos or vibration feedback and provided to the driver. This information is presented on the in-vehicle display device by the display control system. For example, an instruction such as "Turn right to avoid the obstacle" is displayed on the screen in a gentle tone using sign language, and vibration feedback is sent to the user's smartwatch.
[0384] In this way, drivers receive real-time safe driving assistance, allowing them to drive with peace of mind while reducing anxiety and tension. An example of a prompt message generated using the AI model is: "An emergency vehicle is approaching. Turn right gently. Generate instructions in sign language and vibration that take the user's emotions into consideration."
[0385] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0386] Step 1:
[0387] The terminal acquires ambient sounds through a microphone mounted on the vehicle. The acquired audio data is transmitted from the terminal to a server via the network. This data is primarily used as analysis material for identifying emergency sounds.
[0388] Step 2:
[0389] The server analyzes the received audio data using sound wave processing equipment. Using software such as Librosa, it extracts features from the sound wave data and detects specific acoustic patterns, such as emergency vehicle sirens. This analysis determines whether an emergency situation exists. The output provides the direction of the sound source and the presence or absence of an emergency.
[0390] Step 3:
[0391] The server acquires the user's biometric data from a camera or smartwatch. Using the emotion analysis means, it analyzes the user's emotional state from facial expression data with a deep learning model built on a framework such as TensorFlow. The output of the analysis is the user's emotional state (e.g., stress level and anxiety level).
[0392] Step 4:
[0393] The server visualizes the instructions using video creation tools based on the results of acoustic analysis and emotional state data. The video creation process generates gentle sign language videos appropriate to the user's emotional state. The output is video data containing the instructions.
[0394] Step 5:
[0395] The terminal displays the generated video data on the vehicle's head-up display. Using a display control system, the user is presented with instructions in a gentle tone, translated into sign language, and vibration feedback. As a result, the user can safely perform driving operations based on the displayed information.
[0396] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0397] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0398] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0399] [Third Embodiment]
[0400] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0401] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0402] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0403] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0404] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0405] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0406] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0407] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0408] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0409] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0410] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0411] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0412] This invention is a system for providing emergency information to visually impaired drivers using sound, and involves a server, a terminal, and a user.
[0413] First, the terminal continuously collects ambient sounds via microphones installed in the vehicle. The collected audio data is transmitted to a server. The server analyzes the received audio data and performs digital signal processing to identify specific sounds, such as the siren of an emergency vehicle. If this acoustic analysis detects the approach of an emergency vehicle, the system proceeds to the next stage.
[0414] From the detected acoustic data, the server estimates the direction and distance. Based on this, a video generation program within the server generates sign language video instructions for the driver. These videos include specific instructions such as "turn right," "turn left," and "stop."
[0415] The generated sign language video is transmitted directly to the terminal. The terminal immediately displays this video on the vehicle's head-up display, providing the user with visual information. Based on the displayed sign language video, the user can quickly perform driving maneuvers and take appropriate actions to safely avoid emergency vehicles.
[0416] For example, if the terminal detects the sound of an emergency vehicle's siren, the server analyzes the direction of the sound and confirms that it is approaching from the right rear. Based on this, the server generates a sign language video instructing the user to "turn right" and sends it to the terminal. The user then confirms this, appropriately moves their vehicle to the right, and yields to the emergency vehicle.
[0417] This system enables drivers with hearing impairments to respond quickly and safely to emergency vehicles, providing driving assistance.
[0418] The following describes the processing flow.
[0419] Step 1:
[0420] The terminal uses a microphone mounted on the vehicle to continuously collect ambient sounds. The collected audio data undergoes noise reduction processing before being sent to a server to improve the accuracy of sound detection.
[0421] Step 2:
[0422] The server receives audio data transmitted from the terminal. Using a digital signal processing algorithm, the server analyzes the audio data and detects specific sound patterns, such as emergency vehicle sirens.
[0423] Step 3:
[0424] When the sound of an emergency vehicle's siren is detected, the server estimates its direction and distance. This estimation utilizes changes in the siren's intensity and frequency.
[0425] Step 4:
[0426] The server assesses the urgency of the situation and determines the necessary driving instructions based on that assessment. These instructions include options such as "turn right," "turn left," and "stop," and the most appropriate one is selected depending on the situation.
[0427] Step 5:
[0428] The AI on the server generates sign language videos based on the determined driving instructions. These sign language videos use actions and symbols to express the instructions and are designed to be easily understood visually.
[0429] Step 6:
[0430] The generated sign language video is transmitted to the terminal in real time. A communication protocol is used to minimize delays, enabling drivers to respond quickly.
[0431] Step 7:
[0432] The terminal receives sign language video sent from the server and immediately displays it on the head-up display. This allows the user to visually receive driving instructions.
[0433] Step 8:
[0434] The user checks the displayed sign language video and performs driving operations accordingly. For example, if the instruction is "turn right," the user turns the steering wheel to the right and takes appropriate action to avoid an emergency vehicle.
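Step 3 above states that the estimation uses changes in the siren's intensity and frequency. One way to read this is as a Doppler shift plus inverse-square attenuation, sketched below. This interpretation is an assumption: the sketch treats the listener as stationary, assumes the siren's emitted frequency is known, and assumes free-field propagation. None of these conditions is stated in the source.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C

def approach_speed(f_observed, f_emitted, c=SPEED_OF_SOUND):
    """Source speed toward a stationary listener, in m/s (positive = approaching).

    Derived from the Doppler relation f_observed = f_emitted * c / (c - v).
    """
    return c * (1.0 - f_emitted / f_observed)

def distance_ratio(level_db_then, level_db_now):
    """Ratio r_now / r_then under inverse-square spreading.

    Sound pressure level changes by 20 * log10(r_then / r_now) dB,
    so a rise of about 6 dB corresponds to halving the distance.
    """
    return 10.0 ** ((level_db_then - level_db_now) / 20.0)
```

The intensity change yields only a relative distance; an absolute distance would additionally require the source level, which the sketch does not assume.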
[0435] (Example 1)
[0436] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0437] It is difficult for visually impaired drivers to recognize surrounding emergencies appropriately and quickly. Therefore, there is a need for effective means of providing information to warn of approaching emergency vehicles. Furthermore, conventional information identification and presentation technologies are prone to delays and inaccuracies in information, which can impair safe driving.
[0438] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0439] In this invention, the server includes an acoustic acquisition means for continuously collecting ambient sounds, an acoustic analysis means for analyzing the transmitted sound data and identifying specific sounds, a position estimation means for estimating the direction and distance of sounds from the analysis results, and a video generation means for generating instructions as video. This makes it possible to quickly and accurately notify visually impaired drivers of emergencies through visual information.
[0440] "Acoustic acquisition means" refers to a device or system for continuously acquiring ambient sounds.
[0441] "Data transmission means" refers to a method or technique for transmitting collected acoustic information to another device or system.
[0442] "Acoustic analysis means" refers to a technology or device for analyzing sound data and identifying specific sounds from it.
[0443] "Position estimation means" refers to a method or system for estimating the direction and distance of a sound source.
[0444] "Image generation means" refers to a technology or device for creating specific instruction information as a visual image.
[0445] "Display means" refers to a device or system for visually showing the generated image to the user.
[0446] This invention is a support system for quickly notifying visually impaired drivers of approaching emergency vehicles through audio and visual information.
[0447] Hardware and software for implementation
[0448] First, the terminal functions as an audio collection device installed in the vehicle. This device incorporates a high-sensitivity microphone to collect ambient sounds in real time. The collected audio information is transmitted to a server via a network module. To optimize network transmission, the device often utilizes Wi-Fi or mobile communication technologies.
[0449] The server is equipped with acoustic analysis software to analyze the received audio data. This analysis utilizes acoustic signal processing libraries, with Librosa being a specific example. Using this library, the server performs frequency analysis to identify the siren sound of a particular emergency vehicle.
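The frequency analysis described above can be illustrated with a minimal sketch. It does not use Librosa; instead it applies the Goertzel algorithm from the Python standard library to measure how much of a frame's energy sits at a few candidate siren tones. The tone values (770 Hz and 960 Hz), the threshold of 0.5, and all function names are assumptions made for illustration and do not come from the disclosure.

```python
import math

# Assumed siren tones in Hz (illustrative values, not from the disclosure).
SIREN_TONES_HZ = (770.0, 960.0)


def goertzel_power(samples, sample_rate, freq):
    """Squared DFT magnitude at the bin nearest `freq` (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * freq / sample_rate)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def siren_energy_fraction(samples, sample_rate, tones=SIREN_TONES_HZ):
    """Approximate fraction of the frame energy found at the candidate tones."""
    energy = sum(x * x for x in samples)
    if energy == 0.0:
        return 0.0  # silent or empty frame
    n = len(samples)
    # Factor 2 accounts for the mirrored negative-frequency bin (Parseval).
    return sum(
        2.0 * goertzel_power(samples, sample_rate, f) / (n * energy)
        for f in tones
    )


def contains_siren(samples, sample_rate, threshold=0.5):
    """True if the candidate tones dominate the frame (threshold is assumed)."""
    return siren_energy_fraction(samples, sample_rate) >= threshold
```

A production system would more likely use a spectrogram-based or learned classifier, since real sirens sweep in frequency and are mixed with road noise; the sketch only shows the basic idea of testing for energy at known frequencies.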
[0450] Based on the audio analysis results, the server also runs a position estimation algorithm to estimate the direction and distance of the sound source. This algorithm employs a technique that uses the time difference of the acoustic signals.
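As a hedged illustration of estimating direction from the time difference of acoustic signals, the following sketch assumes two microphones (left and right) separated by a known spacing. It finds the sample lag that maximizes their cross-correlation and converts that lag to a bearing. The two-microphone arrangement, the function names, and the sign convention (positive angles toward the right microphone) are assumptions; the disclosure does not specify them. The sketch covers direction only, not distance.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at 20 degrees C


def best_lag(left, right, max_lag):
    """Lag in samples at which `right` best matches `left`.

    A positive lag means the right channel is delayed relative to the left.
    """
    n = len(left)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def bearing_degrees(left, right, sample_rate, mic_spacing_m):
    """Bearing of the source in degrees; positive means toward the right mic."""
    # Largest physically possible delay between the two microphones.
    max_lag = math.ceil(mic_spacing_m / SPEED_OF_SOUND_M_S * sample_rate)
    lag = best_lag(left, right, max_lag)
    # A delayed right channel means the source is on the left, hence the sign.
    sin_theta = -SPEED_OF_SOUND_M_S * lag / (sample_rate * mic_spacing_m)
    sin_theta = max(-1.0, min(1.0, sin_theta))
    return math.degrees(math.asin(sin_theta))
```

A single microphone pair cannot tell front from rear, so distinguishing "right rear" from "right front" would require additional microphones or other cues.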
[0451] To visualize the instructions, the server generates driving instructions for the user as sign language videos, based on a generation AI model. By utilizing AI generation technology in this process, video generation is performed quickly.
[0452] The generated video is sent back to the terminal and displayed on the vehicle's head-up display. This allows the user to take intuitive driving actions.
[0453] Specific example
[0454] For example, if the microphone picks up the sound of an emergency vehicle's siren while the user is driving, the server will analyze the results to indicate that the sound is approaching from the right rear. Based on this, a sign language video instructing the user to "turn right" is generated and displayed on the device. The user can then confirm this and quickly move their vehicle to the right to safely yield to the emergency vehicle.
[0455] Example of a prompt
[0456] "It analyzes surrounding audio data to detect the approach of emergency vehicles. Based on the direction and distance of the sound, it generates driving instructions in sign language and notifies the driver."
[0457] Based on the above, this invention can provide important support for visually impaired drivers to drive safely.
[0458] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0459] Step 1:
[0460] The terminal continuously collects ambient sounds using a high-sensitivity microphone installed in the vehicle. The input for this process is audio signals from the external environment, and the output is digital audio data. This digital audio data is temporarily stored within the terminal before being compressed and encrypted.
[0461] Step 2:
[0462] The terminal applies a compression algorithm to the collected audio data to enable efficient network transmission. Next, the compressed data is encrypted to protect its confidentiality. The output of this process is encrypted and compressed audio data in a transferable format. This data is then sent to the server using Wi-Fi or 4G/5G communication.
[0463] Step 3:
[0464] The server decrypts and decompresses the audio data received over the network. The input is encrypted, compressed audio data, and the output is uncompressed audio data in a format that can be processed. Acoustic analysis is then performed using this data.
[0465] Step 4:
[0466] The server performs acoustic analysis on uncompressed audio data. Here, the Librosa library is used to analyze the frequency spectrum and detect emergency vehicle siren sounds. The input is uncompressed audio data, and the output is a determination of whether or not it contains sounds of a specific emergency. Further processing is then carried out based on this result.
[0467] Step 5:
[0468] When a specific emergency sound is detected, the server estimates the direction and distance of the sound source. In this step, sound localization techniques are used to calculate the direction, followed by further data processing. The input is time-delay information of the audio signal, and the output is the estimated direction and distance of the sound source.
[0469] Step 6:
[0470] The server generates instructions for the user based on estimated sound source information. Here, a generation AI model is used to generate sign language videos to visually instruct the driver. The input is the direction and distance of the sound source, and the output is video data corresponding to the instruction content.
[0471] Step 7:
[0472] The generated sign language video is transmitted to the terminal. The terminal receives the video data and displays it on the head-up display inside the vehicle. The input is video data, and the output is a visual notification to the user.
[0473] Step 8:
[0474] The user views the sign language video displayed on the head-up display and performs appropriate driving actions based on it. This allows the user to safely respond to emergency vehicles. The input is the sign language video, and the output is driving actions performed in accordance with the instructions.
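Steps 2 and 3 above form a round trip: the terminal compresses and then encrypts the audio, and the server decrypts and then decompresses it. The sketch below shows only that ordering. It uses zlib as a stand-in for an audio codec and a SHA-256 counter-mode keystream as a placeholder for a real cipher; both choices, and all names, are assumptions for illustration. The placeholder keystream is not a vetted encryption scheme, and an actual deployment would use an authenticated cipher from an established cryptographic library.

```python
import hashlib
import zlib


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Placeholder keystream (SHA-256 in counter mode). Not a vetted cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        out += block
        counter += 1
    return bytes(out[:length])


def pack_audio(pcm: bytes, key: bytes, nonce: bytes) -> bytes:
    """Terminal side (Step 2): compress first, then encrypt."""
    compressed = zlib.compress(pcm)
    stream = _keystream(key, nonce, len(compressed))
    return bytes(a ^ b for a, b in zip(compressed, stream))


def unpack_audio(payload: bytes, key: bytes, nonce: bytes) -> bytes:
    """Server side (Step 3): decrypt first, then decompress."""
    stream = _keystream(key, nonce, len(payload))
    compressed = bytes(a ^ b for a, b in zip(payload, stream))
    return zlib.decompress(compressed)
```

Compression comes before encryption because encrypted data looks random and no longer compresses well.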
[0475] (Application Example 1)
[0476] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0477] Drivers with visual or hearing impairments face challenges in accurately recognizing and responding quickly and safely to emergencies occurring in their surroundings. Therefore, there is a need for systems that accurately notify drivers of important situations, such as the approach of emergency vehicles, and support safe driving.
[0478] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0479] In this invention, the server includes detection means for detecting sound, analysis means for analyzing the characteristics of the sound, and generation means for visualizing the instruction content based on the analysis results. This enables drivers with visual or auditory limitations to accurately perceive the situation and perform appropriate driving operations.
[0480] A "sound detection means" refers to a device or system that senses ambient sounds and acquires them as digital data.
[0481] "Analysis means for analyzing the characteristics of the sound" refers to a device or system that is responsible for the process of analyzing the characteristics of acquired audio data and identifying specific sounds or patterns.
[0482] "Generating means for visualizing instruction content based on analysis results" refers to a device or system that generates information in a format that visually conveys the instructions derived from the analyzed audio data to the driver.
[0483] "Estimation means for estimating direction and distance" refers to a device or system that measures the direction and distance of a sound source from audio data and presents that positional relationship to the driver.
[0484] "Generating visual information using sign language instructions" means expressing specific instructions as images using sign language as a visual method.
[0485] This invention is a system in which a terminal installed in a vehicle senses ambient sounds while the vehicle is in motion, analyzes the situation, such as the approach of an emergency vehicle, using acoustic analysis, and provides visual information to the driver.
[0486] First, the device collects ambient sounds from the vehicle in order to detect sound. Specifically, it acquires surrounding audio data using the microphone of a smartphone or a dedicated device.
[0487] Next, the server receives the audio data and analyzes its characteristics. Using speech recognition technologies such as Google Cloud Speech-to-Text API and Amazon Transcribe, it analyzes the audio signal in real time. This allows it to recognize and identify the sound of emergency vehicle sirens.
[0488] Based on the analysis results, the instructions are visualized using a generative AI model. A deep learning framework such as TensorFlow is used to generate instructional videos, including sign language. These videos are transmitted to smartphones and other display devices. The driver then uses this visual information to perform appropriate driving maneuvers.
[0489] For example, if the sound of an emergency vehicle's siren is approaching from the right rear, the server analyzes the data and generates a specific sign language video indicating "turn right," which is then displayed on the terminal. This information allows the driver to quickly move their vehicle to the right and yield to the emergency vehicle.
[0490] An example of a prompt message would be: "Use your smartphone's microphone to analyze ambient sounds, identify emergency vehicle sirens, and generate sign language video based on the direction of the sound."
[0491] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0492] Step 1:
[0493] The device uses the microphone of a smartphone or other device to collect ambient sounds. The input is an audio signal, and the output is generated as audio data. This data is transmitted to a server in real time. Specifically, it captures noise inside the vehicle and external sounds and converts them into a digital format.
[0494] Step 2:
[0495] The server analyzes the received audio data. The input is the audio data generated in step 1, and the output is the identification of specific patterns or sounds within the audio. Here, the data is analyzed using tools such as the Google Cloud Speech-to-Text API to identify sounds like emergency vehicle sirens. Specifically, this involves analyzing the frequency characteristics and volume changes within the data.
[0496] Step 3:
[0497] The server generates instructional videos based on the analysis results. The input is the sound identification result obtained in step 2, and the output is instructional videos using sign language. A generative AI model built with a deep learning framework such as TensorFlow is used to visualize specific instructions and generate videos. Specifically, the system renders sign language frames based on the analysis results.
[0498] Step 4:
[0499] The server sends the generated instruction video to the terminal. The input is the visual information created in step 3, and the output is the video that can be viewed on the terminal. The terminal then performs the specific action of projecting this video onto a display device. This provides the user with real-time visual instructions.
[0500] Step 5:
[0501] The user performs driving operations based on sign language video displayed on the terminal. The input is the sign language video displayed in step 4, and the output is the change in the user's driving behavior. Specifically, actions are taken to ensure safety, such as appropriately changing lanes in response to the approach of an emergency vehicle.
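Step 2 above mentions analyzing volume changes within the audio data. One simple way to express this, offered only as a sketch, is to compute the RMS level of consecutive frames and fit a straight line to those levels: a positive slope suggests the sound is getting louder. Treating a rising level as a hint that the source is approaching is an assumption of this example, as are the function names and the frame length.

```python
import math


def frame_rms(samples, frame_len):
    """RMS level of each complete frame of `frame_len` samples."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def volume_trend(levels):
    """Least-squares slope of level versus frame index (positive = louder)."""
    n = len(levels)
    if n < 2:
        return 0.0  # not enough frames to estimate a trend
    mean_x = (n - 1) / 2.0
    mean_y = sum(levels) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(levels))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

In practice the trend would be evaluated only on frames already identified as containing a siren, so that unrelated noise does not influence the slope.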
[0502] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0503] This invention combines a system that uses sound to provide emergency information to the hearing impaired with an emotion engine that recognizes the user's emotions and provides appropriate feedback. This enables driving assistance that takes the user's mental state into consideration.
[0504] First, the terminal collects ambient sounds through a microphone installed in the vehicle. The collected audio data is sent to a server via the network. The server analyzes this audio data in detail and detects sound patterns such as the sirens of emergency vehicles. When an emergency is detected through this analysis, the server estimates the direction and distance of the sound source and determines the necessary driving instructions.
[0505] In addition to acoustic analysis, the emotion engine receives input from the user's facial expressions and body sensors to understand the user's current emotional state. Based on this data, the server analyzes the user's emotions and provides appropriate feedback if stress or anxiety is elevated.
[0506] For example, if the server recognizes that an emergency vehicle is approaching from behind and the emotion engine determines that the user is feeling anxious, the video generation program will convert the instruction to "turn right" into sign language in a calm and soothing tone and display it on the terminal. In this case, easier-to-understand expressions or voice guidance may also be added.
[0507] The terminal immediately displays the generated sign language video and audio feedback on the head-up display, allowing the user to perform driving maneuvers to avoid emergency vehicles based on the provided instructions.
[0508] Thus, the present invention makes it possible to provide a safe and comfortable driving environment through advanced responses that reflect the user's emotional state.
[0509] The following describes the processing flow.
[0510] Step 1:
[0511] The terminal continuously collects ambient sounds through microphones placed in the vehicle. This collected audio data is transmitted to a server in real time.
[0512] Step 2:
[0513] The server receives audio data sent from the terminal and analyzes the audio using digital signal processing technology. It identifies characteristic sounds, such as the sirens of emergency vehicles, and detects emergencies.
[0514] Step 3:
[0515] The server estimates the direction and distance based on the detected emergency sound. It analyzes the sound intensity and direction to determine the optimal driving instructions for the user.
[0516] Step 4:
[0517] The server uses an emotion engine to collect data to analyze the user's current emotional state. This data is obtained from in-car cameras and sensors, and the system estimates emotions by analyzing the user's facial expressions and physiological responses.
[0518] Step 5:
[0519] The server takes the user's emotional state into consideration and customizes driving instructions for emergencies. If the emotion engine detects the user's stress or anxiety, it generates calming sign language videos or audio guidance.
[0520] Step 6:
[0521] The generated sign language video and audio feedback are immediately transmitted from the server to the terminal. The terminal displays this on a head-up display, providing the user with driving instructions.
[0522] Step 7:
[0523] The user checks the displayed sign language video and audio feedback and operates the vehicle according to the instructions. For example, if an evasive maneuver to the right is required, the user carefully turns the steering wheel to the right.
[0524] This series of processes enables safe and accurate driving assistance while taking the user's emotions into consideration.
[0525] (Example 2)
[0526] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0527] Auditory assistance technologies often fail to adequately communicate information to the hearing impaired or those with hearing limitations in emergency situations, lacking means to provide safety and a sense of security. This invention enhances visual guidance and enables safe movement by analyzing ambient acoustic information in real time and providing appropriate information visually, taking into account emotional states.
[0528] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0529] In this invention, the server includes acoustic acquisition means for acquiring and transmitting acoustic information, acoustic analysis means for performing analysis, and emotion monitoring means for estimating emotional states. This enables users to move and respond safely while reducing stress even in emergency situations.
[0530] An "acoustic acquisition means" is a device for acquiring ambient acoustic information and transmitting it to a data processing device as needed.
[0531] "Acoustic analysis means" refers to a processing device that analyzes received acoustic information and identifies acoustic patterns that indicate a specific event.
[0532] "Emotion monitoring means" refers to a detection device and analysis system for determining a user's emotional state based on their biometric information.
[0533] A "guidance generation means" is a function that generates guidance and instructions suitable for the user based on the results of acoustic and emotional analysis.
[0534] "Display control means" refers to a control device for effectively presenting generated visual information on a display device.
[0535] "Position estimation means" refers to a technique for estimating the spatial location of a sound source using acoustic information.
[0536] A "sign language generation method" is a technology that converts audiovisual information into sign language and provides users with visual instructions.
[0537] This invention is an in-vehicle system for providing emergency information to users with hearing impairments. It analyzes acoustic information and the user's emotional state to provide appropriate instructions visually. The specific configuration for carrying out this invention is described below.
[0538] The system primarily consists of terminals and servers. First, the terminals are installed inside the vehicle and are equipped with microphones and cameras. The terminals collect ambient acoustic information and transmit it to the server in real time. Standard wireless communication technology is used for this communication.
[0539] The server is an advanced data processing device that analyzes the received acoustic data. Machine learning algorithms are used as the acoustic analysis means to identify specific acoustic patterns, such as the sound of an emergency vehicle siren. Common cloud analysis services can be used for this process. If an emergency is recognized as a result of the analysis, the server uses position estimation means to estimate the direction and distance of the sound source.
[0540] Furthermore, the device collects user facial expression data and biometric information, such as heart rate, and transmits it to the server. The server uses emotion monitoring to analyze the user's emotional state and assesses the user's stress and anxiety. Based on this emotional data, the server uses guidance generation to create calm and appropriate instructions for the user. It utilizes a generative AI model to adjust the tone and content of the instructions using prompt sentences.
[0541] Instructions are generated as sign language or visual information and projected as images onto the terminal's display device. Based on this visual information, users can safely avoid emergency vehicles. This allows users with hearing impairments to continue driving with peace of mind.
[0542] As a concrete example, suppose the device detects an ambulance siren and senses anxiety from the user's facial expression. In this case, the server displays the sign language message "Move to the right" in calm colors on the head-up display. At this time, the prompt message is set to "Generate a gentle instruction to move aside for emergency vehicles." This allows the user to respond appropriately.
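The prompt setting in this example can be sketched as a small function that turns the acoustic and emotional analysis results into a prompt sentence. The 0.0 to 1.0 stress scale, the 0.5 cutoff, the wording of the prompt, and the function name are all assumptions made for illustration; the disclosure only states that the tone and content of the instruction are adjusted through the prompt.

```python
def build_guidance_prompt(direction: str, distance_m: float, stress: float) -> str:
    """Compose a prompt sentence for a generative model.

    `stress` is a hypothetical score from 0.0 (calm) to 1.0 (highly stressed).
    """
    if not 0.0 <= stress <= 1.0:
        raise ValueError("stress must be between 0.0 and 1.0")
    # Assumed rule: ask for a gentler tone when the user appears stressed.
    tone = "gentle" if stress >= 0.5 else "concise"
    return (
        f"Generate a {tone} instruction to move aside for an emergency vehicle "
        f"approaching from the {direction}, about {distance_m:.0f} m away. "
        "Output the instruction as a sign language video."
    )
```

For instance, `build_guidance_prompt("right rear", 120.0, 0.8)` yields a prompt asking for a gentle instruction, while a low stress score yields a request for a concise one.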
[0543] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0544] Step 1:
[0545] The terminal uses an acoustic acquisition device to acquire acoustic information from the microphone as input. The collected acoustic data is subjected to noise reduction processing and output as clear audio data. This data is transmitted to the server using a communication device.
[0546] Step 2:
[0547] The server applies the acoustic analysis means to the received acoustic data to identify specific acoustic patterns, such as emergency vehicle sirens. Here, machine learning algorithms are applied and the data is compared with existing acoustic libraries to obtain highly accurate analysis output. The analysis results are passed to the next step.
[0548] Step 3:
[0549] The server estimates the spatial location of a sound source based on acoustic information. Using the analysis results as input, it calculates the direction and distance of the sound source using triangulation and outputs it as estimated location data. This data is used as the basic data for generating driving instructions.
[0550] Step 4:
[0551] The device uses emotion monitoring to collect user facial expression data and biometric information such as heart rate as input. This data is transmitted to the server in real time.
[0552] Step 5:
[0553] The server takes the collected biometric information as input and analyzes the user's emotional state using the emotion monitoring means. The output is a numerical representation of emotional states such as stress and anxiety. These analysis results are used for guidance generation.
[0554] Step 6:
[0555] The server uses a generative AI model to generate appropriate driving instructions based on acoustic and emotional data. It sets prompt sentences and creates instructions such as "Swerve to the right in a calm tone," which are then output as guidance content.
[0556] Step 7:
[0557] The terminal receives the generated instructions and outputs them as visual information to the head-up display using a display control mechanism. The user confirms this visual information and performs the driving operations appropriately. This enables safe and secure travel.
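The triangulation mentioned in Step 3 above can be sketched as the intersection of two bearing rays. The example assumes two sensing points at known positions, each reporting a bearing to the sound source measured counterclockwise from the x-axis. This arrangement, the coordinate convention, and the function name are assumptions; the disclosure does not describe how the triangulation is set up.

```python
import math


def locate_source(p1, bearing1_deg, p2, bearing2_deg):
    """Intersect two bearing rays; return (x, y) or None if they never meet."""
    d1 = (math.cos(math.radians(bearing1_deg)), math.sin(math.radians(bearing1_deg)))
    d2 = (math.cos(math.radians(bearing2_deg)), math.sin(math.radians(bearing2_deg)))
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(cross) < 1e-12:
        return None  # parallel bearings give no unique intersection
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    # Solve p1 + t1*d1 == p2 + t2*d2 for the ray parameters t1 and t2.
    t1 = (dx * d2[1] - dy * d2[0]) / cross
    t2 = (dx * d1[1] - dy * d1[0]) / cross
    if t1 < 0 or t2 < 0:
        return None  # the lines cross behind one of the sensing points
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])
```

Once the position is known, the distance from either sensing point follows directly, for example with `math.hypot(x - p1[0], y - p1[1])`.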
[0558] (Application Example 2)
[0559] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0560] Conventional driver assistance systems for the hearing impaired are specialized in detecting voice information and are unable to provide real-time driving instructions that take into account the user's emotional state. Therefore, the challenge is to provide an environment in which users can operate the vehicle safely and smoothly without feeling anxiety or stress in emergency situations.
[0561] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0562] In this invention, the server includes an acoustic acquisition means for detecting sounds indicating an emergency, an emotion analysis means for evaluating the user's emotions, and a video creation means for visualizing the instructions. This makes it possible to detect an emergency and simultaneously provide appropriate instructions in real time that correspond to the user's emotions.
[0563] "Sound acquisition means" refers to devices or mechanisms for collecting sounds from around a vehicle.
[0564] "Sound wave processing means" refers to a technology or process that analyzes collected sound data and identifies sound characteristics that indicate an emergency.
[0565] "Emotion analysis means" refers to a system or algorithm that evaluates and analyzes a user's emotional state based on their facial expressions and biometric information.
[0566] "Video creation means" refers to technologies and devices that visually represent instructions provided to the user based on analyzed information.
[0567] "Display control means" refers to technologies and mechanisms for appropriately presenting generated visual instructions on a display device inside a vehicle.
[0568] A "driver assistance system" is a general term for technical devices and programs that assist drivers in safe driving.
[0569] "Sound source estimation means" refers to technologies and devices that estimate the direction and distance of detected sound and provide that information.
[0570] "Instructional display using sign language" refers to a technology or method for generating and communicating visual instructions based on sign language to users.
[0571] "Vibration feedback" is a feedback method that uses vibration to transmit information to the user through touch.
[0572] This driver assistance system is implemented using hardware such as microphones and cameras installed in the vehicle. Specifically, the system uses acoustic acquisition means to collect ambient sounds. The collected sound data is transmitted to a server and analyzed by sound wave processing means. Through this analysis, the server detects characteristic sounds such as emergency vehicle sirens. In addition, emotion analysis means are used to analyze the user's emotional state based on data from cameras and biosensors. For example, a model utilizing TensorFlow analyzes facial expressions.
[0573] Based on the analysis results and emotional information, the video creation system generates appropriate visual instructions. The generated instructions are expressed in the form of sign language videos or vibration feedback and provided to the driver. This information is presented on the in-vehicle display device by the display control system. For example, an instruction such as "Turn right to avoid the obstacle" is displayed on the screen in a gentle tone using sign language, and vibration feedback is sent to the user's smartwatch.
[0574] In this way, drivers receive real-time safe driving assistance, allowing them to drive with peace of mind while reducing anxiety and tension. An example of a prompt message generated using the AI model is: "An emergency vehicle is approaching. Turn right gently. Generate instructions in sign language and vibration that take the user's emotions into consideration."
[0575] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0576] Step 1:
[0577] The terminal acquires ambient sounds through a microphone mounted on the vehicle. The acquired audio data is transmitted from the terminal to a server via the network. This data is primarily used as analysis material for identifying emergency sounds.
[0578] Step 2:
[0579] The server analyzes the received audio data using sound wave processing equipment. Using software such as Librosa, it extracts features from the sound wave data and detects specific acoustic patterns, such as emergency vehicle sirens. This analysis determines whether an emergency situation exists. The output provides the direction of the sound source and the presence or absence of an emergency.
[0580] Step 3:
[0581] The server acquires the user's biometric data from a camera or smartwatch. Using the emotion analysis means, it analyzes the user's emotional state from facial expression data with a deep learning model built with a framework such as TensorFlow. The output of the analysis is the user's emotional state (e.g., stress level and anxiety level).
[0582] Step 4:
[0583] The server visualizes the instructions using video creation tools based on the results of acoustic analysis and emotional state data. The video creation process generates gentle sign language videos appropriate to the user's emotional state. The output is video data containing the instructions.
[0584] Step 5:
[0585] The terminal displays the generated video data on the vehicle's head-up display. Using a display control system, the user is presented with instructions in a gentle tone, translated into sign language, and vibration feedback. As a result, the user can safely perform driving operations based on the displayed information.
[0586] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0587] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0588] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0589] [Fourth Embodiment]
[0590] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0591] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0592] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0593] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0594] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0595] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0596] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0597] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0598] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0599] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0600] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0601] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0602] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0603] This invention is a system for providing emergency information to visually impaired drivers using sound, and involves a server, a terminal, and a user.
[0604] First, the terminal continuously collects ambient sounds via microphones installed in the vehicle. The collected audio data is transmitted to a server. The server analyzes the received audio data and performs digital signal processing to identify specific sounds, such as the siren of an emergency vehicle. If this acoustic analysis detects the approach of an emergency vehicle, the system proceeds to the next stage.
[0605] From the detected acoustic data, the server estimates the direction and distance. Based on this, a video generation program within the server generates sign language video instructions for the driver. These videos include specific instructions such as "turn right," "turn left," and "stop."
[0606] The generated sign language video is transmitted directly to the terminal. The terminal immediately displays this video on the vehicle's head-up display, providing the user with visual information. Based on the displayed sign language video, the user can quickly perform driving maneuvers and take appropriate actions to safely avoid emergency vehicles.
[0607] For example, if the terminal detects the sound of an emergency vehicle's siren, the server analyzes the direction of the sound and confirms that it is approaching from the right rear. Based on this, the server generates a sign language video instructing the user to "turn right" and sends it to the terminal. The user then confirms this, appropriately moves their vehicle to the right, and yields to the emergency vehicle.
[0608] This system enables drivers with hearing impairments to respond quickly and safely to emergency vehicles, providing driving assistance.
[0609] The following describes the processing flow.
[0610] Step 1:
[0611] The terminal uses a microphone mounted on the vehicle to continuously collect ambient sounds. The collected audio data undergoes noise reduction processing before being sent to a server to improve the accuracy of sound detection.
[0612] Step 2:
[0613] The server receives audio data transmitted from the terminal. Using a digital signal processing algorithm, the server analyzes the audio data and detects specific sound patterns, such as emergency vehicle sirens.
[0614] Step 3:
[0615] When the sound of an emergency vehicle's siren is detected, the server estimates its direction and distance. This estimation utilizes changes in the siren's intensity and frequency.
[0616] Step 4:
[0617] The server assesses the urgency of the situation and determines the necessary driving instructions based on that assessment. These instructions include options such as "turn right," "turn left," and "stop," and the most appropriate one is selected depending on the situation.
[0618] Step 5:
[0619] The AI on the server generates sign language videos based on the determined driving instructions. These sign language videos use actions and symbols to express the instructions and are designed to be easily understood visually.
[0620] Step 6:
[0621] The generated sign language video is transmitted to the terminal in real time. A low-latency communication protocol is used so that the driver can respond quickly.
[0622] Step 7:
[0623] The terminal receives sign language video sent from the server and immediately displays it on the head-up display. This allows the user to visually receive driving instructions.
[0624] Step 8:
[0625] The user checks the displayed sign language video and performs driving operations accordingly. For example, if the instruction is "turn right," the user turns the steering wheel to the right and takes appropriate action to avoid an emergency vehicle.
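Step 3 above states that the estimation uses changes in the siren's intensity and frequency. The following is a minimal sketch of one way such cues could be combined, not the method of this disclosure. It assumes a stationary listener, a known nominal siren frequency, and a speed of sound of 343 m/s; the function names and these values are hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees Celsius


def radial_speed_from_doppler(observed_hz, emitted_hz,
                              speed_of_sound=SPEED_OF_SOUND_M_S):
    """Return the source's speed toward the listener in m/s (negative if receding).

    Uses the moving-source, stationary-listener Doppler relation
    observed = emitted * c / (c - v), solved for v.
    """
    if observed_hz <= 0 or emitted_hz <= 0:
        raise ValueError("frequencies must be positive")
    return speed_of_sound * (1.0 - emitted_hz / observed_hz)


def is_approaching(levels_db, observed_hz, emitted_hz):
    """Treat the siren as approaching when it is both louder and shifted upward."""
    getting_louder = len(levels_db) >= 2 and levels_db[-1] > levels_db[0]
    return getting_louder and radial_speed_from_doppler(observed_hz, emitted_hz) > 0
```

In practice the vehicle's own motion also shifts the observed frequency, so a sketch like this would only indicate the relative trend rather than an exact speed.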
[0626] (Example 1)
[0627] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0628] It is difficult for visually impaired drivers to recognize surrounding emergencies appropriately and quickly. Therefore, there is a need for effective means of providing information to warn of approaching emergency vehicles. Furthermore, conventional information identification and presentation technologies are prone to delays and inaccuracies in information, which can impair safe driving.
[0629] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0630] In this invention, the server includes an acoustic collection means for continuously collecting ambient sounds, an acoustic analysis means for analyzing the transmitted sound data and identifying specific sounds, a position estimation means for estimating the direction and distance of sounds from the analysis results, and a video generation means for generating instructions as video. This makes it possible to quickly and accurately notify visually impaired drivers of emergencies through visual information.
[0631] "Acoustic collection means" refers to a device or system for continuously collecting ambient sounds.
[0632] "Data transmission means" refers to a method or technique for transmitting collected acoustic information to another device or system.
[0633] "Acoustic analysis means" refers to a technology or device for analyzing sound data and identifying specific sounds from it.
[0634] "Position estimation means" refers to a method or system for estimating the direction and distance of a sound source.
[0635] "Video generation means" refers to a technology or device for creating specific instruction information as video.
[0636] "Display means" refers to a device or system for visually showing the generated image to the user.
[0637] This invention is a support system for quickly notifying visually impaired drivers of approaching emergency vehicles through audio and visual information.
[0638] Hardware and software for implementation
[0639] First, the terminal functions as an audio collection device installed in the vehicle. The terminal incorporates a high-sensitivity microphone to collect ambient sounds in real time. The collected audio information is transmitted to a server via a network module, using Wi-Fi or mobile communication technologies.
[0640] The server is equipped with acoustic analysis software to analyze the received audio data. This analysis utilizes acoustic signal processing libraries, with Librosa being a specific example. Using this library, the server performs frequency analysis to identify the siren sound of a particular emergency vehicle.
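The frequency analysis described above can be illustrated with a dependency-free sketch. The source names Librosa; this example swaps in the Goertzel algorithm from the Python standard library only, so that it runs without extra packages. The probed frequencies (700, 1000, and 1300 Hz) and the function names are assumptions for illustration, not values from this disclosure.

```python
import math


def goertzel_power(samples, sample_rate, target_hz):
    """Squared DFT magnitude of `samples` at the bin nearest `target_hz`."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)  # nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def siren_band_ratio(samples, sample_rate, band_hz=(700.0, 1000.0, 1300.0)):
    """Approximate share of signal energy at the probed (assumed) siren frequencies."""
    # For a pure tone on a DFT bin, |X[k]|^2 equals sum(x^2) * N / 2,
    # so this normalization yields a ratio close to 1.0 for such a tone.
    total = sum(x * x for x in samples) * len(samples) / 2.0
    if total == 0.0:
        return 0.0
    band = sum(goertzel_power(samples, sample_rate, f) for f in band_hz)
    return band / total
```

A server could compare the returned ratio against a threshold tuned on recorded sirens; the threshold itself would have to be chosen empirically.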
[0641] Based on the audio analysis results, the server also runs a position estimation algorithm to estimate the direction and distance of the sound source. This algorithm employs a technique that uses the time difference of the acoustic signals.
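As a rough illustration of a technique that uses the time difference of acoustic signals, the sketch below converts the arrival-time difference between two microphones into a bearing. It assumes a far-field source, a known microphone spacing, and a speed of sound of 343 m/s; the function name and sign convention are hypothetical. A single microphone pair cannot distinguish front from rear and does not yield distance by itself.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees Celsius


def bearing_from_tdoa(delay_s, mic_spacing_m, speed_of_sound=SPEED_OF_SOUND_M_S):
    """Angle of arrival in degrees, measured from the broadside of the microphone pair.

    `delay_s` is the arrival time at the left microphone minus the arrival time
    at the right microphone, so a positive delay means the source is to the right.
    """
    if mic_spacing_m <= 0:
        raise ValueError("microphone spacing must be positive")
    ratio = speed_of_sound * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp values pushed out of range by noise
    return math.degrees(math.asin(ratio))
```

For example, with microphones 0.5 m apart, a delay of about 0.73 ms corresponds to a bearing of roughly 30 degrees to the right.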
[0642] To visualize the instructions, the server generates driving instructions for the user as sign language videos, based on a generation AI model. By utilizing AI generation technology in this process, video generation is performed quickly.
[0643] The generated video is sent back to the terminal and displayed on the vehicle's head-up display. This allows the user to take intuitive driving actions.
[0644] Specific example
[0645] For example, if the microphone picks up the sound of an emergency vehicle's siren while the user is driving, the server's analysis indicates that the sound is approaching from the right rear. Based on this, a sign language video instructing the user to "turn right" is generated and displayed on the terminal. The user can then confirm this and quickly move their vehicle to the right to safely yield to the emergency vehicle.
[0646] Example of a prompt
[0647] "It analyzes surrounding audio data to detect the approach of emergency vehicles. Based on the direction and distance of the sound, it generates driving instructions in sign language and notifies the driver."
[0648] Based on the above, this invention can provide important support for visually impaired drivers to drive safely.
[0649] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0650] Step 1:
[0651] The terminal continuously collects ambient sounds using a high-sensitivity microphone installed in the vehicle. The input for this process is audio signals from the external environment, and the output is digital audio data. This digital audio data is temporarily stored within the terminal before being compressed and encrypted.
[0652] Step 2:
[0653] The terminal applies a compression algorithm to the collected audio data to enable efficient network transmission. Next, the compressed data is encrypted to protect its confidentiality. The output of this process is compressed and encrypted audio data in a transferable format. This data is then sent to the server using Wi-Fi or 4G / 5G communication.
[0654] Step 3:
[0655] The server decrypts and decompresses the audio data received over the network. The input is encrypted, compressed audio data, and the output is uncompressed audio data in a format that can be processed. Acoustic analysis is then performed using this data.
[0656] Step 4:
[0657] The server performs acoustic analysis on uncompressed audio data. Here, the Librosa library is used to analyze the frequency spectrum and detect emergency vehicle siren sounds. The input is uncompressed audio data, and the output is a determination of whether or not it contains sounds of a specific emergency. Further processing is then carried out based on this result.
[0658] Step 5:
[0659] When a specific emergency sound is detected, the server estimates the direction and distance of the sound source. In this step, sound localization techniques are used to calculate the direction, followed by further data processing. The input is time-delay information of the audio signal, and the output is the estimated direction and distance of the sound source.
[0660] Step 6:
[0661] The server generates instructions for the user based on estimated sound source information. Here, a generation AI model is used to generate sign language videos to visually instruct the driver. The input is the direction and distance of the sound source, and the output is video data corresponding to the instruction content.
[0662] Step 7:
[0663] The generated sign language video is transmitted to the terminal. The terminal receives the video data and displays it on the head-up display inside the vehicle. The input is video data, and the output is a visual notification to the user.
[0664] Step 8:
[0665] The user views the sign language video displayed on the head-up display and performs appropriate driving actions based on it. This allows the user to safely respond to emergency vehicles. The input is the sign language video, and the output is driving actions performed in accordance with the instructions.
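The compression half of Steps 2 and 3 can be sketched as a round trip between the terminal and the server. The source does not name a compression algorithm or a cipher, so this sketch assumes lossless zlib compression of 16-bit PCM samples and leaves encryption out entirely; the function names are hypothetical.

```python
import struct
import zlib


def pack_audio(samples):
    """Terminal side: 16-bit PCM samples -> compressed payload (encryption omitted)."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return zlib.compress(raw, level=6)


def unpack_audio(payload):
    """Server side: compressed payload -> list of 16-bit PCM samples."""
    raw = zlib.decompress(payload)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))
```

In a real deployment the payload would additionally be encrypted before transmission and decrypted before `unpack_audio` is called, as the steps above describe.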
[0666] (Application Example 1)
[0667] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0668] Drivers with visual or hearing impairments face challenges in accurately recognizing and responding quickly and safely to emergencies occurring in their surroundings. Therefore, there is a need for systems that accurately notify drivers of important situations, such as the approach of emergency vehicles, and support safe driving.
[0669] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0670] In this invention, the server includes detection means for detecting sound, analysis means for analyzing the characteristics of the sound, and generation means for visualizing the instruction content based on the analysis results. This enables drivers with visual or auditory limitations to accurately perceive the situation and perform appropriate driving operations.
[0671] A "sound detection means" refers to a device or system that senses ambient sounds and acquires them as digital data.
[0672] "Analysis means for analyzing the characteristics of the sound" refers to a device or system that is responsible for the process of analyzing the characteristics of acquired audio data and identifying specific sounds or patterns.
[0673] "Generating means for visualizing instruction content based on analysis results" refers to a device or system that generates information in a format that visually conveys the instructions derived from the analyzed audio data to the driver.
[0674] "Estimation means for estimating direction and distance" refers to a device or system that measures the direction and distance of a sound source from audio data and presents that positional relationship to the driver.
[0675] "Generating visual information using sign language instructions" means expressing specific instructions as images using sign language as a visual method.
[0676] This invention is a system in which a terminal installed in a vehicle senses ambient sounds while the vehicle is in motion, uses audio analysis to identify situations such as the approach of an emergency vehicle, and provides visual information to the driver.
[0677] First, the terminal collects ambient sounds around the vehicle. Specifically, it acquires surrounding audio data using the microphone of a smartphone or a dedicated device.
[0678] Next, the server receives the audio data and analyzes its characteristics. Using speech recognition technologies such as Google Cloud Speech-to-Text API and Amazon Transcribe, it analyzes the audio signal in real time. This allows it to recognize and identify the sound of emergency vehicle sirens.
[0679] Based on the analysis results, the instructions are visualized using a generative AI model. A deep learning framework such as TensorFlow is used to generate instructional videos, including sign language. These videos are transmitted to smartphones and other display devices. The driver then uses this visual information to perform appropriate driving maneuvers.
[0680] For example, if the sound of an emergency vehicle's siren is approaching from the right rear, the server analyzes the data and generates a specific sign language video indicating "turn right," which is then displayed on the terminal. This information allows the driver to quickly move their vehicle to the right and yield to the emergency vehicle.
[0681] An example of a prompt message would be: "Use your smartphone's microphone to analyze ambient sounds, identify emergency vehicle sirens, and generate sign language video based on the direction of the sound."
[0682] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0683] Step 1:
[0684] The terminal uses the microphone of a smartphone or other device to collect ambient sounds. The input is an audio signal, and the output is audio data. Specifically, the terminal captures noise inside the vehicle and external sounds and converts them into a digital format. This data is transmitted to the server in real time.
[0685] Step 2:
[0686] The server analyzes the received audio data. The input is the audio data generated in step 1, and the output is the identification of specific patterns or sounds within the audio. Here, the data is analyzed using tools such as the Google Cloud Speech-to-Text API to identify sounds like emergency vehicle sirens. Specifically, this involves analyzing the frequency characteristics and volume changes within the data.
[0687] Step 3:
[0688] The server generates instructional videos based on the analysis results. The input is the sound identification result obtained in step 2, and the output is instructional videos using sign language. A generative AI model built with a deep learning framework such as TensorFlow is used to visualize specific instructions and generate videos. Specifically, the system renders sign language frames based on the analysis results.
[0689] Step 4:
[0690] The server sends the generated instruction video to the terminal. The input is the visual information created in step 3, and the output is the video that can be viewed on the terminal. Specifically, the terminal projects this video onto a display device. This provides the user with real-time visual instructions.
[0691] Step 5:
[0692] The user performs driving operations based on sign language video displayed on the terminal. The input is the sign language video displayed in step 4, and the output is the change in the user's driving behavior. Specifically, actions are taken to ensure safety, such as appropriately changing lanes in response to the approach of an emergency vehicle.
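Step 2 above mentions analyzing volume changes within the audio data. A minimal sketch of that idea follows: it computes the RMS level of consecutive frames and fits a least-squares slope, where a positive slope suggests the sound is getting louder. The frame length and function names are assumptions for illustration.

```python
import math


def frame_rms(samples, frame_len):
    """RMS level of each full, non-overlapping frame."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def volume_trend(levels):
    """Least-squares slope of level against frame index; positive means louder."""
    n = len(levels)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(levels) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(levels))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

A rising trend alone does not identify a siren; it would be combined with the frequency analysis that the same step describes.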
[0693] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0694] This invention combines a system that uses sound to provide emergency information to the hearing impaired with an emotion engine that recognizes the user's emotions and provides appropriate feedback. This enables driving assistance that takes the user's mental state into consideration.
[0695] First, the terminal collects ambient sounds through a microphone installed in the vehicle. The collected audio data is sent to a server via the network. The server analyzes this audio data in detail and detects sound patterns such as the sirens of emergency vehicles. When an emergency is detected through this analysis, the server estimates the direction and distance of the sound source and determines the necessary driving instructions.
[0696] In addition to acoustic analysis, the emotion engine receives input from the user's facial expressions and body sensors to understand the user's current emotional state. Based on this data, the server analyzes the user's emotions and provides appropriate feedback if stress or anxiety is elevated.
[0697] For example, if the server recognizes that an emergency vehicle is approaching from behind and the emotion engine determines that the user is feeling anxious, the video generation program converts the "turn right" instruction into sign language with a calm and soothing tone and displays it on the terminal. In this case, easier-to-understand expressions or voice guidance may also be added.
[0698] The terminal immediately displays the generated sign language video on the head-up display and outputs the audio feedback, allowing the user to perform driving maneuvers to avoid emergency vehicles based on the provided instructions.
[0699] Thus, the present invention makes it possible to provide a safe and comfortable driving environment through advanced responses that reflect the user's emotional state.
[0700] The following describes the processing flow.
[0701] Step 1:
[0702] The terminal continuously collects ambient sounds through microphones placed in the vehicle. This collected audio data is transmitted to a server in real time.
[0703] Step 2:
[0704] The server receives audio data sent from the terminal and analyzes the audio using digital signal processing technology. It identifies characteristic sounds, such as the sirens of emergency vehicles, and detects emergencies.
[0705] Step 3:
[0706] The server estimates the direction and distance based on the detected emergency sound. It analyzes the sound intensity and direction to determine the optimal driving instructions for the user.
[0707] Step 4:
[0708] The server uses the emotion engine to analyze the user's current emotional state. The input data is obtained from in-car cameras and sensors, and the emotion engine estimates emotions by analyzing the user's facial expressions and physiological responses.
[0709] Step 5:
[0710] The server takes the user's emotional state into consideration and customizes driving instructions for emergencies. If the emotion engine detects the user's stress or anxiety, it generates calming sign language videos or audio guidance.
[0711] Step 6:
[0712] The generated sign language video and audio feedback are immediately transmitted from the server to the terminal. The terminal displays the sign language video on the head-up display and outputs the audio feedback, providing the user with driving instructions.
[0713] Step 7:
[0714] The user checks the displayed sign language video and audio feedback and operates the vehicle according to the instructions. For example, if an evasive maneuver to the right is required, the user carefully turns the steering wheel to the right.
[0715] This series of processes enables safe and accurate driving assistance while taking the user's emotions into consideration.
[0716] (Example 2)
[0717] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0718] Auditory assistance technologies often fail to adequately communicate information to the hearing impaired or those with hearing limitations in emergency situations, lacking means to provide safety and a sense of security. This invention enhances visual guidance and enables safe movement by analyzing ambient acoustic information in real time and providing appropriate information visually, taking into account emotional states.
[0719] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0720] In this invention, the server includes acoustic collection means for collecting and transmitting acoustic information, acoustic analysis means for performing analysis, and emotion monitoring means for estimating emotional states. This enables users to move and respond safely while reducing stress even in emergency situations.
[0721] An "acoustic acquisition means" is a device for acquiring ambient acoustic information and transmitting it to a data processing device as needed.
[0722] "Acoustic analysis means" refers to a processing device that analyzes received acoustic information and identifies acoustic patterns that indicate a specific event.
[0723] "Emotion monitoring means" refers to a detection device and analysis system for determining a user's emotional state based on their biometric information.
[0724] A "guidance generation means" is a function that generates guidance and instructions suitable for the user based on the results of acoustic and emotional analysis.
[0725] "Display control means" refers to a control device for effectively presenting generated visual information on a display device.
[0726] "Position estimation means" refers to a technique for estimating the spatial location of a sound source using acoustic information.
[0727] A "sign language generation method" is a technology that converts audiovisual information into sign language and provides users with visual instructions.
[0728] This invention is an in-vehicle system for providing emergency information to users with hearing impairments. It analyzes acoustic information and the user's emotional state to provide appropriate instructions visually. The specific configuration for carrying out this invention is described below.
[0729] The system primarily consists of terminals and servers. First, the terminals are installed inside the vehicle and are equipped with microphones and cameras. The terminals collect ambient acoustic information and transmit it to the server in real time. Standard wireless communication technology is used for this communication.
[0730] The server is an advanced data processing device that analyzes the received acoustic data. Machine learning algorithms are used as the acoustic analysis means to identify specific acoustic patterns, such as the sound of an emergency vehicle siren. Common cloud analysis services can be used for this process. If an emergency is recognized as a result of the analysis, the server uses position estimation means to estimate the direction and distance of the sound source.
[0731] Furthermore, the terminal collects the user's facial expression data and biometric information, such as heart rate, and transmits them to the server. The server uses the emotion monitoring means to analyze the user's emotional state and assess the user's stress and anxiety. Based on this emotional data, the server uses the guidance generation means to create calm and appropriate instructions for the user. The guidance generation means uses a generative AI model and adjusts the tone and content of the instructions through prompt sentences.
[0732] Instructions are generated as sign language or visual information and projected as images onto the terminal's display device. Based on this visual information, users can safely avoid emergency vehicles. This allows users with hearing impairments to continue driving with peace of mind.
[0733] As a concrete example, suppose the terminal picks up an ambulance siren and the user's facial expression indicates anxiety. In this case, the server displays the sign language message "Move to the right" in calm colors on the head-up display. At this time, the prompt message is set to "Generate a gentle instruction to move aside for emergency vehicles." This allows the user to respond appropriately.
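The way a prompt sentence could be adjusted to the user's emotional state can be sketched as follows. The 0.0 to 1.0 score scale, the 0.6 threshold, the "clear" tone for a calm user, and all names are assumptions for illustration; only the "gentle" prompt wording and the use of calm colors come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionState:
    stress: float   # assumed scale: 0.0 (calm) to 1.0 (highly stressed)
    anxiety: float  # assumed scale: 0.0 (calm) to 1.0 (highly anxious)


def build_guidance_prompt(instruction, emotion, calm_threshold=0.6):
    """Build a prompt sentence whose tone depends on the estimated emotion."""
    needs_calming = max(emotion.stress, emotion.anxiety) >= calm_threshold
    tone = "gentle" if needs_calming else "clear"
    prompt = f"Generate a {tone} instruction to {instruction} for emergency vehicles."
    if needs_calming:
        prompt += " Use calm colors."
    return prompt
```

Calling `build_guidance_prompt("move aside", EmotionState(stress=0.8, anxiety=0.7))` yields a prompt that begins with the sentence quoted in the example above.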
[0734] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0735] Step 1:
[0736] The terminal uses the acoustic acquisition means to acquire acoustic information from the microphone as input. The collected acoustic data is subjected to noise reduction processing and output as clear audio data. This data is transmitted to the server using a communication device.
[0737] Step 2:
[0738] The server applies the acoustic analysis means to the received acoustic data to identify specific acoustic patterns, such as emergency vehicle sirens. Here, machine learning algorithms are applied and the data is compared with existing acoustic libraries to obtain highly accurate analysis output. The analysis results are passed to the next step.
[0739] Step 3:
[0740] The server estimates the spatial location of a sound source based on acoustic information. Using the analysis results as input, it calculates the direction and distance of the sound source using triangulation and outputs it as estimated location data. This data is used as the basic data for generating driving instructions.
[0741] Step 4:
[0742] The terminal uses the emotion monitoring means to collect the user's facial expression data and biometric information, such as heart rate, as input. This data is transmitted to the server in real time.
[0743] Step 5:
[0744] The server takes the collected biometric information as input and analyzes the user's emotional state using the emotion monitoring means. The output is a numerical representation of emotional states such as stress and anxiety. These analysis results are used for guidance generation.
[0745] Step 6:
[0746] The server uses a generative AI model to generate appropriate driving instructions based on acoustic and emotional data. It sets prompt sentences and creates instructions such as "Swerve to the right in a calm tone," which are then output as guidance content.
[0747] Step 7:
[0748] The terminal receives the generated instructions and outputs them as visual information to the head-up display using the display control means. The user confirms this visual information and performs the driving operations appropriately. This enables safe and secure travel.
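Step 3 above calculates the sound source location using triangulation. One common form of triangulation intersects two bearing lines measured from two known positions, sketched below. The coordinate convention (x to the right, y forward, bearings clockwise from forward), the sensor positions, and the function name are assumptions; the source does not specify how the bearings are obtained.

```python
import math


def triangulate(p1, bearing1_deg, p2, bearing2_deg):
    """Intersect two bearing lines and return the (x, y) of the source.

    Bearings are measured clockwise from the +y (forward) axis.
    Returns None when the bearings are parallel and no intersection exists.
    """
    d1 = (math.sin(math.radians(bearing1_deg)), math.cos(math.radians(bearing1_deg)))
    d2 = (math.sin(math.radians(bearing2_deg)), math.cos(math.radians(bearing2_deg)))
    # Solve p1 + t * d1 = p2 + u * d2 for t using Cramer's rule.
    det = d2[0] * d1[1] - d1[0] * d2[1]
    if abs(det) < 1e-9:
        return None
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    t = (d2[0] * ry - rx * d2[1]) / det
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])
```

The distance to the source then follows from the returned point, for example with `math.hypot(x, y)` when the vehicle is at the origin.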
[0749] (Application Example 2)
[0750] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0751] Conventional driver assistance systems for the hearing impaired are specialized in detecting voice information and are unable to provide real-time driving instructions that take into account the user's emotional state. Therefore, the challenge is to provide an environment in which users can operate the vehicle safely and smoothly without feeling anxiety or stress in emergency situations.
[0752] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0753] In this invention, the server includes an acoustic acquisition means for detecting sounds indicating an emergency, an emotion analysis means for evaluating the user's emotions, and a video creation means for visualizing the instructions. This makes it possible to detect an emergency and simultaneously provide appropriate instructions in real time that correspond to the user's emotions.
[0754] "Acoustic acquisition means" refers to devices or mechanisms for collecting sounds from around a vehicle.
[0755] "Sound wave processing means" refers to a technology or process that analyzes collected sound data and identifies sound characteristics that indicate an emergency.
[0756] "Emotion analysis means" refers to a system or algorithm that evaluates and analyzes a user's emotional state based on their facial expressions and biometric information.
[0757] "Video creation means" refers to technologies and devices that visually represent instructions provided to the user based on analyzed information.
[0758] "Display control means" refers to technologies and mechanisms for appropriately presenting generated visual instructions on a display device inside a vehicle.
[0759] A "driver assistance system" is a general term for technical devices and programs that assist drivers in safe driving.
[0760] "Sound source estimation means" refers to technologies and devices that estimate the direction and distance of detected sound and provide that information.
[0761] "Instructional display using sign language" refers to a technology or method for generating and communicating visual instructions based on sign language to users.
[0762] "Vibration feedback" is a feedback method that uses vibration to transmit information to the user through touch.
[0763] This driver assistance system is implemented using hardware such as microphones and cameras installed in the vehicle. Specifically, the system uses the sound acquisition means to collect ambient sounds. The collected sound data is transmitted to a server and analyzed by the sound wave processing means. Through this analysis, the server detects characteristic sounds such as emergency vehicle sirens. In addition, the emotion analysis means analyzes the user's emotional state based on data from cameras and biosensors. For example, a model built with TensorFlow analyzes facial expressions.
[0764] Based on the analysis results and emotional information, the video creation means generates appropriate visual instructions. The generated instructions are expressed as sign language videos or vibration feedback and provided to the driver. This information is presented on the in-vehicle display device by the display control means. For example, an instruction such as "Turn right to avoid the obstacle" is displayed on the screen in sign language with a gentle tone, and vibration feedback is sent to the user's smartwatch.
[0765] In this way, drivers receive real-time safe driving assistance, allowing them to drive with peace of mind while reducing anxiety and tension. An example of a prompt message generated using the AI model is: "An emergency vehicle is approaching. Turn right gently. Generate instructions in sign language and vibration that take the user's emotions into consideration."
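One way such a prompt could be assembled from the analysis results is sketched below in Python. The `Situation` fields, the emotion label, and the `build_prompt` helper are hypothetical names introduced only for illustration; the fixed wording follows the example prompt quoted above.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    event: str    # e.g. "An emergency vehicle is approaching"
    action: str   # e.g. "Turn right gently"
    emotion: str  # hypothetical label from the emotion analysis, e.g. "anxious"


def build_prompt(s: Situation) -> str:
    """Compose a prompt in the style of the example in paragraph [0765]."""
    return (
        f"{s.event}. {s.action}. "
        "Generate instructions in sign language and vibration that take the "
        f"user's emotions into consideration. Current emotional state: {s.emotion}."
    )


if __name__ == "__main__":
    print(build_prompt(Situation(
        event="An emergency vehicle is approaching",
        action="Turn right gently",
        emotion="anxious",
    )))
```

Appending the estimated emotional state is an assumption of this sketch; the disclosure only states that the instructions take the user's emotions into consideration.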
[0766] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0767] Step 1:
[0768] The terminal acquires ambient sounds through a microphone mounted on the vehicle. The acquired audio data is transmitted from the terminal to a server via the network. This data is primarily used as analysis material for identifying emergency sounds.
[0769] Step 2:
[0770] The server analyzes the received audio data using the sound wave processing means. Using software such as Librosa, it extracts features from the sound wave data and detects specific acoustic patterns, such as emergency vehicle sirens. This analysis determines whether an emergency situation exists. The output is the direction of the sound source and the presence or absence of an emergency.
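The detection in Step 2 can be illustrated with a simple band-energy check. The Python sketch below deliberately uses the Goertzel recurrence from the standard library in place of Librosa, so that it runs without third-party packages. The 700 to 1600 Hz siren band, the 50 Hz probe spacing, and the 0.5 threshold are assumptions chosen for illustration; actual siren frequencies vary by country and vehicle type.

```python
import math


def goertzel_power(samples, sample_rate, freq_hz):
    """Power of one frequency component, computed with the Goertzel recurrence."""
    n = len(samples)
    k = round(n * freq_hz / sample_rate)  # nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def looks_like_siren(samples, sample_rate, band=(700.0, 1600.0),
                     step_hz=50.0, threshold=0.5):
    """True when the assumed siren band holds more than `threshold` of the
    power measured at the probed frequencies between 100 Hz and 4 kHz."""
    count = int((4000.0 - 100.0) / step_hz) + 1
    probes = [100.0 + i * step_hz for i in range(count)]
    powers = {f: goertzel_power(samples, sample_rate, f) for f in probes}
    total = sum(powers.values())
    if total == 0.0:
        return False  # silence
    in_band = sum(p for f, p in powers.items() if band[0] <= f <= band[1])
    return in_band / total > threshold


if __name__ == "__main__":
    rate = 16000
    tone = [math.sin(2.0 * math.pi * 1000.0 * t / rate) for t in range(1600)]
    print(looks_like_siren(tone, rate))
```

A steady tone inside the band also passes this check, so a practical detector would additionally examine how the dominant frequency sweeps over time.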
[0771] Step 3:
[0772] The server acquires the user's biometric data from a camera or smartwatch. Using the emotion analysis means, it analyzes the user's emotional state from facial expression data with a deep learning model built on a framework such as TensorFlow. The output is the user's emotional state (e.g., stress level and anxiety level).
[0773] Step 4:
[0774] The server visualizes the instructions using the video creation means, based on the results of the acoustic analysis and the emotional state data. The video creation process generates gentle sign language videos appropriate to the user's emotional state. The output is video data containing the instructions.
[0775] Step 5:
[0776] The terminal displays the generated video data on the vehicle's head-up display. Through the display control means, the user is presented with gently toned instructions rendered in sign language, together with vibration feedback. As a result, the user can safely perform driving operations based on the displayed information.
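The hand-off from Steps 2 and 3 to Steps 4 and 5 can be sketched as a small decision function. In the Python example below, the field names, the 0.6 stress threshold, the tone labels, and the vibration patterns are all hypothetical; the disclosure does not specify how the emotional state is mapped to a presentation style.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Analysis:
    emergency: bool       # Step 2: presence or absence of an emergency
    direction_deg: float  # Step 2: 0 = ahead, positive = right (assumed convention)
    stress: float         # Step 3: assumed to be normalized to 0.0-1.0


@dataclass
class Presentation:
    sign_text: str           # text to be rendered as a sign language video
    tone: str                # hypothetical style label for the video
    vibration_ms: List[int]  # alternating on/off pulse lengths for a smartwatch


def plan_presentation(a: Analysis) -> Optional[Presentation]:
    """Choose what to show and how, or None when there is no emergency."""
    if not a.emergency:
        return None
    if a.direction_deg > 0:
        side = "from the right"
    elif a.direction_deg < 0:
        side = "from the left"
    else:
        side = "from straight ahead"
    stressed = a.stress >= 0.6  # assumed threshold
    return Presentation(
        sign_text=f"An emergency vehicle is approaching {side}.",
        tone="gentle" if stressed else "standard",
        vibration_ms=[200, 400, 200] if stressed else [100, 100, 100, 100, 100],
    )


if __name__ == "__main__":
    print(plan_presentation(Analysis(emergency=True, direction_deg=40.0, stress=0.8)))
```

Keeping this decision separate from video generation would let the same plan drive both the head-up display and the vibration device.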
[0777] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0778] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0779] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0780] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0781] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. In the upper and lower directions of the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. Also, the upper side of the concentric circles is where "pleasant" emotions are located, and the lower side is where "unpleasant" emotions are located. In this way, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0782] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0783] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion lies, the more visible (expressed in actions) it becomes.
[0784] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
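The layout of the emotion map described above can be made concrete with a small geometric sketch. The Python example below assumes a unit-radius map with x pointing right and y pointing up, and an inner/outer boundary at radius 0.5. These coordinates and the cut-off are assumptions for illustration; the disclosure describes the regions only qualitatively.

```python
import math


def describe_position(x: float, y: float) -> dict:
    """Classify a point on an assumed unit-radius emotion map (x right, y up)."""
    radius = math.hypot(x, y)
    # Clock position: 12 o'clock is straight up, hours increase clockwise.
    angle_from_top = math.degrees(math.atan2(x, y)) % 360.0
    hour = round(angle_from_top / 30.0) % 12 or 12
    return {
        # Left half: reaction ("response") region; right half: "situation" region.
        "side": "situation" if x > 0 else "reaction" if x < 0 else "boundary",
        # Upper side: pleasant emotions; lower side: unpleasant emotions.
        "valence": "pleasant" if y > 0 else "unpleasant" if y < 0 else "neutral",
        # Inside: inner thoughts; outside: emotions expressed in actions.
        "expression": "inner" if radius < 0.5 else "acted out",
        "clock": hour,
    }


if __name__ == "__main__":
    # A point at the 3 o'clock position near the outer edge of the map.
    print(describe_position(0.9, 0.0))
```

For the point (0.9, 0.0) the sketch reports the situation side, the 3 o'clock position, and an outwardly expressed emotion, matching the description of the right half of the map.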
[0785] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0786] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
[0787] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0788] In the above embodiment, an example was given in which the specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and the specific process may be distributed across multiple computers, including computer 22. For example, the data generation model 58 may be provided in a device external to the data processing device 12, and the external device may generate data according to the input data.
[0789] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0790] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0791] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0792] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0793] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0794] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0795] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0796] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0797] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0798] The following is further disclosed regarding the embodiments described above.
[0799] (Claim 1)
[0800] Acoustic detection means for detecting sounds indicating an emergency,
[0801] Acoustic analysis means for analyzing the characteristics of the sounds indicating the emergency,
[0802] Video generation means for visualizing instructions based on the analysis results, and
[0803] Display control means for projecting the visualized instructions onto a display device of a vehicle;
[0804] A system comprising the above means.
[0805] (Claim 2)
[0806] The system according to claim 1, further comprising estimation means for estimating the direction and distance of sounds indicating an emergency.
[0807] (Claim 3)
[0808] The system according to claim 1, characterized in that the video generation means generates instructional videos using sign language.
[0809] "Example 1"
[0810] (Claim 1)
[0811] Sound collection means for continuously collecting ambient sounds,
[0812] Data transmission means for transmitting the collected sound data over a network,
[0813] Acoustic analysis means for analyzing the transmitted sound data and identifying a specific sound,
[0814] Position estimation means for estimating the direction and distance of the sound from the analysis results,
[0815] Video generation means for generating instruction content as a video based on the estimation, and
[0816] Display means for rendering the generated video on a display unit inside the vehicle;
[0817] A system comprising the above means.
[0818] (Claim 2)
[0819] The system according to claim 1, characterized in that the generated video is created using a generative model that enables diverse forms of expression.
[0820] (Claim 3)
[0821] The system according to claim 1, wherein the video generation means operates for the purpose of instructing the user on appropriate driving operations.
[0822] "Application Example 1"
[0823] (Claim 1)
[0824] Detection means for detecting sound,
[0825] Analysis means for analyzing the characteristics of the sound,
[0826] Generation means for visualizing instruction content based on the analysis results, and
[0827] Control means for displaying the visualized instruction content on a display device of an electronic device;
[0828] A system comprising the above means.
[0829] (Claim 2)
[0830] The system according to claim 1, further comprising estimation means for estimating direction and distance.
[0831] (Claim 3)
[0832] The system according to claim 1, characterized in that the generation means generates visual information for instructions using sign language.
[0833] "Example 2 of combining an emotion engine"
[0834] (Claim 1)
[0835] Acoustic collection means for collecting ambient acoustic information and transmitting it to a data processing device using communication technology,
[0836] Acoustic analysis means for analyzing the received acoustic information and identifying acoustic patterns that indicate an event,
[0837] Emotion monitoring means for detecting a user's biometric information and estimating the user's emotional state,
[0838] Guidance generation means for generating appropriate guidance content based on the analysis results and the emotional state, and
[0839] Display control means for presenting the generated guidance content as visual information on a display device;
[0840] A system comprising the above means.
[0841] (Claim 2)
[0842] The system according to claim 1, further comprising a position estimation means for estimating the spatial position of a sound source based on acoustic information.
[0843] (Claim 3)
[0844] The system according to claim 1, characterized by including sign language generation means for generating visual information using sign language.
[0845] "Application example 2 when combining with an emotional engine"
[0846] (Claim 1)
[0847] Sound acquisition means for detecting sounds indicating an emergency,
[0848] Sound wave processing means for analyzing the characteristics of the sounds indicating the emergency,
[0849] Emotion analysis means for analyzing the user's emotions,
[0850] Video creation means for visualizing instructions based on the analysis results and the emotional evaluation, and
[0851] Display control means for displaying the video on a display device of the vehicle;
[0852] A driver assistance system comprising the above means.
[0853] (Claim 2)
[0854] The driver assistance system according to claim 1, further comprising sound source estimation means for estimating the direction and distance of a sound indicating an emergency.
[0855] (Claim 3)
[0856] The driver assistance system according to claim 1, characterized in that the video creation means creates instruction displays using sign language and generates vibration feedback based on emotional evaluation.
[Explanation of Symbols]
[0857]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot
Claims
1. A system comprising: detection means for detecting sound; analysis means for analyzing the characteristics of the sound; generation means for visualizing instruction content based on the analysis results; and control means for displaying the visualized instruction content on a display device of an electronic device.
2. The system according to claim 1, further comprising estimation means for estimating direction and distance.
3. The system according to claim 1, characterized in that the generation means generates visual information for instructions using sign language.