system

A system analyzes baby cries to generate personalized audio and video content using AI, addressing the challenge of calming crying babies and reducing caregiver stress through real-time feedback loops.

JP2026100697A · Pending · Publication Date: 2026-06-19 · SOFTBANK GROUP CORP


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-09
Publication Date: 2026-06-19

Application Information


IPC: G10L25/51; G10L19/00; G10L25/48; G10L21/028; G10L21/12; G10L13/02; G10L19/02; G10L13/04; G10L13/08; G10L21/18; G10L25/00; G10L19/16; G10L13/06; G10L13/00

AI Tagging

Application Domain

Speech synthesis




AI Technical Summary

Technical Problem

Caregivers face challenges in effectively addressing a baby's crying, particularly for novice caregivers, as it requires time and experience to identify efficient calming methods tailored to individual babies, leading to stress and suboptimal childcare environments.

Method used

A system that collects and analyzes baby voice data to extract features, compares them with past data, and generates personalized sounds and images using AI to soothe crying babies, incorporating user feedback to improve the generation algorithm.

Benefits of technology

Reduces caregiver burden and creates a more comfortable environment by providing individualized responses to a baby's crying, leveraging AI to generate and playback optimized audio and video content based on real-time analysis and feedback.

✦ Generated by Eureka AI based on patent content.


Abstract

[Problem] To provide a system. [Solution] A system comprising: voice input means for collecting a baby's voice data; data processing means for analyzing the voice data and extracting feature quantities; matching means for matching against a database based on the feature quantities; generation means for generating appropriate sound and video based on the matching results; and playback means for outputting the generated sound and video.


Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Since there are various reasons why a baby cries, caregivers may face great stress in dealing with this. Also, it takes time and experience to identify an efficient way to stop a baby from crying, which is particularly difficult for novice caregivers. In such a situation, there is a need to provide support for stopping a baby from crying in a way that can accommodate individual babies.

Means for Solving the Problems

[0005] This invention includes data processing means for collecting and analyzing a baby's voice data to extract features. It also includes matching means that compare the features with past data, and generation means that generate optimal sounds and images based on the comparison. The generated sounds and images are output by playback means. By improving the generation algorithm using user feedback, the system provides individualized support for soothing crying babies.

[0006] "Voice input means" refers to a device or function used to collect sounds such as a baby crying or ambient sounds.

[0007] "Data processing means" refers to a device or program used to analyze collected audio data and extract characteristic features of crying sounds.

[0008] "Matching means" refers to a device or function used to compare extracted features with past data.

[0009] "Generation means" refers to a device or program that generates the optimal sound or image to soothe a crying baby based on the matching results.

[0010] "Playback means" refers to a device or function that physically outputs the generated sound or image for a baby to hear or see.

[0011] "Feedback information" refers to information that users input after observing their baby's reactions, in order to report what they observed.

Brief Description of the Drawings

[0012]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map to which a plurality of emotions are mapped.
[Figure 10] An emotion map to which a plurality of emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

Mode for Carrying Out the Invention

[0013] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0014] First, the language used in the following description will be explained.

[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as a "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0016] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0017] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0018] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface including a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means A alone, B alone, or a combination of A and B. The same interpretation applies in this specification when three or more items are linked by "and/or."

[0020] [First Embodiment]

[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0029] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0033] This invention is a system that collects, records, and analyzes a baby's cries and soothes the crying baby with content generated from that analysis. It is implemented by installing the associated program on each device.

[0034] The terminal is placed near the baby and collects audio data, including crying sounds, using a microphone. The collected data first undergoes noise reduction and audio normalization on the terminal and is then sent to the server.

[0035] The server analyzes the received audio data and extracts features of crying, such as pitch, intensity, and pattern. Next, these features are compared with a vast amount of past baby crying data stored in a cloud-based database. Based on the comparison results, the server identifies sound and video patterns that are considered effective in calming crying babies, drawing from similar past cases.

[0036] Based on the identified patterns, the server uses generative AI technology to generate audio and video content tailored to the baby's preferences. The generated content is then sent back to the terminal.

[0037] The terminal plays the generated sounds and images to the baby. The parent or guardian observes the baby's reaction to the played sounds and images and determines whether they were effective. These observations are sent as feedback to the server through the terminal's interface.

[0038] This feedback information is used to further improve the server's AI generation algorithm, with the aim of providing a personalized calming method for each baby through repeated use. The overall process reduces the burden on caregivers and creates a more comfortable environment for babies.

[0039] The following describes the processing flow.

[0040] Step 1:

[0041] The terminal uses a microphone placed in the environment where the baby is present to collect the baby's cries. Ambient sounds are recorded along with the cries and stored as data.

[0042] Step 2:

[0043] The terminal performs noise reduction and audio normalization on the recorded audio data. This prepares the data for analysis.
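
The pre-processing in Step 2 can be illustrated with a minimal sketch. The specification does not name a particular noise reduction or normalization method, so the amplitude gate and peak normalization below, along with the function names and the default threshold, are assumptions chosen only for illustration.

```python
def noise_gate(samples, threshold=0.02):
    """Crude noise reduction: zero out samples quieter than the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def peak_normalize(samples, target_peak=0.9):
    """Scale the signal so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent input: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def preprocess(samples, threshold=0.02, target_peak=0.9):
    """Apply the gate first so residual noise does not influence the gain."""
    return peak_normalize(noise_gate(samples, threshold), target_peak)
```

A practical terminal would more likely use spectral noise suppression from an audio library; the gate is used here only because it keeps the example self-contained.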

[0044] Step 3:

[0045] The terminal sends the pre-processed audio data to the server. The data is transmitted over a communication network so that it reaches the server in real time.
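
As one possible illustration of Step 3, the sketch below packs pre-processed samples into a JSON message and unpacks them on the receiving side. The message fields (baby_id, sample_rate, pcm16) and the 16-bit PCM encoding are hypothetical; the specification does not define a wire format or transport protocol.

```python
import base64
import json
import struct


def encode_audio_message(samples, sample_rate, baby_id):
    """Pack float samples in [-1, 1] as 16-bit little-endian PCM in a JSON message."""
    # Clamp to [-1, 1], then scale; int() truncates toward zero.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    pcm = struct.pack(f"<{len(ints)}h", *ints)
    return json.dumps({
        "baby_id": baby_id,          # hypothetical identifier for the baby
        "sample_rate": sample_rate,
        "pcm16": base64.b64encode(pcm).decode("ascii"),
    })


def decode_audio_message(message):
    """Server-side inverse: return (baby_id, sample_rate, float samples)."""
    body = json.loads(message)
    pcm = base64.b64decode(body["pcm16"])
    ints = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    return body["baby_id"], body["sample_rate"], [i / 32767 for i in ints]
```

The resulting string could be sent over any transport the network supports; the choice of transport is outside the scope of this sketch.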

[0046] Step 4:

[0047] The server analyzes the received audio data and extracts audio features (such as pitch, intensity, and temporal variation). Each feature is then quantified as digital data.
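
The feature extraction in Step 4 might look like the following sketch, which quantifies intensity as RMS amplitude, pitch via an autocorrelation peak, and temporal variation as the spread of intensity across frames. These particular estimators, the frame size, and the 200 to 1000 Hz search range are assumptions for illustration; the specification only names the feature categories.

```python
import math


def rms(frame):
    """Root-mean-square amplitude, used here as the intensity feature."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def estimate_pitch(frame, sample_rate, fmin=200.0, fmax=1000.0):
    """Estimate the fundamental frequency from the strongest autocorrelation lag."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def extract_features(samples, sample_rate, frame_size=400):
    """Summarize a recording as pitch, intensity, and intensity variation."""
    if len(samples) < frame_size:
        raise ValueError("recording is shorter than one frame")
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    intensities = [rms(f) for f in frames]
    pitches = [estimate_pitch(f, sample_rate) for f in frames]
    mean_intensity = sum(intensities) / len(intensities)
    variation = math.sqrt(
        sum((x - mean_intensity) ** 2 for x in intensities) / len(intensities))
    return {
        "pitch_hz": sum(pitches) / len(pitches),
        "intensity": mean_intensity,
        "variation": variation,
    }
```

Each returned value is a plain number, which corresponds to the step's statement that every feature is quantified as digital data.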

[0048] Step 5:

[0049] The server compares the extracted features with past baby data stored in the database. It searches for data with similar patterns and identifies the most suitable response.
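
One simple way to realize the similarity search in Step 5 is a nearest-neighbor lookup over feature vectors, sketched below. The scaled Euclidean distance, the record layout, and the scale values in the usage example are assumptions; the specification does not state which similarity measure is used.

```python
import math


def distance(a, b, scales):
    """Euclidean distance after dividing each feature by a scale,
    so that features with different units contribute comparably."""
    return math.sqrt(sum(((a[k] - b[k]) / scales[k]) ** 2 for k in scales))


def find_similar_cases(query, past_cases, scales, top_k=1):
    """Return the top_k past cases whose features are closest to the query."""
    ranked = sorted(past_cases,
                    key=lambda case: distance(query, case["features"], scales))
    return ranked[:top_k]


# Hypothetical past cases; the "response" field records what soothed the baby.
past_cases = [
    {"features": {"pitch_hz": 350.0, "intensity": 0.2}, "response": "lullaby"},
    {"features": {"pitch_hz": 450.0, "intensity": 0.6}, "response": "white noise"},
]
scales = {"pitch_hz": 100.0, "intensity": 0.5}
best = find_similar_cases({"pitch_hz": 440.0, "intensity": 0.55}, past_cases, scales)
```

With the sample data above, the closest case is the second record, so its stored response would be the starting point for content generation.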

[0050] Step 6:

[0051] The server uses database information and AI generation technology to generate sounds and images suitable for calming a crying baby. The generated results are then output as digital content.

[0052] Step 7:

[0053] The server transfers the generated audio and video data to the terminal. During this process, various protocols are used to ensure data reliability.

[0054] Step 8:

[0055] The terminal plays the received sound and video to the baby. Playback is performed through the speaker and display.

[0056] Step 9:

[0057] Users observe the effects of the played sounds and images on the baby, thereby evaluating the effectiveness of the sounds and images.

[0058] Step 10:

[0059] The user enters the baby's response into the terminal as feedback. The feedback is entered through the UI and sent to the server.

[0060] Step 11:

[0061] The server updates its AI generation algorithm based on the received feedback information so that it can generate more effective sound and video. Repeating this update improves the system's accuracy.
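
The specification does not describe how the generation algorithm is updated from feedback. As a deliberately simplified stand-in, the sketch below keeps a per-pattern effectiveness score for one baby and nudges it toward 1.0 or 0.0 after each feedback report (an exponential moving average). The function names, the neutral starting score of 0.5, and the learning rate are all assumptions.

```python
def update_scores(scores, pattern_id, was_effective, learning_rate=0.2):
    """Return a new score table with one pattern's score moved toward
    1.0 (the baby calmed down) or 0.0 (no effect)."""
    previous = scores.get(pattern_id, 0.5)  # unseen patterns start neutral
    target = 1.0 if was_effective else 0.0
    updated = dict(scores)
    updated[pattern_id] = previous + learning_rate * (target - previous)
    return updated


def best_pattern(scores):
    """Pick the pattern with the highest effectiveness score so far."""
    return max(scores, key=scores.get)


scores = {}
scores = update_scores(scores, "lullaby", was_effective=True)
scores = update_scores(scores, "white_noise", was_effective=False)
preferred = best_pattern(scores)
```

A score table like this could steer which patterns are passed to the generative model for a given baby; updating the model itself would require a training procedure that the source does not specify.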

[0062] (Example 1)

[0063] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0064] There is a need to reduce the burden on caregivers regarding their babies' crying and to provide the most appropriate response for each individual baby. However, conventional technology has not been sufficient to effectively analyze crying and provide appropriate content. Therefore, the challenge is to realize an efficient system that collects and analyzes babies' crying and generates and plays content optimized to soothe them.

[0065] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0066] In this invention, the server includes a sound acquisition means for collecting sound, an information processing means for analyzing the sound and extracting features, and a correspondence means for matching the features with stored information. This makes it possible to efficiently respond to a baby's crying, generate and play appropriate content, and reduce the burden on caregivers.

[0067] "Sound acquisition means" refers to devices or functions for collecting sounds from the surrounding environment.

[0068] "Information processing means" refers to the technologies and processes used to analyze collected sounds and extract features from them.

[0069] "Correspondence means" refers to a function that checks the extracted features against the accumulated information and compares them.

[0070] "Information generation means" refers to technology that generates new content such as audio and video based on the results of the correspondence.

[0071] "Information presentation means" refers to devices or functions that allow users to view generated content.

[0072] The "function to remove noise and shape sound" refers to technology that removes unwanted noise from audio and prepares it as accurate audio data.

[0073] "Evaluation information" refers to reactions and feedback data obtained from users, and is used to improve generation technology.

[0074] This invention relates to a system that collects and analyzes a baby's cries and plays back generated content to the baby in order to respond effectively to the crying. The system consists mainly of two components: a terminal and a server.

[0075] The terminal is placed near the baby and collects ambient sounds using the sound acquisition means. Here, it is important to use a microphone that captures sound with high sensitivity and accuracy. The collected audio data is first processed on the terminal by an audio processing library that performs noise reduction and audio normalization. This allows crying to be captured accurately even in noisy environments.

[0076] The initially processed audio data is transmitted to the server via the network. The server analyzes this audio using information processing tools and extracts features. The analysis software used is a speech recognition engine, which performs time-frequency analysis of the audio waveform. This clearly extracts features such as the pitch, intensity, and pattern of crying sounds.
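
To make the notion of time-frequency analysis concrete, the sketch below splits a waveform into overlapping frames and computes the magnitude spectrum of each frame with a naive discrete Fourier transform. This is only an illustration: the source does not say how its speech recognition engine performs the analysis, and a real implementation would use a windowed FFT from a signal processing library. The frame size and hop length are arbitrary choices.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the first N/2 DFT bins (naive O(N^2) transform)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def spectrogram(samples, frame_size=64, hop=32):
    """One magnitude spectrum per overlapping frame (time on the outer axis)."""
    return [dft_magnitudes(samples[start:start + frame_size])
            for start in range(0, len(samples) - frame_size + 1, hop)]


def dominant_frequencies(samples, sample_rate, frame_size=64, hop=32):
    """Frequency of the strongest bin in each frame, in Hz."""
    result = []
    for magnitudes in spectrogram(samples, frame_size, hop):
        k = max(range(len(magnitudes)), key=magnitudes.__getitem__)
        result.append(k * sample_rate / frame_size)
    return result
```

Tracking the dominant frequency frame by frame gives a rough view of how the pitch of a cry changes over time, which is one way to describe its pattern.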

[0077] After extracting the features, the server compares them with a cloud-based database using the correspondence means. Here, an SQL database system is used to identify similar cases by matching the features against a vast amount of historical data. Based on the effective patterns found through this matching, information generation means utilizing generative AI technology generate appropriate audio and video.

[0078] The generated content is sent back to the terminal and played back to the baby via the information presentation means. Here, speakers and displays are used to provide effective sounds and images that capture the baby's attention. For example, if the baby is calmed by classical music, the prompt "Generate a calming melody based on classical music" can be entered into the generative AI.
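
A prompt such as the one quoted above could be assembled from the matched pattern, as in the sketch below. The dictionary keys (style, traits) and the function name are hypothetical, and the call to the generative model itself is omitted because the source does not identify a specific model or API.

```python
def build_prompt(matched_pattern):
    """Turn a matched soothing pattern into a text prompt for a generative model."""
    prompt = f"Generate a calming melody based on {matched_pattern['style']}"
    traits = matched_pattern.get("traits", [])
    if traits:
        # Optional characteristics are appended as a comma-separated list.
        prompt += " with the following characteristics: " + ", ".join(traits)
    return prompt + "."


prompt = build_prompt({"style": "classical music", "traits": ["gentle tone"]})
```

Keeping prompt construction separate from the model call makes it straightforward to log which prompt produced which content when feedback arrives later.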

[0079] Users observe their baby's reaction to the generated content and send feedback to the server via the device's interface. This feedback information is used as evaluation data to improve the generation algorithm on the server, enabling more personalized responses. In this way, an optimal and comfortable childcare environment can be provided for each individual baby.

[0080] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0081] Step 1:

[0082] The terminal is placed near the baby and collects ambient sounds using a microphone. The input is ambient sound, including the baby's crying. The terminal uses an audio processing library to denoise and normalize this audio data. The output is denoised and normalized crying data.

[0083] Step 2:

[0084] This initially processed audio data is sent to a server via the internet. The server receives this audio data as input and performs analysis using a speech recognition engine. During the analysis, features of the crying sound, specifically pitch, intensity, and pattern, are extracted. The output here is the extracted features.

[0085] Step 3:

[0086] The server matches the extracted features against a database in the cloud. The input is the extracted features, which are compared and matched against similar historical data using an SQL database system. This process identifies effective audio and video patterns from similar cases. The output is the identified audio and video patterns.
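
Since this step mentions an SQL database system, the matching can be sketched with an SQL query that orders past cases by their squared distance to the extracted features. The table name, columns, sample rows, and the pitch scaling factor are all invented for illustration; an in-memory SQLite database stands in for the cloud database.

```python
import sqlite3


def create_case_db():
    """Build an in-memory database with a few hypothetical past cases."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE past_cases ("
        "id INTEGER PRIMARY KEY, pitch_hz REAL, intensity REAL, "
        "audio_pattern TEXT, video_pattern TEXT)"
    )
    conn.executemany(
        "INSERT INTO past_cases (pitch_hz, intensity, audio_pattern, video_pattern) "
        "VALUES (?, ?, ?, ?)",
        [
            (350.0, 0.2, "lullaby", "slow mobile"),
            (450.0, 0.6, "white noise", "soft lights"),
            (550.0, 0.8, "heartbeat", "gentle waves"),
        ],
    )
    return conn


def match_patterns(conn, pitch_hz, intensity, limit=1):
    """Return (audio_pattern, video_pattern) rows of the closest past cases.
    Pitch differences are divided by 100 Hz (10000 after squaring) so that
    pitch and intensity contribute on a similar scale."""
    return conn.execute(
        """
        SELECT audio_pattern, video_pattern
        FROM past_cases
        ORDER BY (pitch_hz - ?) * (pitch_hz - ?) / 10000.0
               + (intensity - ?) * (intensity - ?)
        LIMIT ?
        """,
        (pitch_hz, pitch_hz, intensity, intensity, limit),
    ).fetchall()


conn = create_case_db()
matches = match_patterns(conn, pitch_hz=440.0, intensity=0.55)
```

Ordering by distance inside the query scans every row, which is acceptable for a sketch; a large historical database would need an indexing or approximate search strategy that the source does not discuss.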

[0087] Step 4:

[0088] Based on the matching results, the server uses a generative AI model to generate new audio and video. The input is the identified pattern, from which the generative AI model creates content suitable for the baby. The output is the generated audio and video content.

[0089] Step 5:

[0090] The generated content is sent back to the terminal and played there. The terminal uses its speaker and display to present this content to the baby. The output is the audio and video being played, and the goal is to soothe the crying baby.

[0091] Step 6:

[0092] The user observes the baby's reaction to the played content and sends the results back to the server as feedback through the terminal's interface. The input is the baby's reaction information, which is used to improve the server's generative AI algorithm. The output is the improved generation algorithm, which can be used to generate more effective content.

[0093] (Application Example 1)

[0094] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0095] In modern society, reducing the burden of childcare is a crucial issue. In particular, effective and individualized approaches are needed to address a baby's crying, but traditional methods have their limitations. Furthermore, a baby's persistent crying is a source of parental stress, highlighting the need for improvements to the childcare environment. To address these challenges, a system is needed that can accurately analyze a baby's crying and provide appropriate responses instantly.

[0096] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0097] In this invention, the server includes an acoustic input means for collecting baby's voice information, an information processing means for analyzing the voice information and extracting attributes, and a matching means for matching the attributes with an information recording medium. This makes it possible to quickly analyze a baby's cries and respond appropriately to each individual baby.

[0098] "Acoustic input means" refers to a device used to collect sound information from the baby's surroundings, and includes microphones and sound sensors.

[0099] "Information processing means" refers to a device that has the function of extracting attributes from collected audio information, and utilizes digital signal processing technology.

[0100] "Matching means" refers to a device for comparing extracted attributes against existing information recording media.

[0101] "Generation means" refers to a device that has the function of generating optimal audio or video based on the matching results, and utilizes generative AI technology.

[0102] "Presentation means" refers to a device that presents generated audio and video to the baby, and includes speakers and displays.

[0103] A "control system" is a system for managing this series of processes and controlling the coordinated operation of each device.

[0104] An "automated machine" is a device such as a robot that automatically generates and presents sounds and images based on a baby's cries.

[0105] "Noise reduction" is a technique for removing unwanted noise from audio information and extracting clear speech.

[0106] "Speech standardization" is the process of organizing speech information collected under different circumstances and environments into a consistent format.

[0107] "Evaluation information" refers to reaction data and feedback collected from users, which is used to improve the generation process.

[0108] "Generation procedure" refers to the steps of the process or algorithm used to generate audio or video.

[0109] The system that implements this application collects and analyzes a baby's cries, and generates and presents appropriate content to soothe the baby. The entire system operates through the cooperation of three parties: the server, the terminal, and the user.

[0110] The server plays a central role in receiving and processing the baby's voice information. The collected voice information undergoes noise reduction and voice standardization on the server, and voice attributes are extracted by the information processing means. These extracted attributes are then compared with information recording media to generate optimal audio and video. A generative AI model is used as the generation means, creating content tailored to the baby's preferences based on specific prompts.

[0111] The terminal is placed near the baby and collects crying sounds using the acoustic input means. The terminal transmits the collected audio information to the server, then receives the generated content from the server and presents it to the baby. Speakers and displays are used as the presentation means.

[0112] Users observe the baby's reactions via their device and send evaluation information to the server. This evaluation information is used to improve the generation process.

[0113] For example, if a baby starts crying in the middle of the night, the terminal detects the crying and sends it to the server. A generative AI model then generates calming melodies or videos with nature sounds, which are played back to the baby. An example of a prompt would be, "Analyze the baby's crying and generate a calming sound with the following characteristics: gentle tone, calming melody, nature sounds."
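
How the crying is detected is not described in the source. A minimal sketch, assuming a simple loudness-based trigger, is shown below: it reports crying when the RMS level stays above a threshold for several consecutive frames. The threshold, frame size, and frame count are arbitrary, and loudness alone cannot distinguish crying from other loud sounds, so this should be read as a placeholder for a proper classifier.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_crying(samples, frame_size=160, threshold=0.1, min_frames=3):
    """Return True if at least min_frames consecutive frames exceed the threshold."""
    run = 0
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        if frame_rms(samples[start:start + frame_size]) >= threshold:
            run += 1
            if run >= min_frames:
                return True
        else:
            run = 0  # a quiet frame resets the streak
    return False
```

Requiring several consecutive loud frames keeps a single short noise, such as a dropped toy, from triggering an upload to the server.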

[0114] This system provides content optimized for each individual baby, reducing the burden on caregivers and enabling the creation of a more comfortable childcare environment.

[0115] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0116] Step 1:

[0117] The terminal collects crying sounds using the acoustic input means placed near the baby. The input at this time is audio information, including the baby's crying. The collected audio information is denoised and standardized within the terminal. The output is denoised and standardized audio data.

[0118] Step 2:

[0119] The terminal sends the processed audio data to the server. The input is the audio data obtained by the terminal, and the same audio data is sent to the server. By the time the audio data arrives at the server, it is ready for analysis.

[0120] Step 3:

[0121] The server analyzes the received audio data using information processing tools and extracts audio attributes. The input is audio data, and the output is audio attributes such as pitch, intensity, and pattern. The server uses digital signal processing technology to detect audio attributes in detail.

[0122] Step 4:

[0123] The server compares the extracted audio attributes with the information recording medium. The input is the audio attributes, and the output is the result of matching them with similar attributes referenced from the information recording medium. Using this result, the server determines what kind of content is appropriate.

[0124] Step 5:

[0125] The server generates optimal audio and video content using a generative AI model based on the matching results. The input is the matching results, and prompts are used to generate content suitable for babies. The output is the generated audio and video. Specific prompts reflecting concrete examples are used during generation.

[0126] Step 6:

[0127] The server sends the generated audio and video to the terminal. The input is the generated content, and the output is also that content. The transmitted content is ready for playback on the terminal.

[0128] Step 7:

[0129] The terminal plays back the received audio and video to the baby using the presentation means. The input is the content data, and the output is the audio and video that the baby perceives and reacts to. The speaker and display perform the actual playback.

[0130] Step 8:

[0131] The user observes the baby's reactions, collects evaluation information, and feeds it back to the server. The input is the baby's reaction data, and the output is evaluation information. The server uses this information to improve the generation procedure.

[0132] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0133] This invention provides a system that collects and analyzes baby cries, generating and outputting appropriate sounds and images based on the analysis results, combined with an emotion engine that recognizes the user's emotions. This enables optimal responses that take into account not only the baby's emotional state but also the caregiver's emotional state.

[0134] The device uses a microphone to acquire audio data in order to properly capture the baby's cries. In addition, it simultaneously collects the user's voice and facial expression data using a camera.

[0135] The audio data is sent to a server where the characteristics of crying sounds are extracted. These characteristics are compared with past cases to formulate an appropriate response. The server also uses an emotion engine to analyze the user's emotions and identify the level of stress and changes in mood.

[0136] The server generates sounds and images optimized not only for the baby's condition but also for the user's emotions. For example, if the server determines that the user is fatigued, it will select sounds that have a more relaxing effect. These results are then sent to the device.
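
The adjustment described above can be sketched as a small rule table that modifies content settings according to the caregiver's estimated state. Everything in this sketch is hypothetical: the `ContentSettings` fields, the state names "fatigued" and "stressed", and the numeric limits are assumptions chosen for illustration, not values taken from the specification.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ContentSettings:
    """Hypothetical parameters passed to the content generator."""
    tempo_bpm: int = 80
    volume: float = 0.6
    mood: str = "soothing"


def adjust_for_caregiver(settings, caregiver_state):
    """Return settings adapted to the caregiver's estimated state."""
    if caregiver_state == "fatigued":
        # Slower, more relaxing material for a tired caregiver.
        return replace(
            settings, tempo_bpm=min(settings.tempo_bpm, 60), mood="relaxing"
        )
    if caregiver_state == "stressed":
        # Quieter playback for a stressed caregiver.
        return replace(
            settings, volume=min(settings.volume, 0.4), mood="calming"
        )
    # Unknown or neutral states leave the baby-oriented settings unchanged.
    return settings
```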

[0137] The device plays the audio and video near the baby so that they reach the baby effectively. The user observes the effect after playback and provides feedback through the device's interface. This feedback is returned to the server and used to improve the generation algorithm.

[0138] For example, when a parent is dealing with their baby's nighttime crying, the system can analyze the baby's cries and the parent's stress level, and then provide relaxing videos accompanied by calming music. In this way, the entire system can reduce the psychological burden on both the baby and the caregiver, and optimize the childcare environment.

[0139] The following describes the processing flow.

[0140] Step 1:

[0141] The device uses a microphone to collect the baby's cries and simultaneously records the user's facial expressions and voice using a camera. This data is stored in both audio and video formats.

[0142] Step 2:

[0143] The device performs noise reduction and normalization on the collected audio data to generate clean data. This data is prepared for analysis while retaining the characteristics of a baby's crying.
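
One possible form of this preprocessing is sketched below, assuming the audio is available as a list of float samples in the range -1.0 to 1.0. The three stages (DC offset removal, a simple noise gate, and peak normalization) and the threshold value are assumptions for illustration; the specification does not name the noise reduction or normalization method.

```python
def remove_dc_offset(samples):
    """Subtract the mean so the waveform is centered on zero."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def noise_gate(samples, threshold):
    """Zero out samples whose magnitude is below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def peak_normalize(samples, target_peak=1.0):
    """Scale the signal so its largest magnitude equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    scale = target_peak / peak
    return [s * scale for s in samples]


def preprocess(samples, gate_threshold=0.02):
    """Hypothetical device-side pipeline: center, gate, then normalize."""
    if not samples:
        return []
    gated = noise_gate(remove_dc_offset(samples), gate_threshold)
    return peak_normalize(gated)
```

A noise gate is the crudest form of noise reduction; it is used here only because it keeps the louder crying segments intact while discarding low-level background sound.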

[0144] Step 3:

[0145] The device sends pre-processed baby voice data and user video data to the server. The data is transmitted using a secure communication protocol.

[0146] Step 4:

[0147] When the server receives baby voice data, it uses a voice analysis algorithm to extract features of the crying. These features include elements such as pitch, intensity, and rhythm.

[0148] Step 5:

[0149] The server analyzes the user's video data and uses an emotion engine to identify the user's emotional state (e.g., joy, fatigue, stress). This information is used to adjust the response to the baby.

[0150] Step 6:

[0151] The server considers the baby's features and the user's emotional state and matches them against a database. This allows it to compare with similar past cases and plan the generation of appropriate sounds and images.

[0152] Step 7:

[0153] The server uses generative AI technology to generate sounds and images optimized to soothe a baby. This generation reflects the user's categorized emotional information. For example, if the user is fatigued, calming music and relaxation videos will be generated.

[0154] Step 8:

[0155] The server transmits the generated audio and video to the terminal. Secure data transfer is also performed during this process.

[0156] Step 9:

[0157] The device plays back received sounds and images towards the baby. Playback is performed using a dedicated audio system and display, encouraging the baby to respond.

[0158] Step 10:

[0159] The user observes the baby's reaction during and after playback and evaluates its effectiveness. The evaluation results are entered into the device as feedback.

[0160] Step 11:

[0161] The device sends user-entered feedback to the server. This feedback is used to further improve the generation algorithm and contribute to improving the quality of future deliverables.

[0162] (Example 2)

[0163] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0164] In childcare, it is necessary to comprehensively consider both the baby's cries and emotional state, as well as the caregiver's emotional state. However, conventional techniques only address the baby's condition, making it difficult to alleviate the caregiver's mental burden. Therefore, there is a need for a system that allows for appropriate intervention considering the conditions of both the baby and the caregiver.

[0165] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0166] In this invention, the server includes acoustic data collection means for collecting acoustic data, information processing means for analyzing the acoustic data and extracting features, and video analysis means for collecting user video data and analyzing emotions. This makes it possible to generate appropriate sounds and videos that take into account the state of both the baby and the caregiver.

[0167] "Sound acquisition means" refers to devices or systems for acquiring baby cries and other sound information.

[0168] "Information processing means" refers to a device or system that has the function of analyzing acquired acoustic data and extracting characteristic information from it.

[0169] The "recording section" is a database that stores past acoustic data and feature quantities, and holds information that should be cross-referenced.

[0170] A "matching means" is a system that has the function of comparing extracted feature quantities with data stored in the recording unit and evaluating the match or similarity.

[0171] "Video analysis means" refers to a system that detects the user's facial expressions and movements to determine their emotional state.

[0172] "Generation means" refers to technologies and systems for creating appropriate audio and video content based on analysis results.

[0173] "Means of expression" refers to a device or apparatus for providing generated audio and video content to users or babies.

[0174] "Noise reduction" is the process of removing unwanted background noise and other sounds from audio data to obtain clear audio information.

[0175] "Standardization" is the process of adjusting the volume and quality of audio data to a certain standard.

[0176] "Evaluation information" refers to feedback provided by users, including their reactions to the generated content and information on areas for improvement.

[0177] "Improving the generation method" refers to the process of optimizing the audio and video generation algorithms and systems based on evaluation information.

[0178] This invention is a system for optimizing the childcare environment, providing appropriate sound and visuals by analyzing the baby's cries and the caregiver's emotional state. Its main components include a terminal, a server, and a user.

[0179] The device is equipped with a microphone to collect the baby's cries and a camera to record the caregiver's facial expressions. This device is designed to acquire and process crying and video data in real time. The acquired data is encrypted and transmitted to a server via the internet.

[0180] The server has an information processing device that extracts feature quantities from the acoustic data and compares them with the database in the recording unit. Furthermore, the server analyzes the caregiver's video data and evaluates their emotional state using an emotion engine. This allows the baby's condition and the caregiver's emotions to be considered simultaneously, and an AI model is used to create optimal acoustic and video content.

[0181] The generated content is sent from the server to the device and played near the baby. This aims to soothe the baby and reduce stress for caregivers by providing appropriate media.

[0182] For example, if a baby starts crying in the middle of the night, the system sends crying data to a server and also analyzes the caregiver's stress level. As a result, if the baby is hungry and the caregiver is exhausted, calming music and videos encouraging feeding will be played on the device.

[0183] Examples of prompts for a generative AI model include the following:

[0184] "Please tell me how to develop an algorithm that automatically provides appropriate responses based on a baby's crying."

[0185] "Please provide information on methods for creating content that takes into account the emotional state of caregivers."

[0186] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0187] Step 1:

[0188] The device uses a microphone to collect acoustic data of the baby crying. This acoustic data is acquired as a raw audio signal. The collected acoustic data is temporarily stored in the device's internal memory and preprocessed, for example by audio cleanup and noise reduction. The output is noise-reduced, volume-normalized acoustic data.

[0189] Step 2:

[0190] The device uses its camera to collect the user's facial expression data. This yields still images or video data from which the user's current emotional state can be assessed. The facial expression data is classified into basic emotion labels (e.g., joy, surprise, sadness) using a facial recognition algorithm. The output of this process is tag information indicating the user's emotional state.
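
The final part of Step 2, turning classifier output into a tag, might look like the sketch below. It assumes a hypothetical upstream classifier that returns one score per emotion label; the classifier itself, the function name `emotion_tag`, and the confidence threshold are not part of the specification.

```python
def emotion_tag(scores, min_confidence=0.5):
    """Pick the highest-scoring emotion label, or "unknown" if unsure.

    `scores` maps label names (e.g. "joy", "surprise", "sadness") to
    scores in [0, 1] produced by a hypothetical facial-expression model.
    """
    if not scores:
        return {"emotion": "unknown", "confidence": 0.0}
    label, confidence = max(scores.items(), key=lambda item: item[1])
    if confidence < min_confidence:
        return {"emotion": "unknown", "confidence": confidence}
    return {"emotion": label, "confidence": confidence}
```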

[0191] Step 3:

[0192] The device sends pre-processed acoustic data and user emotion state tags to the server. The data is securely transferred to the server via the internet. When this data arrives at the server as input, the server begins the process of extracting features from the acoustic data. The output is stored on the server as a feature vector.
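
A message carrying the preprocessed audio and the emotion tag could be serialized as in the following sketch. The JSON field names, the 16-bit PCM encoding, and the function names are assumptions for illustration; the specification does not define a message format. Transport security (for example TLS) is outside the scope of this sketch.

```python
import base64
import json
import struct


def build_payload(samples, sample_rate, emotion_tag):
    """Encode float samples (-1.0..1.0) and an emotion tag as JSON text."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    # Little-endian signed 16-bit PCM, then base64 so it fits in JSON.
    pcm = struct.pack(f"<{len(samples)}h", *(int(s * 32767) for s in clipped))
    return json.dumps({
        "sample_rate": sample_rate,
        "emotion_tag": emotion_tag,
        "audio_pcm16_b64": base64.b64encode(pcm).decode("ascii"),
    })


def parse_payload(text):
    """Server-side inverse of build_payload."""
    message = json.loads(text)
    pcm = base64.b64decode(message["audio_pcm16_b64"])
    ints = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    return {
        "sample_rate": message["sample_rate"],
        "emotion_tag": message["emotion_tag"],
        "samples": [v / 32767 for v in ints],
    }
```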

[0193] Step 4:

[0194] The server matches the extracted feature vectors with the recording unit's database. A machine learning algorithm is used for this matching to identify the closest crying pattern from past cases. The output of this matching process is data indicating the type of crying, and is labeled with classifications such as "hungry" or "uncomfortable."
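
The specification does not say which machine learning algorithm performs this matching. As one simple possibility, the sketch below uses a nearest-neighbor lookup: the stored case whose feature vector is closest (by Euclidean distance) to the new vector supplies the label. The function name and the example vectors are hypothetical, and the features are assumed to be scaled to comparable ranges beforehand.

```python
import math


def nearest_case(feature_vector, cases):
    """Return (label, distance) of the stored case closest to the input.

    `cases` is a list of (vector, label) pairs representing past crying
    patterns held in the recording unit (hypothetical layout).
    """
    if not cases:
        raise ValueError("no stored cases to match against")
    vector, label = min(
        cases, key=lambda case: math.dist(feature_vector, case[0])
    )
    return label, math.dist(feature_vector, vector)


# Usage with made-up, pre-scaled (pitch, intensity) vectors.
past_cases = [
    ((0.9, 0.8), "hungry"),
    ((0.3, 0.4), "uncomfortable"),
]
label, distance = nearest_case((0.85, 0.7), past_cases)
```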

[0195] Step 5:

[0196] The server integrates classification labels for audio data with the user's emotional state and uses a generative AI model to generate appropriate audio and video content. For example, if a baby's crying indicates "hunger" and the user is "fatigued," the server will generate calming music and a video encouraging breastfeeding. The output consists of playable audio and video files.
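
The integration of the two inputs into a prompt for the generative AI model can be sketched as plain string construction. The template wording and the function name are assumptions for illustration; the specification gives no template, and no call to any particular generative AI service is shown.

```python
def build_generation_prompt(cry_label, caregiver_emotion):
    """Combine the cry classification and caregiver state into one prompt."""
    return (
        f"The baby's cry was classified as '{cry_label}' and the caregiver "
        f"appears '{caregiver_emotion}'. Generate calming audio and a short "
        "video suited to both the baby and the caregiver."
    )


prompt = build_generation_prompt("hungry", "fatigued")
```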

[0197] Step 6:

[0198] The server sends the generated audio and video files to the device. The device receives these files and plays the audio and video near the baby. The user can review the content and adjust the volume as needed during playback.

[0199] Step 7:

[0200] Users observe their baby's reaction to the provided content and provide feedback through the device's interface. This input data is sent back to the server and used to improve the generative AI model. The feedback contributes to improving the accuracy of future content generation.

[0201] (Application Example 2)

[0202] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0203] Conventional systems for dealing with crying babies only respond to the baby's voice and have the drawback of not being able to take into account the emotional state of the caregiver at the time. As a result, even if appropriate sounds and images are provided based on the baby's condition, they may not adequately address the reduction of the caregiver's stress. This problem needs to be solved, and the entire childcare environment needs to be improved.

[0204] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0205] In this invention, the server includes a recognition means for recognizing the emotional state of the caregiver, an adjustment means for adjusting the generated acoustic and visual information based on the emotional information obtained by the recognition means, and a sound acquisition means for collecting the baby's voice information. This makes it possible to provide appropriate acoustic and visual information that takes into account both the baby's state and the caregiver's emotions.

[0206] A "sound acquisition device" is a mechanism used to collect the sounds of a baby crying, and is a device that has the function of accurately capturing sound data.

[0207] An "information processing means" is a mechanism that analyzes collected audio information and extracts feature information; it is a device that extracts necessary features from audio data and obtains information useful for subsequent processing.

[0208] A "comparison means" is a mechanism for comparing extracted feature information with a recording medium to determine whether there is a match or a difference.

[0209] A "generation means" is a mechanism for creating appropriate acoustic and visual information based on the matching results, and is a device that creates content to improve the user experience.

[0210] A "playback mechanism" is a device for presenting generated audio and visual information to the user.

[0211] A "recognition device" is a mechanism designed to read the emotional state of a caregiver, and it is a device that analyzes facial expressions and voice tone to determine their psychological state.

[0212] A "regulation mechanism" is a system for optimizing generated acoustic and visual information based on the emotional information of the caregiver.

[0213] This invention provides an implementation system that offers optimal acoustic and visual information based on the conditions of both the baby and the caregiver. This section describes the hardware and software responsible for the system's operation.

[0214] To collect the baby's cries, the server receives audio data acquired by the terminal. As a means of acquiring audio, the terminal is equipped with a highly sensitive microphone that accurately captures the baby's voice while suppressing ambient noise and other sounds.

[0215] The acquired audio data is sent to a server and analyzed using information processing tools. This process utilizes Google® Cloud Speech-to-Text, advanced speech recognition software, to extract characteristic audio information. The information processing tools also include functions for suppressing background noise and standardizing the audio.

[0216] The server uses the extracted feature information to compare it with data stored on the recording medium. The comparison means analyzes the patterns of the baby's cries and determines the appropriate response.

[0217] Furthermore, the server uses recognition means to analyze video data transmitted from the terminal in order to recognize the emotional state of the caregiver. IBM Watson® Tone Analyzer is used for this emotion analysis. The emotional information obtained by the recognition means is used to customize the acoustic and visual information generated by the adjustment means.

[0218] The Adobe Premiere Pro API is used to generate audio and visual information, creating content tailored to the baby's and caregiver's condition. The generated content is transmitted to the device and output by the playback mechanism to the area around the baby.

[0219] As a concrete example, when a baby starts crying, the system uses a microphone to collect the sound and sends it to a server for analysis. Simultaneously, a camera captures the caregiver's facial expressions and analyzes their emotional state. Based on the analysis results, music that helps the baby relax and videos that reduce the caregiver's stress are generated.

[0220] An example of a prompt message is: "The baby is crying. Please immediately generate relaxing music and videos, and provide them, adjusting them according to the caregiver's stress level."

[0221] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0222] Step 1:

[0223] The device uses a microphone to capture the baby's cries. It receives ambient sound data as input and uses noise reduction technology to clearly extract the baby's cries. This processing yields audio data in a specific frequency band.

[0224] Step 2:

[0225] The device sends the acquired audio data to the server. The server uses Google Cloud Speech-to-Text to analyze the audio data and extract feature information. This generates specific data such as the pattern and volume of the voice, and the duration of the crying.

[0226] Step 3:

[0227] The server compares the extracted feature information with the database on the recording medium. Using the feature information as input, it determines the corresponding action by comparing it with past crying patterns. This results in obtaining appropriate music and video pattern data.

[0228] Step 4:

[0229] The device uses a camera to capture the caregiver's facial expressions in real time and sends the video data to a server. The server uses IBM Watson Tone Analyzer to analyze the emotional state. It receives facial expression data as input and outputs information about the user's emotional state (e.g., stress level and fatigue level).

[0230] Step 5:

[0231] The server integrates voice and emotion analysis results and generates customized audio and visual information using the Adobe Premiere Pro API. Based on prompts, it utilizes a generative AI model to create relaxation music and video content suitable for babies and caregivers.

[0232] Step 6:

[0233] The device receives the generated audio and visual information and outputs it through its speaker and display. This provides soothing music and images around the baby, allowing the user to experience psychological relaxation.

[0234] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0235] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0236] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0237] [Second Embodiment]

[0238] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0239] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0240] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0241] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0242] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0243] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0244] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0245] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0246] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0247] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0248] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0249] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0250] This invention is a system that collects and analyzes a baby's cries and soothes the crying baby with content generated from the analysis. It is implemented by incorporating the related programs into each device.

[0251] The device is placed near the baby and collects audio data, including crying sounds, using a microphone. This collected data undergoes initial processing internally for noise reduction and audio normalization before being sent to a server.

[0252] The server analyzes the received audio data and extracts features of crying, such as pitch, intensity, and pattern. Next, these features are compared with a vast amount of past baby crying data stored in a cloud-based database. Based on the comparison results, the server identifies sound and video patterns that are considered effective in calming crying babies, drawing from similar past cases.

[0253] Based on the identified patterns, the server uses generative AI technology to generate special audio and video content tailored to the baby's preferences. This generated content is then sent back to the device.

[0254] The device plays generated sounds and images towards the baby. The parent or guardian observes the baby's reaction to the played sounds and images and determines whether they were effective. These observations are sent as feedback to the server through the device's interface.

[0255] This feedback information will be used to further improve the server's AI generation algorithm, aiming to provide a personalized calming method for each baby through repeated use. This entire process will reduce the burden on caregivers and create a more comfortable environment for babies.

[0256] The following describes the processing flow.

[0257] Step 1:

[0258] The device uses a microphone placed in the environment where the baby is present to collect the baby's cries. During the collection process, ambient sounds are also recorded and stored as data.

[0259] Step 2:

[0260] The device performs noise reduction and audio normalization on the recorded audio data. This prepares the data for analysis.

[0261] Step 3:

[0262] The terminal sends pre-processed audio data to the server. A communication network is used for transmission, providing data in real time.

[0263] Step 4:

[0264] The server analyzes the received audio data and extracts audio features (such as pitch, intensity, and temporal variation). Each feature is then quantified as digital data.
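
Temporal variation, one of the features named in Step 4, can be quantified in several ways. The sketch below shows one assumed approach: compute a per-frame energy envelope, then measure the lengths of consecutive above-threshold runs, which roughly correspond to individual cry bursts. The function names, frame size, and threshold are hypothetical.

```python
import math


def frame_energies(samples, frame_size):
    """RMS energy of each non-overlapping frame of `frame_size` samples."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_size]) / frame_size)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def burst_lengths(energies, threshold):
    """Lengths, in frames, of consecutive runs at or above the threshold."""
    runs, current = [], 0
    for energy in energies:
        if energy >= threshold:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs
```

The resulting run lengths and their count give a compact numeric description of the crying rhythm that can be stored alongside pitch and intensity.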

[0265] Step 5:

[0266] The server uses the extracted features to compare them with past baby data stored in the database. It searches for data with similar patterns and identifies the optimal solution.

[0267] Step 6:

[0268] The server uses database information and AI generation technology to generate sounds and images suitable for calming a crying baby. The generated results are then output as digital content.

[0269] Step 7:

[0270] The server transfers the generated audio and video data to the terminal. During this process, various protocols are used to ensure data reliability.

[0271] Step 8:

[0272] The device plays the received sound and video towards the baby. Playback is done through the speaker and display.

[0273] Step 9:

[0274] Users observe the effects of the played sounds and images on the baby, thereby evaluating the effectiveness of the sounds and images.

[0275] Step 10:

[0276] The user inputs the baby's response as feedback into the device. The feedback is entered via the UI and sent to the server.

[0277] Step 11:

[0278] The server updates its AI generation algorithm based on the received feedback information, preparing to generate more effective sound and video. This improvement process contributes to improving the system's accuracy.
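
The specification does not describe how the feedback changes the generation algorithm. As a deliberately simple stand-in, the sketch below only tracks how often each piece of content calmed the baby and reports the content with the best success rate; the class name, the boolean feedback format, and the content identifiers are all assumptions. It does not retrain or modify any AI model.

```python
class FeedbackStore:
    """Per-content tally of caregiver feedback (hypothetical design)."""

    def __init__(self):
        self._played = {}
        self._calmed = {}

    def record(self, content_id, calmed):
        """Store one feedback item: did this content calm the baby?"""
        self._played[content_id] = self._played.get(content_id, 0) + 1
        self._calmed[content_id] = (
            self._calmed.get(content_id, 0) + (1 if calmed else 0)
        )

    def success_rate(self, content_id):
        """Fraction of plays that calmed the baby, or None if never played."""
        played = self._played.get(content_id, 0)
        if played == 0:
            return None
        return self._calmed[content_id] / played

    def best_content(self):
        """Content with the highest success rate, or None if no feedback."""
        if not self._played:
            return None
        return max(self._played, key=self.success_rate)
```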

[0279] (Example 1)

[0280] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0281] There is a need to reduce the burden on caregivers regarding their babies' crying and to provide the most appropriate response for each individual baby. However, conventional technology has not been sufficient to effectively analyze crying and provide appropriate content. Therefore, the challenge is to realize an efficient system that collects and analyzes babies' crying and generates and plays content optimized to soothe them.

[0282] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0283] In this invention, the server includes a voice acquisition means for collecting sound, an information processing means for analyzing the sound and extracting features, and a correspondence means for matching the features against accumulated information. This makes it possible to respond efficiently and individually to a baby's crying, generate and play back appropriate content, and reduce the burden on the caregiver.

[0284] The "voice acquisition means" refers to a device or function for collecting sound from the surroundings.

[0285] The "information processing means" refers to a technology or process for analyzing the collected sound and extracting features therefrom.

[0286] The "correspondence means" refers to a function for confirming and comparing the extracted features with the accumulated information.

[0287] The "information generation means" refers to a technology for newly generating content such as sound and video based on the correspondence result.

[0288] The "information presentation means" refers to a device or function for allowing a user to view the generated content.

[0289] The "function of removing noise from sound and shaping the sound" refers to a technology for removing unnecessary noise from the voice and arranging it as accurate voice data.

[0290] The "evaluation information" refers to reaction and feedback data obtained from a user, and is used to improve the generation technology.

[0291] The present invention is a system that collects and analyzes a baby's voice and plays back generated content to the baby in order to respond effectively to the baby's crying. The system has two main components: a terminal and a server.

[0292] The device is placed near the baby and collects ambient sounds using voice acquisition technology. In this process, it is crucial to use a microphone to capture voices with high sensitivity and accuracy. The collected voice data is initially processed within the device by a voice processing library that performs noise reduction and voice normalization. This allows for accurate capture of crying even in noisy environments.

[0293] The initially processed audio data is transmitted to the server via the network. The server analyzes this audio using information processing tools and extracts features. The analysis software used is a speech recognition engine, which performs time-frequency analysis of the audio waveform. This clearly extracts features such as the pitch, intensity, and pattern of crying sounds.
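
Time-frequency analysis of a waveform is commonly done with a short-time Fourier transform. The sketch below is a minimal, unoptimized version using only the standard library: it splits the signal into overlapping frames, applies a Hann window, computes a direct DFT per frame, and reports the dominant frequency of each frame. The function names, frame size, and hop size are assumptions; the specification does not identify the analysis method beyond "time-frequency analysis", and a real system would use an FFT library for speed.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 computed directly (O(N^2))."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(samples, frame_size, hop):
    """List of per-frame magnitude spectra using a Hann window."""
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1))
        for i in range(frame_size)
    ]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        frames.append(dft_magnitudes([x * w for x, w in zip(frame, window)]))
    return frames


def dominant_frequencies(samples, sample_rate, frame_size=64, hop=32):
    """Frequency in Hz of the strongest bin in each frame."""
    result = []
    for mags in spectrogram(samples, frame_size, hop):
        k = max(range(len(mags)), key=lambda i: mags[i])
        result.append(k * sample_rate / frame_size)
    return result
```

Tracking the dominant frequency and its magnitude over successive frames yields the pitch, intensity, and pattern information mentioned above in a single pass.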

[0294] After extracting the features, the server compares them with a cloud-based database through the correspondence means. Here, an SQL database system is used to identify similar cases by comparing and matching the features against a vast amount of historical data. Based on the effective patterns found through this matching, information generation tools utilizing generative AI technology generate appropriate audio and video.

[0295] The generated content is sent back to the terminal and played to the baby through the information presentation means. Speakers and a display are used to present sounds and images that capture the baby's attention. For example, if the baby is calmed by classical music, the prompt "Generate a calming melody based on classical music" can be entered into the generative AI.

[0296] Users observe their baby's reaction to the generated content and send feedback to the server via the device's interface. This feedback information is used as evaluation data to improve the generation algorithm on the server, enabling more personalized responses. In this way, an optimal and comfortable childcare environment can be provided for each individual baby.
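As a non-limiting illustration, the use of feedback as evaluation data can be sketched as a per-baby record of which content styles calmed the baby. The specification does not state how the generation algorithm is improved, so the class name, the success-rate statistic, and the neutral prior of 0.5 for untried styles are assumptions; the sketch only tracks preferences and does not retrain any model.

```python
class FeedbackStore:
    """Hypothetical store of caregiver feedback, keyed by (baby, content style)."""

    def __init__(self, prior=0.5):
        self._prior = prior   # assumed score for styles with no feedback yet
        self._plays = {}      # (baby_id, style) -> number of playbacks
        self._calmed = {}     # (baby_id, style) -> playbacks that calmed the baby

    def record(self, baby_id, style, calmed):
        """Store one feedback item sent from the terminal."""
        key = (baby_id, style)
        self._plays[key] = self._plays.get(key, 0) + 1
        self._calmed[key] = self._calmed.get(key, 0) + (1 if calmed else 0)

    def success_rate(self, baby_id, style):
        """Fraction of playbacks of this style that calmed this baby."""
        key = (baby_id, style)
        plays = self._plays.get(key, 0)
        if plays == 0:
            return self._prior
        return self._calmed[key] / plays

    def best_style(self, baby_id, candidates):
        """Pick the candidate style with the highest success rate for this baby."""
        return max(candidates, key=lambda style: self.success_rate(baby_id, style))


if __name__ == "__main__":
    store = FeedbackStore()
    store.record("baby-1", "classical", True)
    store.record("baby-1", "classical", True)
    store.record("baby-1", "nature", False)
    print(store.best_style("baby-1", ["nature", "classical"]))
```

Keeping the statistics per baby reflects the personalization described above; the chosen style could then be written into the prompt for the next generation request.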

[0297] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0298] Step 1:

[0299] The device is placed near the baby and collects ambient sounds using a microphone. The input is ambient noise, including the baby's crying. The device uses an audio processing library to denoise and normalize this audio data. This results in noise-removed and normalized crying data as output.

[0300] Step 2:

[0301] This initially processed audio data is sent to a server via the internet. The server receives this audio data as input and performs analysis using a speech recognition engine. During the analysis, features of the crying sound, specifically pitch, intensity, and pattern, are extracted. The output here is the extracted features.

[0302] Step 3:

[0303] The server matches the extracted features against a database in the cloud. The input is the extracted features, which are compared and matched against similar historical data using an SQL database system. This process identifies audio and video patterns that were effective in similar cases. The output is the identified audio and video patterns.

[0304] Step 4:

[0305] Based on the matching results, the server uses a generative AI model to generate new audio and video. The input is a specific pattern, which the generative AI model operates on to create special content suitable for babies. The output is the generated audio and video content.

[0306] Step 5:

[0307] The generated content is sent back to the terminal and played there. The terminal uses speakers and a display to present this content to the baby. The output is the audio and video being played, aiming to stop the baby from crying.

[0308] Step 6:

[0309] The user observes the baby's reaction to the played content and returns the result as feedback to the server through the terminal interface. The input is the baby's reaction information, based on which the server's generative AI algorithm is improved. The output is the improved generation algorithm, which can be used for more effective content generation.

[0310] (Application Example 1)

[0311] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".

[0312] In modern society, reducing the burden of child-rearing is an important issue. Especially in response to a baby's crying, an effective and individualized approach is required, but traditional methods have limitations. Also, a situation where a baby won't stop crying causes stress for parents, and improvement of the child-rearing environment is needed. To solve these problems, the construction of a system that can accurately analyze a baby's crying and take appropriate actions immediately is required.

[0313] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0314] In this invention, the server includes an acoustic input means for collecting baby's voice information, an information processing means for analyzing the voice information and extracting attributes, and a matching means for matching the attributes with an information recording medium. This makes it possible to quickly analyze a baby's cries and respond appropriately to each individual baby.

[0315] "Acoustic input means" refers to a device used to collect sound information from the baby's surroundings, and includes microphones and sound sensors.

[0316] An "information processing device" is a device that has the function of extracting attributes based on collected audio information, and utilizes digital signal processing technology.

[0317] A "matching device" is a device for comparing and referencing extracted attributes with existing information recording media.

[0318] A "generation means" is a device that has the function of generating optimal audio or video based on the matching results, and utilizes generative AI technology.

[0319] "Presentation means" refers to a device that allows a baby to view generated audio and video, and includes speakers and displays.

[0320] A "control system" is a system for managing this series of processes and controlling the coordinated operation of each device.

[0321] An "automated machine" is a device such as a robot that automatically generates and presents sounds and images based on a baby's cries.

[0322] "Noise reduction" is a technique for removing unwanted noise from audio information and extracting clear speech.

[0323] "Speech standardization" is the process of organizing speech information collected under different circumstances and environments into a consistent format.

[0324] "Evaluation information" refers to reaction data and feedback collected from users, which is used to improve the generation process.

[0325] "Generation procedure" refers to the steps of the process or algorithm used to generate audio or video.

[0326] The system that implements this application collects and analyzes a baby's cries, and generates and presents appropriate content to soothe the baby. The entire system operates through the cooperation of three parties: the server, the terminal, and the user.

[0327] The server plays a central role in receiving and processing the baby's voice information. The collected voice information undergoes noise reduction and voice standardization on the server, and voice attributes are extracted by information processing tools. These extracted attributes are then compared with information recording media to generate optimal audio and video. A generative AI model is used as the generation tool, creating content tailored to the baby's preferences based on specific prompts.

[0328] The terminal is placed near the baby and collects crying sounds using an acoustic input device. The terminal transmits the collected audio information to the server, then receives the generated content from the server and presents it to the baby. Speakers and displays are used as the presentation means.

[0329] Users observe the baby's reactions via their device and send evaluation information to the server. This evaluation information is used to improve the generation process.

[0330] For example, if a baby starts crying in the middle of the night, the terminal detects the crying and sends it to the server. A generative AI model then generates a calming melody or a video with nature sounds, which is played to the baby. An example of a prompt would be, "Analyze the baby's crying and generate a calming sound with the following characteristics: gentle tone, calming melody, nature sounds."
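As a non-limiting illustration, a prompt of the form quoted above can be assembled from a list of desired characteristics. The function name and its parameters are assumptions made for this sketch; only the wording of the resulting prompt follows the example given above.

```python
def build_prompt(characteristics, subject="the baby's crying"):
    """Assemble a generation prompt from a list of desired sound characteristics."""
    cleaned = [c.strip() for c in characteristics if c and c.strip()]
    if not cleaned:
        raise ValueError("at least one characteristic is required")
    return (
        f"Analyze {subject} and generate a calming sound with the following "
        f"characteristics: {', '.join(cleaned)}."
    )


if __name__ == "__main__":
    print(build_prompt(["gentle tone", "calming melody", "nature sounds"]))
```

In such a design the characteristics list would come from the matching result, so the same template can yield a different prompt for each baby.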

[0331] This system provides content optimized for each individual baby, reducing the burden on caregivers and enabling the creation of a more comfortable childcare environment.

[0332] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0333] Step 1:

[0334] The device collects crying sounds using an acoustic input device placed near the baby. The input at this time is audio information including the baby's crying. The collected audio information is denoised and standardized within the device. The output is noise-free and standardized audio data.

[0335] Step 2:

[0336] The terminal sends the processed audio data to the server. The input is the audio data obtained by the terminal, and the same audio data is sent to the server. By the time the audio data arrives at the server, it is ready for analysis.

[0337] Step 3:

[0338] The server analyzes the received audio data using information processing tools and extracts audio attributes. The input is audio data, and the output is audio attributes such as pitch, intensity, and pattern. The server uses digital signal processing technology to detect audio attributes in detail.

[0339] Step 4:

[0340] The server compares the extracted audio attributes with the information recording medium. The input is the audio attributes, and the output is the result of matching them with similar attributes referenced from the information recording medium. Using this result, the server determines what kind of content is appropriate.

[0341] Step 5:

[0342] The server generates optimal audio and video content using a generative AI model based on the matching results. The input is the matching results, and prompts are used to generate content suitable for babies. The output is the generated audio and video. Specific prompts reflecting concrete examples are used during generation.

[0343] Step 6:

[0344] The server sends the generated audio and video to the terminal. The input is the generated content, and the output is also that content. The transmitted content is ready for playback on the terminal.

[0345] Step 7:

[0346] The terminal plays the received audio and video to the baby using the presentation means. The input is the content data, and the output is the sound and images that the baby perceives and reacts to. At this stage the speaker and display are actually driven.

[0347] Step 8:

[0348] The user observes the baby's reactions, collects evaluation information, and feeds it back to the server. The input is the baby's reaction data, and the output is evaluation information. The server uses this information to obtain data to improve the generation procedure.
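As a non-limiting illustration, the evaluation information fed back to the server in Step 8 can be sketched as a small JSON message. The specification does not define a message format, so the class name and every field name below are hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class EvaluationInfo:
    """Hypothetical evaluation record entered by the user on the terminal."""
    content_id: str                             # which generated content was played
    baby_calmed: bool                           # whether the baby stopped crying
    seconds_until_calm: Optional[float] = None  # observed time to calm, if known
    comment: str = ""                           # free-text remark from the user


def encode_evaluation(info):
    """Serialize the record for transmission from the terminal to the server."""
    return json.dumps(asdict(info), sort_keys=True)


def decode_evaluation(message):
    """Rebuild the record on the server side."""
    return EvaluationInfo(**json.loads(message))


if __name__ == "__main__":
    sent = EvaluationInfo("content-001", True, 42.5, "calmed quickly")
    print(encode_evaluation(sent))
```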

[0349] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0350] This invention provides a system that collects and analyzes baby cries, generating and outputting appropriate sounds and images based on the analysis results, combined with an emotion engine that recognizes the user's emotions. This enables optimal responses that take into account not only the baby's emotional state but also the caregiver's emotional state.

[0351] The device uses a microphone to acquire audio data in order to properly capture the baby's cries. In addition, it simultaneously collects the user's voice and facial expression data using a camera.

[0352] The audio data is sent to a server where the characteristics of crying sounds are extracted. These characteristics are compared with past cases to formulate an appropriate response. The server also uses an emotion engine to analyze the user's emotions and identify the level of stress and changes in mood.

[0353] The server generates sounds and images optimized not only for the baby's condition but also for the user's emotions. For example, if the server determines the user is fatigued, it will select sounds that promote relaxation. These results are then sent to the device.

[0354] The device plays audio and video near the baby to ensure effective functionality. The user observes the effect after playback and provides feedback through the device's interface. This feedback is returned to the server and used to improve the generation algorithm.

[0355] For example, when a parent is dealing with their baby's nighttime crying, the system can analyze the baby's cries and the parent's stress level, and then provide relaxing videos accompanied by calming music. In this way, the entire system can reduce the psychological burden on both the baby and the caregiver, and optimize the childcare environment.

[0356] The following describes the processing flow.

[0357] Step 1:

[0358] The device uses a microphone to collect the baby's cries and simultaneously records the user's facial expressions and voice using a camera. This data is stored in both audio and video formats.

[0359] Step 2:

[0360] The device performs noise reduction and normalization on the collected audio data to generate clean data. This data is prepared for analysis while retaining the characteristics of a baby's crying.

[0361] Step 3:

[0362] The device sends pre-processed baby voice data and user video data to the server. The data is transmitted using a secure communication protocol.

[0363] Step 4:

[0364] When the server receives baby voice data, it uses a voice analysis algorithm to extract features of the crying. These features include elements such as pitch, intensity, and rhythm.

[0365] Step 5:

[0366] The server analyzes the user's video data and uses an emotion engine to identify the user's emotional state (e.g., joy, fatigue, stress). This information is used to adjust the response to the baby.

[0367] Step 6:

[0368] The server considers the baby's features and the user's emotional state and matches them against a database. This allows it to compare with similar past cases and plan the generation of appropriate sounds and images.

[0369] Step 7:

[0370] The server uses generative AI technology to generate sounds and images optimized to soothe a baby. This generation reflects the user's categorized emotional information. For example, if the user is fatigued, calming music and relaxation videos will be generated.

[0371] Step 8:

[0372] The server transmits the generated audio and video to the terminal. Secure data transfer is also performed during this process.

[0373] Step 9:

[0374] The device plays back received sounds and images towards the baby. Playback is performed using a dedicated audio system and display, encouraging the baby to respond.

[0375] Step 10:

[0376] The user observes the baby's reaction during and after playback and evaluates its effectiveness. The evaluation results are entered into the device as feedback.

[0377] Step 11:

[0378] The device sends the feedback entered by the user to the server. This feedback is used to further improve the generation algorithm and thereby the quality of content generated in the future.

[0379] (Example 2)

[0380] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0381] In childcare, it is necessary to comprehensively consider both the baby's cries and emotional state, as well as the caregiver's emotional state. However, conventional techniques only address the baby's condition, making it difficult to alleviate the caregiver's mental burden. Therefore, there is a need for a system that allows for appropriate intervention considering the conditions of both the baby and the caregiver.

[0382] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0383] In this invention, the server includes sound acquisition means for collecting acoustic data, information processing means for analyzing the acoustic data and extracting features, and video analysis means for collecting user video data and analyzing emotions. This makes it possible to generate appropriate sounds and videos that take into account the state of both the baby and the caregiver.

[0384] "Sound acquisition means" refers to devices or systems for acquiring baby cries and other sound information.

[0385] "Information processing means" refers to a device or system that has the function of analyzing acquired acoustic data and extracting characteristic information from it.

[0386] The "recording section" is a database that stores past acoustic data and feature quantities, and holds information that should be cross-referenced.

[0387] A "matching means" is a system that has the function of comparing extracted feature quantities with data stored in the recording unit and evaluating the match or similarity.

[0388] "Video analysis means" refers to a system that detects the user's facial expressions and movements to determine their emotional state.

[0389] "Generation means" refers to technologies and systems for creating appropriate audio and video content based on analysis results.

[0390] "Means of expression" refers to a device or apparatus for providing generated audio and video content to users or babies.

[0391] "Noise reduction" is the process of removing unwanted background noise and other sounds from audio data to obtain clear audio information.

[0392] "Standardization" is the process of adjusting the volume and quality of audio data to a certain standard.

[0393] "Evaluation information" refers to feedback provided by users, including their reactions to the generated content and information on areas for improvement.

[0394] "Improving the generation method" refers to the process of optimizing the audio and video generation algorithms and systems based on evaluation information.

[0395] This invention is a system for optimizing the childcare environment, providing appropriate sound and visuals by analyzing the baby's cries and the caregiver's emotional state. Its main components include a terminal, a server, and a user.

[0396] The device is equipped with a microphone to collect the baby's cries and a camera to record the caregiver's facial expressions. This device is designed to acquire and process crying and video data in real time. The acquired data is encrypted and transmitted to a server via the internet.

[0397] The server has an information processing device that extracts feature quantities from the acoustic data and compares them with the database in the recording unit. Furthermore, the server analyzes the caregiver's video data and evaluates their emotional state using an emotion engine. This allows the baby's condition and the caregiver's emotions to be considered simultaneously, and an AI model is used to create optimal acoustic and video content.

[0398] The generated content is sent from the server to the device and played near the baby. This aims to soothe the baby and reduce stress for caregivers by providing appropriate media.

[0399] For example, if a baby starts crying in the middle of the night, the system sends crying data to a server and also analyzes the caregiver's stress level. As a result, if the baby is hungry and the caregiver is exhausted, calming music and videos encouraging feeding will be played on the device.
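As a non-limiting illustration, the combination of the baby's estimated state and the caregiver's emotional state into a content plan can be sketched as a lookup table. The table entries other than the hungry and exhausted case described above, the default plan, and all names are assumptions made for this sketch.

```python
# Hypothetical rules: (baby state, caregiver state) -> content to generate.
CONTENT_RULES = {
    ("hungry", "exhausted"): {
        "audio": "calming music",
        "video": "a video encouraging feeding",
    },
    ("uncomfortable", "stressed"): {
        "audio": "soft nature sounds",
        "video": "a slow-moving relaxation video",
    },
}

# Assumed fallback when no rule matches.
DEFAULT_PLAN = {"audio": "calming music", "video": "a relaxing video"}


def plan_content(baby_state, caregiver_state):
    """Return a copy of the plan for this combination, or the default plan."""
    return dict(CONTENT_RULES.get((baby_state, caregiver_state), DEFAULT_PLAN))


def plan_to_prompt(plan):
    """Turn a plan into a prompt for a generative AI model."""
    return (
        f"Generate {plan['audio']} and {plan['video']} "
        "for a crying baby and their caregiver."
    )


if __name__ == "__main__":
    print(plan_to_prompt(plan_content("hungry", "exhausted")))
```

A fixed table is the simplest possible policy; a deployed system would presumably derive these pairings from the database of past cases and from user feedback.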

[0400] Examples of prompts for a generative AI model include the following:

[0401] "Please tell me how to develop an algorithm that automatically provides appropriate responses based on a baby's crying."

[0402] "Please provide information on methods for creating content that takes into account the emotional state of caregivers."

[0403] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0404] Step 1:

[0405] The device uses a microphone to collect acoustic data of the baby crying. This acoustic data is acquired as a raw audio signal. The collected acoustic data is temporarily stored in the device's internal memory and preprocessed, for example by audio cleanup and noise reduction. The output is noise-reduced, volume-normalized acoustic data.

[0406] Step 2:

[0407] The device uses its camera to collect user facial expression data. This results in still images or video data to understand the user's current emotional state. The facial expression data is classified into basic emotion labels (e.g., joy, surprise, sadness) using a facial recognition algorithm. The output of this process is tag information indicating the user's emotional state.

[0408] Step 3:

[0409] The device sends pre-processed acoustic data and user emotion state tags to the server. The data is securely transferred to the server via the internet. When this data arrives at the server as input, the server begins the process of extracting features from the acoustic data. The output is stored on the server as a feature vector.

[0410] Step 4:

[0411] The server matches the extracted feature vector against the database in the recording unit. A machine learning algorithm is used for this matching to identify the closest crying pattern among past cases. The output of this matching is data indicating the type of crying, labeled with a classification such as "hungry" or "uncomfortable."

[0412] Step 5:

[0413] The server integrates classification labels for audio data with the user's emotional state and uses a generative AI model to generate appropriate audio and video content. For example, if a baby's crying indicates "hunger" and the user is "fatigued," the server will generate calming music and a video encouraging breastfeeding. The output consists of playable audio and video files.

[0414] Step 6:

[0415] The server sends the generated audio and video files to the device. The device receives these files and plays the audio and video near the baby. The user can review the content and adjust the volume as needed during playback.

[0416] Step 7:

[0417] Users observe their baby's reaction to the provided content and provide feedback through the device's interface. This input data is sent back to the server and used to improve the algorithm of the generative AI model. The feedback contributes to improving the accuracy of future content generation.
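As a non-limiting illustration, the matching in Step 4, in which a feature vector is compared with past cases and labeled with a cry type, can be sketched with a k-nearest-neighbour vote. The specification says only that a machine learning algorithm is used; the choice of k-nearest neighbours, the function name, the two-dimensional feature vectors scaled to the range 0 to 1, and the sample cases are assumptions.

```python
import math
from collections import Counter


def classify_cry(feature_vector, labelled_cases, k=3):
    """Label a feature vector by majority vote among its k nearest past cases.

    labelled_cases is a list of (feature_vector, label) pairs. Ties are
    resolved in favour of the label whose case is nearest.
    """
    if not labelled_cases:
        raise ValueError("no past cases to compare against")
    ranked = sorted(
        labelled_cases,
        key=lambda case: math.dist(feature_vector, case[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    past_cases = [
        ((0.90, 0.80), "hungry"),
        ((0.85, 0.75), "hungry"),
        ((0.20, 0.30), "uncomfortable"),
        ((0.25, 0.20), "uncomfortable"),
        ((0.30, 0.35), "uncomfortable"),
    ]
    print(classify_cry((0.88, 0.79), past_cases))
```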

[0418] (Application Example 2)

[0419] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0420] Conventional systems for dealing with crying babies only respond to the baby's voice and have the drawback of not being able to take into account the emotional state of the caregiver at the time. As a result, even if appropriate sounds and images are provided based on the baby's condition, they may not adequately address the reduction of the caregiver's stress. This problem needs to be solved, and the entire childcare environment needs to be improved.

[0421] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0422] In this invention, the server includes a recognition means for recognizing the emotional state of the caregiver, an adjustment means for adjusting the generated acoustic and visual information based on the emotional information obtained by the recognition means, and a sound acquisition means for collecting the baby's voice information. This makes it possible to provide appropriate acoustic and visual information that takes into account both the baby's state and the caregiver's emotions.

[0423] A "sound acquisition device" is a mechanism used to collect the sounds of a baby crying, and is a device that has the function of accurately capturing sound data.

[0424] An "information processing means" is a mechanism that analyzes collected audio information and extracts feature information; it is a device that extracts necessary features from audio data and obtains information useful for subsequent processing.

[0425] A "comparison means" is a mechanism for comparing extracted feature information with a recording medium to determine whether there is a match or a difference.

[0426] A "generation means" is a mechanism for creating appropriate acoustic and visual information based on the matching results, and is a device that creates content to improve the user experience.

[0427] A "playback mechanism" is a device for presenting generated audio and visual information to the user.

[0428] A "recognition device" is a mechanism designed to read the emotional state of a caregiver, and it is a device that analyzes facial expressions and voice tone to determine their psychological state.

[0429] A "regulation mechanism" is a system for optimizing generated acoustic and visual information based on the emotional information of the caregiver.

[0430] This invention provides an implementation system that offers optimal acoustic and visual information based on the conditions of both the baby and the caregiver. This section describes the hardware and software responsible for the system's operation.

[0431] To collect the baby's cries, the server receives audio data acquired by the terminal. As the sound acquisition means, the terminal is equipped with a highly sensitive microphone that accurately captures the baby's voice while suppressing ambient noise and other sounds.

[0432] The acquired audio data is sent to a server and analyzed using information processing tools. This process utilizes Google Cloud Speech-to-Text, advanced speech recognition software, to extract characteristic audio information. The information processing tools also include functions for suppressing background noise and standardizing the audio.

[0433] The server uses the extracted feature information to compare it with data stored on the recording medium. The comparison means analyzes the patterns of the baby's cries and determines the appropriate response.

[0434] Furthermore, the server uses recognition means to analyze video data transmitted from the terminal in order to recognize the emotional state of the caregiver. IBM Watson Tone Analyzer is used for this emotion analysis. The emotional information obtained by the recognition means is used to customize the acoustic and visual information generated by the adjustment means.

[0435] The Adobe Premiere Pro API is used to generate audio and visual information, creating content tailored to the condition of the baby and the caregiver. The generated content is transmitted to the terminal and output around the baby by the playback mechanism.

[0436] As a concrete example, when a baby starts crying, the system uses a microphone to collect the sound and sends it to the server for analysis. At the same time, a camera captures the caregiver's facial expressions, and the server analyzes their emotional state. Based on the analysis results, music that helps the baby relax and videos that reduce the caregiver's stress are generated.

[0437] An example of a prompt message is: "The baby is crying. Please immediately generate relaxing music and videos, and provide them, adjusting them according to the caregiver's stress level."

[0438] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0439] Step 1:

[0440] The device uses a microphone to capture the baby's cries. It receives ambient sound data as input and uses noise reduction technology to clearly extract the baby's cries. This processing yields audio data in a specific frequency band.

[0441] Step 2:

[0442] The device sends the acquired audio data to the server. The server uses Google Cloud Speech-to-Text to analyze the audio data and extract feature information. This generates specific data such as the pattern and volume of the voice, and the duration of the crying.

[0443] Step 3:

[0444] The server compares the extracted feature information with the database on the recording medium. Using the feature information as input, it determines the corresponding action by comparing it with past crying patterns. This results in obtaining appropriate music and video pattern data.

[0445] Step 4:

[0446] The device uses a camera to capture the caregiver's facial expressions in real time and sends the video data to a server. The server uses IBM Watson Tone Analyzer to analyze the emotional state. It receives facial expression data as input and outputs information about the user's emotional state (e.g., stress level and fatigue level).

[0447] Step 5:

[0448] The server integrates voice and emotion analysis results and generates customized audio and visual information using the Adobe Premiere Pro API. Based on prompts, it utilizes a generative AI model to create relaxation music and video content suitable for babies and caregivers.

[0449] Step 6:

[0450] The device receives the generated audio and visual information and outputs it through its speaker and display. This provides soothing music and images around the baby, allowing the user to experience psychological relaxation.
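As a non-limiting illustration, the extraction of audio data in a specific frequency band in Step 1 can be sketched with a second-order band-pass filter. The specification does not name the noise reduction technique or the band, so the biquad design (coefficients from the widely used Audio EQ Cookbook band-pass form with unity peak gain), the 450 Hz centre frequency, the Q of 1.0, and the 8000 Hz sample rate are all assumptions.

```python
import math


def bandpass_coefficients(center_hz, q, sample_rate):
    """Normalized biquad band-pass coefficients (unity gain at the centre frequency)."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def bandpass(samples, center_hz=450.0, q=1.0, sample_rate=8000):
    """Filter samples with the direct-form difference equation of the biquad."""
    (b0, b1, b2), (a1, a2) = bandpass_coefficients(center_hz, q, sample_rate)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def rms(samples):
    """Root-mean-square amplitude, used to compare levels before and after filtering."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


if __name__ == "__main__":
    sr = 8000
    hum = [math.sin(2 * math.pi * 50 * n / sr) for n in range(sr)]
    cry = [math.sin(2 * math.pi * 450 * n / sr) for n in range(sr)]
    print(rms(bandpass(hum)) / rms(hum), rms(bandpass(cry)) / rms(cry))
```

In this sketch a 50 Hz hum is strongly attenuated while a tone at the centre frequency passes almost unchanged, which is the behaviour the step describes in general terms.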

[0451] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0452] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions is input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0453] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0454] [Third Embodiment]

[0455] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0456] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0457] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0458] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0459] The microphone 238 accepts instructions from the user 20 by picking up the voice uttered by the user 20. The microphone 238 captures the voice uttered by the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0460] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0461] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0462] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0463] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0464] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0465] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0466] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0467] This invention is a system that soothes a crying baby using content generated by collecting and analyzing the baby's cries, and is implemented by installing the relevant program on each device.

[0468] The terminal (hereinafter also called the "device") is placed near the baby and collects audio data, including crying sounds, using a microphone. The collected data first undergoes noise reduction and audio normalization inside the device and is then sent to the server.

[0469] The server analyzes the received audio data and extracts features of crying, such as pitch, intensity, and pattern. Next, these features are compared with a vast amount of past baby crying data stored in a cloud-based database. Based on the comparison results, the server identifies sound and video patterns that are considered effective in calming crying babies, drawing from similar past cases.

[0470] Based on the identified patterns, the server uses generative AI technology to generate special audio and video content tailored to the baby's preferences. This generated content is then sent back to the device.

[0471] The device plays generated sounds and images towards the baby. The parent or guardian observes the baby's reaction to the played sounds and images and determines whether they were effective. These observations are sent as feedback to the server through the device's interface.

[0472] This feedback information will be used to further improve the server's AI generation algorithm, aiming to provide a personalized calming method for each baby through repeated use. This entire process will reduce the burden on caregivers and create a more comfortable environment for babies.

[0473] The following describes the processing flow.

[0474] Step 1:

[0475] The device uses a microphone placed in the environment where the baby is present to collect the baby's cries. During the collection process, ambient sounds are also recorded and stored as data.

[0476] Step 2:

[0477] The device performs noise reduction and audio normalization on the recorded audio data. This prepares the data for analysis.

[0478] Step 3:

[0479] The terminal sends the pre-processed audio data to the server. The data is transmitted over a communication network in real time.

[0480] Step 4:

[0481] The server analyzes the received audio data and extracts audio features (such as pitch, intensity, and temporal variation). Each feature is then quantified as digital data.

[0482] Step 5:

[0483] The server compares the extracted features with past baby data stored in the database. It searches for data with similar patterns and identifies the sound and video patterns that were effective in those similar cases.

[0484] Step 6:

[0485] The server uses database information and generative AI technology to generate sounds and images suitable for calming a crying baby. The generated results are then output as digital content.

[0486] Step 7:

[0487] The server transfers the generated audio and video data to the terminal. During this process, various protocols are used to ensure data reliability.

[0488] Step 8:

[0489] The device plays the received sound and video towards the baby. Playback is done through the speaker and display.

[0490] Step 9:

[0491] Users observe the effects of the played sounds and images on the baby, thereby evaluating the effectiveness of the sounds and images.

[0492] Step 10:

[0493] The user inputs the baby's response results as feedback into the device. This feedback is done via the UI and sent to the server.

[0494] Step 11:

[0495] The server updates its AI generation algorithm based on the received feedback information, preparing to generate more effective sound and video. This improvement process contributes to improving the system's accuracy.

[0496] (Example 1)

[0497] Next, Example 1 will be described. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."

[0498] There is a need to reduce the burden on caregivers regarding their babies' crying and to provide the most appropriate response for each individual baby. However, conventional technology has not been sufficient to effectively analyze crying and provide appropriate content. Therefore, the challenge is to realize an efficient system that collects and analyzes babies' crying and generates and plays content optimized to soothe them.

[0499] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0500] In this invention, the server includes a sound acquisition means for collecting sound, an information processing means for analyzing the sound and extracting features, and a correspondence means for matching the features with stored information. This makes it possible to efficiently respond to a baby's crying, generate and play appropriate content, and reduce the burden on caregivers.

[0501] "Sound acquisition means" refers to devices or functions for collecting sounds from the surrounding environment.

[0502] "Information processing means" refers to the technologies and processes used to analyze collected sounds and extract features from them.

[0503] "Correspondence means" refers to a function that checks and compares the extracted features against the accumulated information.

[0504] "Information generation means" refers to technology that generates new content such as audio and video based on the results of the correspondence.

[0505] "Information presentation means" refers to devices or functions that allow users to view generated content.

[0506] The "function to remove noise and shape sound" refers to technology that removes unwanted noise from audio and prepares it as accurate audio data.

[0507] "Evaluation information" refers to reactions and feedback data obtained from users, and is used to improve generation technology.

[0508] This invention relates to a system that collects and analyzes a baby's cries and plays generated content back to the baby in order to respond effectively to the crying. The system consists of two main components: a terminal and a server.

[0509] The device is placed near the baby and collects ambient sounds using the sound acquisition means. In this process, it is important to use a microphone that captures sound with high sensitivity and accuracy. The collected audio data is first processed within the device by an audio processing library that performs noise reduction and audio normalization. This allows crying to be captured accurately even in noisy environments.
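The on-device preprocessing described above can be illustrated with a minimal sketch. The text does not name the audio processing library or its algorithms, so the functions below (`noise_gate`, `peak_normalize`, `preprocess`) and the threshold value are hypothetical stand-ins: a simple amplitude gate in place of real noise reduction, and peak normalization in place of whatever normalization the terminal actually applies.

```python
from typing import List


def noise_gate(samples: List[float], threshold: float) -> List[float]:
    """Zero out samples quieter than `threshold` (a crude stand-in for noise reduction)."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def peak_normalize(samples: List[float], target_peak: float = 1.0) -> List[float]:
    """Scale the signal so that its largest magnitude equals `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def preprocess(samples: List[float], threshold: float = 0.05) -> List[float]:
    """Assumed order: gate out low-level noise first, then normalize the level."""
    return peak_normalize(noise_gate(samples, threshold))
```

For example, `preprocess([0.01, -0.02, 0.5, -0.25])` drops the two low-level samples and scales the rest so the loudest sample has magnitude 1.0.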

[0510] The initially processed audio data is transmitted to the server via the network. The server analyzes this audio using the information processing means and extracts features. The analysis software is a speech recognition engine that performs time-frequency analysis of the audio waveform. This extracts features such as the pitch, intensity, and pattern of the crying.
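As a rough illustration of extracting pitch and intensity from a waveform, the sketch below uses time-domain autocorrelation for pitch and root-mean-square (RMS) amplitude for intensity. This is a substitute technique chosen for brevity, not the time-frequency analysis engine the text refers to. The function names and the 200 to 1000 Hz search range are assumptions.

```python
import math
from typing import List


def rms_intensity(samples: List[float]) -> float:
    """Root-mean-square amplitude, used here as a simple intensity measure."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples: List[float], sample_rate: int,
                      fmin: float = 200.0, fmax: float = 1000.0) -> float:
    """Estimate the fundamental frequency from the strongest autocorrelation lag.

    The search range [fmin, fmax] is an assumed range, not taken from the text.
    Returns 0.0 when no positive correlation is found (for example, silence).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation; shorter lags sum more terms, which
        # favors the true period over its integer multiples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0
```

The two values could then be combined with a temporal descriptor (the "pattern") into the feature set that is matched against past cases.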

[0511] After extracting the features, the server compares them with a cloud-based database through the correspondence means. Here, an SQL database system is used to identify similar cases by matching the features against a large body of historical data. Based on the effective patterns found through this matching, the information generation means, which uses generative AI technology, generates appropriate audio and video.
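The SQL-based matching step might look like the following sketch, which uses Python's built-in `sqlite3` module with an in-memory database. The table name `past_cases`, its columns, and the distance weighting (the squared pitch difference is divided by 10000 so that a 100 Hz gap counts about as much as a full-scale intensity gap) are all assumptions made for illustration. The text does not specify a schema or a similarity measure.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def build_case_db(rows: Iterable[Tuple[float, float, str]]) -> sqlite3.Connection:
    """Create an in-memory table of past cases (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE past_cases "
        "(pitch_hz REAL, intensity REAL, effective_pattern TEXT)"
    )
    conn.executemany("INSERT INTO past_cases VALUES (?, ?, ?)", rows)
    return conn


def find_similar_pattern(conn: sqlite3.Connection,
                         pitch_hz: float, intensity: float) -> Optional[str]:
    """Return the pattern recorded for the closest past case, or None if empty."""
    row = conn.execute(
        """
        SELECT effective_pattern
        FROM past_cases
        ORDER BY (pitch_hz - ?) * (pitch_hz - ?) / 10000.0
               + (intensity - ?) * (intensity - ?)
        LIMIT 1
        """,
        (pitch_hz, pitch_hz, intensity, intensity),
    ).fetchone()
    return row[0] if row else None
```

Ordering by a distance expression and taking the first row is a nearest-neighbor lookup; a production database would likely need an index or a pre-filter to keep this fast over a large body of historical data.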

[0512] The generated content is sent back to the device and played to the baby via the information presentation means. Here, speakers and displays are used to provide sounds and images that capture the baby's attention. For example, if the baby is calmed by classical music, the prompt "Generate a calming melody based on classical music" can be entered into the generative AI.

[0513] Users observe their baby's reaction to the generated content and send feedback to the server via the device's interface. This feedback information is used as evaluation data to improve the generation algorithm on the server, enabling more personalized responses. In this way, an optimal and comfortable childcare environment can be provided for each individual baby.

[0514] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0515] Step 1:

[0516] The device is placed near the baby and collects ambient sounds using a microphone. The input is ambient noise, including the baby's crying. The device uses an audio processing library to denoise and normalize this audio data. This results in noise-removed and normalized crying data as output.

[0517] Step 2:

[0518] This initially processed audio data is sent to a server via the internet. The server receives this audio data as input and performs analysis using a speech recognition engine. During the analysis, features of the crying sound, specifically pitch, intensity, and pattern, are extracted. The output here is the extracted features.

[0519] Step 3:

[0520] The server matches the extracted features against a database in the cloud. The input is the extracted features, which are compared with similar historical data using an SQL database system. This process identifies the audio and video patterns that were effective in similar cases. The output is the identified audio and video patterns.

[0521] Step 4:

[0522] Based on the matching results, the server uses a generative AI model to generate new audio and video. The input is the identified pattern, from which the generative AI model creates content suitable for the baby. The output is the generated audio and video content.

[0523] Step 5:

[0524] The generated content is sent back to the device and played there. The device uses its speaker and display to present this content to the baby. The output is the audio and video being played, and the goal is to soothe the crying baby.

[0525] Step 6:

[0526] The user observes the baby's reaction to the played content and sends the results back to the server as feedback through the device's interface. The input is the baby's reaction information, which the server uses to improve its generative AI algorithm. The output is the improved generation algorithm, which can be used to generate more effective content.
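The text does not say how feedback changes the generation algorithm. As one deliberately simple possibility, the sketch below keeps a per-pattern tally of whether the baby calmed down and ranks candidate patterns by a smoothed success rate. The class `FeedbackStore`, its methods, and the use of add-one (Laplace) smoothing are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable


class FeedbackStore:
    """Tracks how often each content pattern calmed the baby (illustrative only)."""

    def __init__(self) -> None:
        self._plays = defaultdict(int)
        self._calmed = defaultdict(int)

    def record(self, pattern: str, calmed: bool) -> None:
        """Store one piece of caregiver feedback for a played pattern."""
        self._plays[pattern] += 1
        if calmed:
            self._calmed[pattern] += 1

    def score(self, pattern: str) -> float:
        """Smoothed success rate; a pattern with no feedback scores 0.5."""
        return (self._calmed[pattern] + 1) / (self._plays[pattern] + 2)

    def best(self, candidates: Iterable[str]) -> str:
        """Pick the candidate pattern with the highest smoothed success rate."""
        return max(candidates, key=self.score)
```

With repeated use, `best` increasingly favors the patterns that worked for this particular baby, which is the personalization effect the feedback loop aims for.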

[0527] (Application Example 1)

[0528] Next, Application Example 1 will be described. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."

[0529] In modern society, reducing the burden of childcare is a crucial issue. In particular, effective and individualized approaches are needed to address a baby's crying, but traditional methods have their limitations. Furthermore, a baby's persistent crying is a source of parental stress, highlighting the need for improvements to the childcare environment. To address these challenges, a system is needed that can accurately analyze a baby's crying and provide appropriate responses instantly.

[0530] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0531] In this invention, the server includes an acoustic input means for collecting baby's voice information, an information processing means for analyzing the voice information and extracting attributes, and a matching means for matching the attributes with an information recording medium. This makes it possible to quickly analyze a baby's cries and respond appropriately to each individual baby.

[0532] "Acoustic input means" refers to a device used to collect sound information from the baby's surroundings, and includes microphones and sound sensors.

[0533] "Information processing means" refers to a device that extracts attributes from the collected audio information, and utilizes digital signal processing technology.

[0534] "Matching means" refers to a device for comparing the extracted attributes against an existing information recording medium.

[0535] "Generation means" refers to a device that generates optimal audio or video based on the matching results, and utilizes generative AI technology.

[0536] "Presentation means" refers to a device that allows a baby to view generated audio and video, and includes speakers and displays.

[0537] A "control system" is a system for managing this series of processes and controlling the coordinated operation of each device.

[0538] An "automated machine" is a device such as a robot that automatically generates and presents sounds and images based on a baby's cries.

[0539] "Noise reduction" is a technique for removing unwanted noise from audio information and extracting clear speech.

[0540] "Speech standardization" is the process of organizing speech information collected under different circumstances and environments into a consistent format.

[0541] "Evaluation information" refers to reaction data and feedback collected from users, which is used to improve the generation process.

[0542] "Generation procedure" refers to the steps of the process or algorithm used to generate audio or video.

[0543] The system that implements this application collects and analyzes a baby's cries, and generates and presents appropriate content to soothe the baby. The entire system operates through the cooperation of three parties: the server, the terminal, and the user.

[0544] The server plays a central role in receiving and processing the baby's voice information. The collected voice information undergoes noise reduction and voice standardization on the server, and voice attributes are extracted by the information processing means. These extracted attributes are then compared with the information recording medium to generate optimal audio and video. A generative AI model is used as the generation means, creating content tailored to the baby's preferences based on specific prompts.

[0545] The device is placed near the baby and collects crying sounds using the acoustic input means. The device transmits the collected audio information to the server, then receives the generated content and presents it to the baby. Speakers and displays serve as the presentation means.

[0546] Users observe the baby's reactions via their device and send evaluation information to the server. This evaluation information is used to improve the generation process.

[0547] For example, if a baby starts crying in the middle of the night, the device detects the crying and sends it to the server. A generative AI model then generates calming melodies or videos accompanied by nature sounds, which are played to the baby. An example of a prompt would be, "Analyze the baby's crying and generate a calming sound with the following characteristics: gentle tone, calming melody, nature sounds."
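A prompt of the kind quoted above could be assembled from a list of desired characteristics, as in this minimal sketch. The helper name `build_calming_prompt` is hypothetical, and how the server actually derives the characteristics from the matching results is not described in the text.

```python
from typing import List


def build_calming_prompt(characteristics: List[str]) -> str:
    """Format the example prompt from the text with the given characteristics."""
    joined = ", ".join(characteristics)
    return (
        "Analyze the baby's crying and generate a calming sound "
        f"with the following characteristics: {joined}."
    )
```

Calling `build_calming_prompt(["gentle tone", "calming melody", "nature sounds"])` reproduces the example prompt given in the text.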

[0548] This system provides content optimized for each individual baby, reducing the burden on caregivers and enabling the creation of a more comfortable childcare environment.

[0549] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0550] Step 1:

[0551] The device collects crying sounds using an acoustic input device placed near the baby. The input at this time is audio information including the baby's crying. The collected audio information is denoised and standardized within the device. The output is noise-free and standardized audio data.

[0552] Step 2:

[0553] The terminal sends the processed audio data to the server. The input is the audio data obtained by the terminal, and the same audio data is sent to the server. By the time the audio data arrives at the server, it is ready for analysis.

[0554] Step 3:

[0555] The server analyzes the received audio data using information processing tools and extracts audio attributes. The input is audio data, and the output is audio attributes such as pitch, intensity, and pattern. The server uses digital signal processing technology to detect audio attributes in detail.

[0556] Step 4:

[0557] The server compares the extracted audio attributes with the information recording medium. The input is the audio attributes, and the output is the result of matching them with similar attributes referenced from the information recording medium. Using this result, the server determines what kind of content is appropriate.

[0558] Step 5:

[0559] The server generates optimal audio and video content using a generative AI model based on the matching results. The input is the matching results, and prompts that reflect the specific case are used to generate content suitable for the baby. The output is the generated audio and video.

[0560] Step 6:

[0561] The server sends the generated audio and video to the terminal. The input is the generated content, and the output is also that content. The transmitted content is ready for playback on the terminal.

[0562] Step 7:

[0563] The device plays the received audio and video back to the baby using the presentation means. The input is the content data, and the output is the audio and video presented to the baby. The speaker and display perform the actual playback.

[0564] Step 8:

[0565] The user observes the baby's reactions, collects evaluation information, and feeds it back to the server. The input is the baby's reaction data, and the output is evaluation information. The server uses this information to improve the generation procedure.

[0566] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0567] This invention provides a system that collects and analyzes baby cries, generating and outputting appropriate sounds and images based on the analysis results, combined with an emotion engine that recognizes the user's emotions. This enables optimal responses that take into account not only the baby's emotional state but also the caregiver's emotional state.

[0568] The device uses a microphone to acquire audio data in order to properly capture the baby's cries. In addition, it simultaneously collects the user's voice and facial expression data using a camera.

[0569] The audio data is sent to a server where the characteristics of crying sounds are extracted. These characteristics are compared with past cases to formulate an appropriate response. The server also uses an emotion engine to analyze the user's emotions and identify the level of stress and changes in mood.

[0570] The server generates sounds and images optimized not only for the baby's condition but also for the user's emotions. For example, if the server determines the user is fatigued, it will select sounds that promote relaxation. These results are then sent to the device.

[0571] The device plays the audio and video near the baby so that they are effective. The user observes the effect after playback and provides feedback through the device's interface. This feedback is returned to the server and used to improve the generation algorithm.

[0572] For example, when a parent is dealing with their baby's nighttime crying, the system can analyze the baby's cries and the parent's stress level, and then provide relaxing videos accompanied by calming music. In this way, the entire system can reduce the psychological burden on both the baby and the caregiver, and optimize the childcare environment.

[0573] The following describes the processing flow.

[0574] Step 1:

[0575] The device uses a microphone to collect the baby's cries and simultaneously records the user's facial expressions and voice using a camera. This data is stored in both audio and video formats.

[0576] Step 2:

[0577] The device performs noise reduction and normalization on the collected audio data to generate clean data. This data is prepared for analysis while retaining the characteristics of a baby's crying.

[0578] Step 3:

[0579] The device sends pre-processed baby voice data and user video data to the server. The data is transmitted using a secure communication protocol.

[0580] Step 4:

[0581] When the server receives baby voice data, it uses a voice analysis algorithm to extract features of the crying. These features include elements such as pitch, intensity, and rhythm.

[0582] Step 5:

[0583] The server analyzes the user's video data and uses an emotion engine to identify the user's emotional state (e.g., joy, fatigue, stress). This information is used to adjust the response to the baby.

[0584] Step 6:

[0585] The server considers the baby's features and the user's emotional state and matches them against a database. This allows it to compare with similar past cases and plan the generation of appropriate sounds and images.

[0586] Step 7:

[0587] The server uses generative AI technology to generate sounds and images optimized to soothe a baby. This generation reflects the user's categorized emotional information. For example, if the user is fatigued, calming music and relaxation videos will be generated.

[0588] Step 8:

[0589] The server transmits the generated audio and video to the terminal. Secure data transfer is also performed during this process.

[0590] Step 9:

[0591] The device plays back received sounds and images towards the baby. Playback is performed using a dedicated audio system and display, encouraging the baby to respond.

[0592] Step 10:

[0593] The user observes the baby's reaction during and after playback and evaluates its effectiveness. The evaluation results are entered into the device as feedback.

[0594] Step 11:

[0595] The device sends the user-entered feedback to the server. This feedback is used to further improve the generation algorithm and to raise the quality of future generated content.

[0596] (Example 2)

[0597] Next, Example 2 will be described. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."

[0598] In childcare, it is necessary to comprehensively consider both the baby's cries and emotional state, as well as the caregiver's emotional state. However, conventional techniques only address the baby's condition, making it difficult to alleviate the caregiver's mental burden. Therefore, there is a need for a system that allows for appropriate intervention considering the conditions of both the baby and the caregiver.

[0599] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0600] In this invention, the server includes acoustic data collection means for collecting acoustic data, information processing means for analyzing the acoustic data and extracting features, and video analysis means for collecting user video data and analyzing emotions. This makes it possible to generate appropriate sounds and videos that take into account the state of both the baby and the caregiver.

[0601] "Acoustic data collection means" refers to devices or systems for acquiring baby cries and other sound information.

[0602] "Information processing means" refers to a device or system that has the function of analyzing acquired acoustic data and extracting characteristic information from it.

[0603] The "recording section" is a database that stores past acoustic data and feature quantities, and holds information that should be cross-referenced.

[0604] A "matching means" is a system that has the function of comparing extracted feature quantities with data stored in the recording unit and evaluating the match or similarity.

[0605] "Video analysis means" refers to a system that detects the user's facial expressions and movements to determine their emotional state.

[0606] "Generation means" refers to technologies and systems for creating appropriate audio and video content based on analysis results.

[0607] "Means of expression" refers to a device or apparatus for providing generated audio and video content to users or babies.

[0608] "Noise reduction" is the process of removing unwanted background noise and other sounds from audio data to obtain clear audio information.

[0609] "Standardization" is the process of adjusting the volume and quality of audio data to a certain standard.

[0610] "Evaluation information" refers to feedback provided by users, including their reactions to the generated content and information on areas for improvement.

[0611] "Improving the generation method" refers to the process of optimizing the audio and video generation algorithms and systems based on evaluation information.

[0612] This invention is a system for optimizing the childcare environment, providing appropriate sound and visuals by analyzing the baby's cries and the caregiver's emotional state. Its main components include a terminal, a server, and a user.

[0613] The device is equipped with a microphone to collect the baby's cries and a camera to record the caregiver's facial expressions. This device is designed to acquire and process crying and video data in real time. The acquired data is encrypted and transmitted to a server via the internet.

[0614] The server has an information processing device that extracts feature quantities from the acoustic data and compares them with the database in the recording unit. Furthermore, the server analyzes the caregiver's video data and evaluates their emotional state using an emotion engine. This allows the baby's condition and the caregiver's emotions to be considered simultaneously, and an AI model is used to create optimal acoustic and video content.

[0615] The generated content is sent from the server to the device and played near the baby. This aims to soothe the baby and reduce stress for caregivers by providing appropriate media.

[0616] For example, if a baby starts crying in the middle of the night, the system sends crying data to a server and also analyzes the caregiver's stress level. As a result, if the baby is hungry and the caregiver is exhausted, calming music and videos encouraging feeding will be played on the device.

[0617] Examples of prompts for a generative AI model include the following:

[0618] "Please tell me how to develop an algorithm that automatically provides appropriate responses based on a baby's crying."

[0619] "Please provide information on methods for creating content that takes into account the emotional state of caregivers."

[0620] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0621] Step 1:

[0622] The device uses a microphone to collect acoustic data of the baby crying. This acoustic data is acquired as a raw audio signal. The collected acoustic data is temporarily stored in the device's internal memory and preprocessed with steps such as noise reduction and volume normalization. The output is noise-free, volume-normalized acoustic data.

[0623] Step 2:

[0624] The device uses its camera to collect the user's facial expression data. This yields still images or video data for understanding the user's current emotional state. The facial expression data is classified into basic emotion labels (e.g., joy, surprise, sadness) using a facial expression recognition algorithm. The output of this process is tag information indicating the user's emotional state.

[0625] Step 3:

[0626] The device sends pre-processed acoustic data and user emotion state tags to the server. The data is securely transferred to the server via the internet. When this data arrives at the server as input, the server begins the process of extracting features from the acoustic data. The output is stored on the server as a feature vector.

[0627] Step 4:

[0628] The server matches the extracted feature vectors with the recording unit's database. A machine learning algorithm is used for this matching to identify the closest crying pattern from past cases. The output of this matching process is data indicating the type of crying, and is labeled with classifications such as "hungry" or "uncomfortable."

[0629] Step 5:

[0630] The server integrates the classification label for the acoustic data with the user's emotional state and uses a generative AI model to generate appropriate audio and video content. For example, if the baby's crying is labeled "hungry" and the user is "fatigued," the server will generate calming music and a video encouraging feeding. The output consists of playable audio and video files.

[0631] Step 6:

[0632] The server sends the generated audio and video files to the device. The device receives these files and plays the audio and video near the baby. The user can review the content and adjust the volume as needed during playback.

[0633] Step 7:

[0634] Users observe their baby's reaction to the provided content and provide feedback through the device's interface. This input data is sent back to the server and used to improve the algorithm of the generative AI model. The feedback contributes to improving the accuracy of future content generation.
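Steps 4 and 5 above (matching a feature vector to a labeled past case, then combining the cry label with the caregiver's emotion tag) can be sketched as follows. The text says only that "a machine learning algorithm" is used, so the 1-nearest-neighbor classifier here is an assumed stand-in. The two-element feature vector (pitch in Hz, intensity), the sample cases, and the fallback content plan are likewise hypothetical. Only the "hungry" plus "fatigued" case comes from the text.

```python
import math
from typing import Dict, List, Sequence, Tuple

LabelledCase = Tuple[Sequence[float], str]

# Hypothetical past cases: (pitch in Hz, intensity) mapped to a cry label.
EXAMPLE_CASES: List[LabelledCase] = [
    ((320.0, 0.4), "hungry"),
    ((520.0, 0.9), "uncomfortable"),
]


def classify_cry(feature_vector: Sequence[float],
                 cases: List[LabelledCase]) -> str:
    """1-nearest-neighbor: return the label of the closest past case.

    Raw Euclidean distance lets the pitch dimension dominate; a real system
    would probably scale the features first.
    """
    nearest = min(cases, key=lambda case: math.dist(feature_vector, case[0]))
    return nearest[1]


def plan_content(cry_label: str, user_emotion: str) -> Dict[str, str]:
    """Combine the cry label and the caregiver's emotion tag into a content plan."""
    if cry_label == "hungry" and user_emotion == "fatigued":
        # The case described in the text.
        return {"audio": "calming music", "video": "video encouraging feeding"}
    # Assumed fallback, not taken from the text.
    return {"audio": "calming music", "video": "relaxation video"}
```

The resulting plan would then be turned into a prompt for the generative AI model, which produces the playable audio and video files.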

[0635] (Application Example 2)

[0636] Next, Application Example 2 will be described. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."

[0637] Conventional systems for dealing with crying babies only respond to the baby's voice and have the drawback of not being able to take into account the emotional state of the caregiver at the time. As a result, even if appropriate sounds and images are provided based on the baby's condition, they may not adequately address the reduction of the caregiver's stress. This problem needs to be solved, and the entire childcare environment needs to be improved.

[0638] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0639] In this invention, the server includes a recognition means for recognizing the emotional state of the caregiver, an adjustment means for adjusting the generated acoustic and visual information based on the emotional information obtained by the recognition means, and a sound acquisition means for collecting the baby's voice information. This makes it possible to provide appropriate acoustic and visual information that takes into account both the baby's state and the caregiver's emotions.

[0640] The "sound acquisition means" is a mechanism for collecting the sound of a baby crying; it is a device that accurately captures sound data.

[0641] An "information processing means" is a mechanism that analyzes collected audio information and extracts feature information; it is a device that extracts necessary features from audio data and obtains information useful for subsequent processing.

[0642] A "comparison means" is a mechanism for comparing extracted feature information with a recording medium to determine whether there is a match or a difference.

[0643] A "generation means" is a mechanism for creating appropriate acoustic and visual information based on the matching results, and is a device that creates content to improve the user experience.

[0644] A "playback mechanism" is a device for presenting generated audio and visual information to the user.

[0645] The "recognition means" is a mechanism for reading the emotional state of the caregiver; it is a device that analyzes facial expressions and voice tone to determine the caregiver's psychological state.

[0646] The "adjustment means" is a mechanism for optimizing the generated acoustic and visual information based on the emotional information of the caregiver.

[0647] This invention provides an implementation system that offers optimal acoustic and visual information based on the conditions of both the baby and the caregiver. This section describes the hardware and software responsible for the system's operation.

[0648] The server receives audio data acquired from the terminal to collect the baby's cries. As a means of acquiring audio, the terminal is equipped with a highly sensitive microphone that accurately captures the baby's voice while suppressing ambient noise and other sounds.

[0649] The acquired audio data is sent to a server and analyzed using information processing tools. This process utilizes Google Cloud Speech-to-Text, advanced speech recognition software, to extract characteristic audio information. The information processing tools also include functions for suppressing background noise and standardizing the audio.

[0650] The server uses the extracted feature information to compare it with data stored on the recording medium. The comparison means analyzes the patterns of the baby's cries and determines the appropriate response.

[0651] Furthermore, the server uses recognition means to analyze video data transmitted from the terminal in order to recognize the emotional state of the caregiver. IBM Watson Tone Analyzer is used for this emotion analysis. The emotional information obtained by the recognition means is used to customize the acoustic and visual information generated by the adjustment means.

[0652] The Adobe Premiere Pro API is used to generate audio and visual information, creating content tailored to the condition of the baby and the caregiver. The generated content is transmitted to the device, where the playback mechanism outputs it to the area around the baby.

[0653] As a concrete example, when a baby starts crying, the system uses a microphone to collect the sound and sends it to a server for analysis. Simultaneously, a camera captures the caregiver's facial expressions and analyzes their emotional state. Based on the analysis results, music that helps the baby relax and videos that reduce the caregiver's stress are generated.

[0654] An example of a prompt message is: "The baby is crying. Please immediately generate relaxing music and videos, and provide them, adjusting them according to the caregiver's stress level."
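
As a rough sketch of how such a prompt might be adjusted to the caregiver's stress level, the following assumes that the recognition means reports stress as a number between 0 and 1. That scale, the thresholds 0.3 and 0.7, the wording of the adjustments, and the function name `build_prompt` are all hypothetical and do not appear in the specification.

```python
BASE_PROMPT = (
    "The baby is crying. Please immediately generate relaxing music and videos"
)


def build_prompt(stress_level):
    """Append a caregiver-specific instruction to the base prompt.

    `stress_level` is assumed to be a float in [0, 1].
    """
    if not 0.0 <= stress_level <= 1.0:
        raise ValueError("stress_level must be between 0 and 1")
    if stress_level >= 0.7:  # assumed threshold
        adjustment = "The caregiver's stress level is high; use a slow tempo and soft visuals."
    elif stress_level >= 0.3:  # assumed threshold
        adjustment = "The caregiver's stress level is moderate; keep the content gentle."
    else:
        adjustment = "The caregiver's stress level is low; focus on the baby."
    return f"{BASE_PROMPT}. {adjustment}"
```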

[0655] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0656] Step 1:

[0657] The device uses a microphone to capture the baby's cries. It receives ambient sound data as input and uses noise reduction technology to clearly extract the baby's cries. This processing yields audio data in a specific frequency band.

[0658] Step 2:

[0659] The device sends the acquired audio data to the server. The server uses Google Cloud Speech-to-Text to analyze the audio data and extract feature information. This generates specific data such as the pattern and volume of the voice, and the duration of the crying.

[0660] Step 3:

[0661] The server compares the extracted feature information with the database on the recording medium. Using the feature information as input, it determines the corresponding action by comparing it with past crying patterns. This results in obtaining appropriate music and video pattern data.

[0662] Step 4:

[0663] The device uses a camera to capture the caregiver's facial expressions in real time and sends the video data to a server. The server uses IBM Watson Tone Analyzer to analyze the emotional state. It receives facial expression data as input and outputs information about the user's emotional state (e.g., stress level and fatigue level).

[0664] Step 5:

[0665] The server integrates voice and emotion analysis results and generates customized audio and visual information using the Adobe Premiere Pro API. Based on prompts, it utilizes a generative AI model to create relaxation music and video content suitable for babies and caregivers.

[0666] Step 6:

[0667] The device receives the generated audio and visual information and outputs it through its speaker and display. This provides soothing music and images around the baby, allowing the user to experience psychological relaxation.
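
Step 2 above names the volume and the duration of crying among the extracted data. One simple way to obtain such values from raw samples is sketched below, assuming the audio arrives as a list of floats in the range -1 to 1. The frame size, the activity threshold, and the function name `cry_volume_and_duration` are illustrative assumptions, not part of the described system.

```python
import math


def cry_volume_and_duration(samples, sample_rate, frame_size=400, threshold=0.1):
    """Return (mean RMS of active frames, active duration in seconds).

    A frame counts as "active" (crying) when its RMS reaches `threshold`.
    """
    frame_rms = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        frame_rms.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    active = [rms for rms in frame_rms if rms >= threshold]
    duration = len(active) * frame_size / sample_rate
    volume = sum(active) / len(active) if active else 0.0
    return volume, duration
```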

[0668] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0669] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0670] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0671] [Fourth Embodiment]

[0672] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0673] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0674] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0675] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0676] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0677] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0678] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0679] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0680] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0681] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0682] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0683] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0684] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0685] This invention is a system that soothes a crying baby with content generated by collecting and analyzing the baby's cries, and is implemented by incorporating a related program into each device.

[0686] The device is placed near the baby and collects audio data, including crying sounds, using a microphone. This collected data undergoes initial processing internally for noise reduction and audio normalization before being sent to a server.

[0687] The server analyzes the received audio data and extracts features of crying, such as pitch, intensity, and pattern. Next, these features are compared with a vast amount of past baby crying data stored in a cloud-based database. Based on the comparison results, the server identifies sound and video patterns that are considered effective in calming crying babies, drawing from similar past cases.

[0688] Based on the identified patterns, the server uses generative AI technology to generate special audio and video content tailored to the baby's preferences. This generated content is then sent back to the device.

[0689] The device plays generated sounds and images towards the baby. The parent or guardian observes the baby's reaction to the played sounds and images and determines whether they were effective. These observations are sent as feedback to the server through the device's interface.

[0690] This feedback information will be used to further improve the server's AI generation algorithm, aiming to provide a personalized calming method for each baby through repeated use. This entire process will reduce the burden on caregivers and create a more comfortable environment for babies.

[0691] The following describes the processing flow.

[0692] Step 1:

[0693] The device uses a microphone placed in the environment where the baby is present to collect the baby's cries. During the collection process, ambient sounds are also recorded and stored as data.

[0694] Step 2:

[0695] The device performs noise reduction and audio normalization on the recorded audio data. This prepares the data for analysis.

[0696] Step 3:

[0697] The terminal sends pre-processed audio data to the server. A communication network is used for transmission, providing data in real time.

[0698] Step 4:

[0699] The server analyzes the received audio data and extracts audio features (such as pitch, intensity, and temporal variation). Each feature is then quantified as digital data.

[0700] Step 5:

[0701] The server uses the extracted features to compare them with past baby data stored in the database. It searches for data with similar patterns and identifies the optimal solution.

[0702] Step 6:

[0703] The server uses database information and AI generation technology to generate sounds and images suitable for calming a crying baby. The generated results are then output as digital content.

[0704] Step 7:

[0705] The server transfers the generated audio and video data to the terminal. During this process, various protocols are used to ensure data reliability.

[0706] Step 8:

[0707] The device plays the received sound and video towards the baby. Playback is done through the speaker and display.

[0708] Step 9:

[0709] Users observe the effects of the played sounds and images on the baby, thereby evaluating the effectiveness of the sounds and images.

[0710] Step 10:

[0711] The user enters the baby's response into the device as feedback. The feedback is entered via the UI and sent to the server.

[0712] Step 11:

[0713] The server updates its AI generation algorithm based on the received feedback information, preparing to generate more effective sound and video. This improvement process contributes to improving the system's accuracy.
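
The specification does not state how the feedback in Step 11 changes the generation algorithm. As one possible reading, the sketch below keeps an effectiveness score per content pattern and moves it toward 1 or 0 after each report using an exponential moving average. The neutral starting score of 0.5, the update rate, and the names `update_score` and `best_pattern` are assumptions.

```python
def update_score(scores, pattern_id, calmed, rate=0.2):
    """Move a pattern's score toward 1.0 (baby calmed) or 0.0 (not calmed)."""
    previous = scores.get(pattern_id, 0.5)  # assumed neutral prior
    target = 1.0 if calmed else 0.0
    scores[pattern_id] = previous + rate * (target - previous)
    return scores[pattern_id]


def best_pattern(scores):
    """Return the pattern with the highest score so far."""
    return max(scores, key=scores.get)
```

With repeated use, patterns that calm a particular baby rise in score and can be preferred when content is next generated.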

[0714] (Example 1)

[0715] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0716] There is a need to reduce the burden on caregivers regarding their babies' crying and to provide the most appropriate response for each individual baby. However, conventional technology has not been sufficient to effectively analyze crying and provide appropriate content. Therefore, the challenge is to realize an efficient system that collects and analyzes babies' crying and generates and plays content optimized to soothe them.

[0717] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0718] In this invention, the server includes a sound acquisition means for collecting sound, an information processing means for analyzing the sound and extracting features, and a correspondence means for matching the features with stored information. This makes it possible to efficiently respond to a baby's crying, generate and play appropriate content, and reduce the burden on caregivers.

[0719] "Sound acquisition means" refers to devices or functions for collecting sounds from the surrounding environment.

[0720] "Information processing means" refers to the technologies and processes used to analyze collected sounds and extract features from them.

[0721] "Correspondence means" refers to a function that checks the extracted features against the accumulated information and compares them.

[0722] "Information generation means" refers to technology that generates new content such as audio and video based on the results of the correspondence.

[0723] "Information presentation means" refers to devices or functions that allow users to view generated content.

[0724] The "function to remove noise and shape sound" refers to technology that removes unwanted noise from audio and prepares it as accurate audio data.

[0725] "Evaluation information" refers to reactions and feedback data obtained from users, and is used to improve generation technology.

[0726] This invention relates to a system that collects and analyzes a baby's cries and plays back generated content to the baby in order to respond effectively to the crying. The system consists of two main components: a terminal and a server.

[0727] The device is placed near the baby and collects ambient sounds using voice acquisition technology. In this process, it is crucial to use a microphone to capture voices with high sensitivity and accuracy. The collected voice data is initially processed within the device by a voice processing library that performs noise reduction and voice normalization. This allows for accurate capture of crying even in noisy environments.
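
The voice processing library is not named in the specification. Purely as an illustration of the two operations it is said to perform, the sketch below uses a sample-wise noise gate and peak normalization on a list of float samples. Both techniques, the gate threshold, and the function names are stand-ins chosen for brevity; practical noise reduction would likely be considerably more elaborate.

```python
def noise_gate(samples, threshold=0.02):
    """Zero out samples whose magnitude is below `threshold`."""
    return [x if abs(x) >= threshold else 0.0 for x in samples]


def peak_normalize(samples, target_peak=1.0):
    """Scale the signal so that its largest magnitude equals `target_peak`."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent input: nothing to scale
    gain = target_peak / peak
    return [x * gain for x in samples]


def preprocess(samples):
    """Apply the noise gate first, then normalize the remaining signal."""
    return peak_normalize(noise_gate(samples))
```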

[0728] The initially processed audio data is transmitted to the server via the network. The server analyzes this audio using information processing tools and extracts features. The analysis software used is a speech recognition engine, which performs time-frequency analysis of the audio waveform. This clearly extracts features such as the pitch, intensity, and pattern of crying sounds.
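
A time-frequency analysis of this kind can be sketched with a naive short-time discrete Fourier transform: the waveform is cut into frames, and for each frame the dominant frequency (a rough proxy for pitch) and the RMS level (intensity) are computed. The frame size, the absence of a window function, and the names `frame_features` and `analyze` are simplifying assumptions; the speech recognition engine actually used is not specified.

```python
import cmath
import math


def frame_features(frame, sample_rate):
    """Return (dominant frequency in Hz, RMS intensity) for one frame."""
    n = len(frame)
    rms = math.sqrt(sum(x * x for x in frame) / n)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip DC, stay below the Nyquist bin
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)
        )
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n, rms


def analyze(samples, sample_rate, frame_size=256):
    """Return per-frame (frequency, intensity) pairs; their sequence is the pattern."""
    return [
        frame_features(samples[s:s + frame_size], sample_rate)
        for s in range(0, len(samples) - frame_size + 1, frame_size)
    ]
```

The naive transform costs quadratic time per frame and is shown only for clarity; an FFT would normally replace it.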

[0729] After extracting the features, the server compares them with a cloud-based database through the correspondence means. Here, an SQL database system is used to identify similar cases by matching the features against a vast amount of historical data. Based on the effective patterns found through this matching, the information generation means, which uses generative AI technology, generates appropriate audio and video.
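
The following sketch shows how matching against an SQL database might look, using an in-memory SQLite database as a stand-in for the cloud-based system. The table name `past_cases`, its columns, the stored rows, and the squared-distance ordering are all hypothetical; the specification states only that an SQL database system is used.

```python
import sqlite3


def open_case_db():
    """Create an in-memory database with a few illustrative past cases."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE past_cases (pitch REAL, intensity REAL, content_pattern TEXT)"
    )
    conn.executemany(
        "INSERT INTO past_cases VALUES (?, ?, ?)",
        [
            (450.0, 0.8, "lullaby_with_mobile_video"),
            (520.0, 0.6, "white_noise_with_dim_video"),
        ],
    )
    return conn


def find_content_pattern(conn, pitch, intensity):
    """Return the content pattern of the stored case nearest to the features."""
    row = conn.execute(
        """
        SELECT content_pattern FROM past_cases
        ORDER BY (pitch - ?) * (pitch - ?) + (intensity - ?) * (intensity - ?)
        LIMIT 1
        """,
        (pitch, pitch, intensity, intensity),
    ).fetchone()
    return row[0] if row else None
```

Ordering by distance scans every row, which is acceptable for a sketch but would presumably need an index or an approximate search at the data volumes the text describes.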

[0730] The generated content is sent back to the device and played back to the baby via the information presentation means. Here, speakers and displays are used to provide sounds and images that capture the baby's attention. For example, if the baby is calmed by classical music, the prompt "Generate a calming melody based on classical music" can be entered into the generative AI.

[0731] Users observe their baby's reaction to the generated content and send feedback to the server via the device's interface. This feedback information is used as evaluation data to improve the generation algorithm on the server, enabling more personalized responses. In this way, an optimal and comfortable childcare environment can be provided for each individual baby.

[0732] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0733] Step 1:

[0734] The device is placed near the baby and collects ambient sounds using a microphone. The input is ambient noise, including the baby's crying. The device uses an audio processing library to denoise and normalize this audio data. This results in noise-removed and normalized crying data as output.

[0735] Step 2:

[0736] This initially processed audio data is sent to a server via the internet. The server receives this audio data as input and performs analysis using a speech recognition engine. During the analysis, features of the crying sound, specifically pitch, intensity, and pattern, are extracted. The output here is the extracted features.

[0737] Step 3:

[0738] The server matches the extracted features against a database in the cloud. The input is the extracted features, which are compared and matched against similar historical data using an SQL database system. This process identifies valid audio and video patterns from similar cases. The output is the identified audio and video patterns.

[0739] Step 4:

[0740] Based on the matching results, the server uses a generative AI model to generate new audio and video. The input is a specific pattern, which the generative AI model operates on to create special content suitable for babies. The output is the generated audio and video content.

[0741] Step 5:

[0742] The generated content is sent back to the device and played there. The device uses its speaker and display to present this content to the baby. The output is the audio and video being played, and the goal is to soothe the crying baby.

[0743] Step 6:

[0744] The user observes the baby's reaction to the played content and sends the results back to the server as feedback through the device's interface. The input is the baby's reaction information, which the server uses to improve its generative AI algorithm. The output is the improved generation algorithm, which can be used to generate more effective content.

[0745] (Application Example 1)

[0746] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0747] In modern society, reducing the burden of childcare is a crucial issue. In particular, effective and individualized approaches are needed to address a baby's crying, but traditional methods have their limitations. Furthermore, a baby's persistent crying is a source of parental stress, highlighting the need for improvements to the childcare environment. To address these challenges, a system is needed that can accurately analyze a baby's crying and provide appropriate responses instantly.

[0748] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0749] In this invention, the server includes an acoustic input means for collecting baby's voice information, an information processing means for analyzing the voice information and extracting attributes, and a matching means for matching the attributes with an information recording medium. This makes it possible to quickly analyze a baby's cries and respond appropriately to each individual baby.

[0750] "Acoustic input means" refers to a device used to collect sound information from the baby's surroundings, and includes microphones and sound sensors.

[0751] The "information processing means" is a device that extracts attributes from the collected audio information using digital signal processing technology.

[0752] The "matching means" is a device for comparing the extracted attributes with the existing information recording medium.

[0753] The "generation means" is a device that generates optimal audio or video based on the matching results using generative AI technology.

[0754] "Presentation means" refers to a device that allows a baby to view generated audio and video, and includes speakers and displays.

[0755] A "control system" is a system for managing this series of processes and controlling the coordinated operation of each device.

[0756] An "automated machine" is a device such as a robot that automatically generates and presents sounds and images based on a baby's cries.

[0757] "Noise reduction" is a technique for removing unwanted noise from audio information and extracting clear speech.

[0758] "Speech standardization" is the process of organizing speech information collected under different circumstances and environments into a consistent format.

[0759] "Evaluation information" refers to reaction data and feedback collected from users, which is used to improve the generation process.

[0760] "Generation procedure" refers to the steps of the process or algorithm used to generate audio or video.

[0761] The system that implements this application collects and analyzes a baby's cries, and generates and presents appropriate content to soothe the baby. The entire system operates through the cooperation of three parties: the server, the terminal, and the user.

[0762] The server plays a central role in receiving and processing the baby's voice information. The collected voice information undergoes noise reduction and voice standardization on the server, and the information processing means extracts voice attributes. The extracted attributes are then matched against the information recording medium to generate optimal audio and video. A generative AI model is used as the generation means, creating content tailored to the baby's preferences based on specific prompts.

[0763] The device is placed around the baby and collects crying sounds using an acoustic input device. This device transmits the collected audio information to a server, which receives the generated content and presents it to the baby. Speakers and displays are used as presentation methods.

[0764] Users observe the baby's reactions via their device and send evaluation information to the server. This evaluation information is used to improve the generation process.

[0765] For example, if a baby starts crying in the middle of the night, the device detects the crying and sends it to a server. A generative AI model then generates and plays calming melodies or videos with nature sounds. An example of a prompt would be, "Analyze the baby's crying and generate a calming sound with the following characteristics: gentle tone, calming melody, nature sounds."

[0766] This system provides content optimized for each individual baby, reducing the burden on caregivers and enabling the creation of a more comfortable childcare environment.

[0767] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0768] Step 1:

[0769] The device collects crying sounds using an acoustic input device placed near the baby. The input at this time is audio information including the baby's crying. The collected audio information is denoised and standardized within the device. The output is noise-free and standardized audio data.

[0770] Step 2:

[0771] The terminal sends the processed audio data to the server. The input is the audio data obtained by the terminal, and the same audio data is sent to the server. By the time the audio data arrives at the server, it is ready for analysis.

[0772] Step 3:

[0773] The server analyzes the received audio data using information processing tools and extracts audio attributes. The input is audio data, and the output is audio attributes such as pitch, intensity, and pattern. The server uses digital signal processing technology to detect audio attributes in detail.

[0774] Step 4:

[0775] The server compares the extracted audio attributes with the information recording medium. The input is the audio attributes, and the output is the result of matching them with similar attributes referenced from the information recording medium. Using this result, the server determines what kind of content is appropriate.

[0776] Step 5:

[0777] The server generates optimal audio and video content using a generative AI model based on the matching results. The input is the matching results, and prompts, including specific prompts that reflect concrete examples, are used to generate content suitable for babies. The output is the generated audio and video.

[0778] Step 6:

[0779] The server sends the generated audio and video to the terminal. The input is the generated content, and the output is also that content. The transmitted content is ready for playback on the terminal.

[0780] Step 7:

[0781] The device plays back the received audio and video to the baby using the presentation means. The input is the content data, and the output is the sound and images that the baby perceives and reacts to. The speaker and display perform the actual output.

[0782] Step 8:

[0783] The user observes the baby's reactions, collects evaluation information, and feeds it back to the server. The input is the baby's reaction data, and the output is evaluation information. The server uses this information to obtain data to improve the generation procedure.
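
The format of the evaluation information sent in Step 8 is not specified. A minimal sketch of one possible encoding as JSON is given below; the field names `content_id`, `calmed`, and `seconds_to_calm`, and both function names, are hypothetical.

```python
import json

# Assumed fields of one evaluation record and their expected types.
REQUIRED_FIELDS = {"content_id": str, "calmed": bool, "seconds_to_calm": float}


def encode_evaluation(content_id, calmed, seconds_to_calm):
    """Serialize one evaluation record on the terminal side."""
    record = {
        "content_id": str(content_id),
        "calmed": bool(calmed),
        "seconds_to_calm": float(seconds_to_calm),
    }
    return json.dumps(record, sort_keys=True)


def decode_evaluation(payload):
    """Parse and validate a record on the server side."""
    data = json.loads(payload)
    for name, kind in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), kind):
            raise ValueError(f"missing or invalid field: {name}")
    return data
```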

[0784] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0785] This invention provides a system that collects and analyzes baby cries, generating and outputting appropriate sounds and images based on the analysis results, combined with an emotion engine that recognizes the user's emotions. This enables optimal responses that take into account not only the baby's emotional state but also the caregiver's emotional state.

[0786] The device uses a microphone to acquire audio data in order to properly capture the baby's cries. In addition, it simultaneously collects the user's voice and facial expression data using a camera.

[0787] The audio data is sent to a server where the characteristics of crying sounds are extracted. These characteristics are compared with past cases to formulate an appropriate response. The server also uses an emotion engine to analyze the user's emotions and identify the level of stress and changes in mood.

[0788] The server generates sounds and images optimized not only for the baby's condition but also for the user's emotions. For example, if the server determines the user is fatigued, it will select sounds that promote relaxation. These results are then sent to the device.
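
The selection rule in this paragraph can be sketched as a lookup from the user's estimated emotion to content adjustments. Only the fatigue case comes from the text; the other entries, the default, the dictionary layout, and the function name `plan_content` are assumptions for illustration.

```python
# Hypothetical mapping from the user's emotion to content adjustments.
ADJUSTMENTS = {
    "fatigue": {"audio": "sounds that promote relaxation", "video": "dim, slow visuals"},
    "stress": {"audio": "slow, quiet music", "video": "calm nature scenes"},
}
DEFAULT_ADJUSTMENT = {"audio": "gentle lullaby", "video": "soft moving shapes"}


def plan_content(cry_label, user_emotion):
    """Combine the baby's state and the user's emotion into a generation plan."""
    plan = {"baby_state": cry_label, "user_emotion": user_emotion}
    plan.update(ADJUSTMENTS.get(user_emotion, DEFAULT_ADJUSTMENT))
    return plan
```

The resulting plan could then be turned into a prompt for the generative AI model, although the specification leaves that step open.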

[0789] The device plays the audio and video near the baby so that they can take effect. The user observes the effect after playback and provides feedback through the device's interface. This feedback is returned to the server and used to improve the generation algorithm.

[0790] For example, when a parent is dealing with their baby's nighttime crying, the system can analyze the baby's cries and the parent's stress level, and then provide relaxing videos accompanied by calming music. In this way, the entire system can reduce the psychological burden on both the baby and the caregiver, and optimize the childcare environment.

[0791] The following describes the processing flow.

[0792] Step 1:

[0793] The device uses a microphone to collect the baby's cries and simultaneously records the user's facial expressions and voice using a camera. This data is stored in both audio and video formats.

[0794] Step 2:

[0795] The device performs noise reduction and normalization on the collected audio data to generate clean data. This data is prepared for analysis while retaining the characteristics of a baby's crying.
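
The preprocessing in Step 2 can be illustrated with a minimal sketch. The specification does not name a particular noise reduction or normalization technique, so the simple amplitude gate and peak normalization below are assumptions chosen for illustration, as are the function names and the default threshold.

```python
def noise_gate(samples, threshold):
    """Zero out samples whose magnitude is below the threshold (a crude noise gate)."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def peak_normalize(samples, target_peak=1.0):
    """Scale samples so that the largest magnitude equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def preprocess(samples, gate_threshold=0.02, target_peak=1.0):
    """Apply the gate first, then normalize the remaining signal."""
    return peak_normalize(noise_gate(samples, gate_threshold), target_peak)
```

Gating before normalizing keeps low-level background noise from being amplified together with the cry. A real device would more likely use spectral methods, which this sketch does not attempt.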

[0796] Step 3:

[0797] The device sends pre-processed baby voice data and user video data to the server. The data is transmitted using a secure communication protocol.

[0798] Step 4:

[0799] When the server receives baby voice data, it uses a voice analysis algorithm to extract features of the crying. These features include elements such as pitch, intensity, and rhythm.
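
As a rough illustration of the features named in Step 4, the following sketch computes one simple stand-in for each of pitch, intensity, and rhythm. The specification does not disclose the voice analysis algorithm; the zero-crossing pitch estimate, the RMS intensity, the burst count, and all function names and default parameters are assumptions.

```python
import math


def rms_intensity(samples):
    """Root-mean-square amplitude, used here as the intensity feature."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_pitch(samples, sample_rate):
    """Rough pitch estimate in Hz: one full cycle produces two sign changes."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0.0) != (b < 0.0)
    )
    duration = len(samples) / sample_rate
    return crossings / (2.0 * duration) if duration > 0 else 0.0


def burst_count(samples, frame_size, threshold):
    """Rhythm proxy: number of transitions from a quiet frame to a loud frame."""
    bursts = 0
    was_loud = False
    for start in range(0, len(samples), frame_size):
        loud = rms_intensity(samples[start:start + frame_size]) >= threshold
        if loud and not was_loud:
            bursts += 1
        was_loud = loud
    return bursts


def extract_features(samples, sample_rate, frame_size=400, threshold=0.1):
    """Bundle the three features into one dictionary."""
    return {
        "pitch_hz": zero_crossing_pitch(samples, sample_rate),
        "intensity": rms_intensity(samples),
        "bursts": burst_count(samples, frame_size, threshold),
    }
```

Zero-crossing counting is only reliable for clean, roughly periodic signals, so it should be read as a placeholder for whatever pitch tracker an actual implementation would use.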

[0800] Step 5:

[0801] The server analyzes the user's video data and uses an emotion engine to identify the user's emotional state (e.g., joy, fatigue, stress). This information is used to adjust the response to the baby.

[0802] Step 6:

[0803] The server considers the baby's features and the user's emotional state and matches them against a database. This allows it to compare with similar past cases and plan the generation of appropriate sounds and images.

[0804] Step 7:

[0805] The server uses generative AI technology to generate sounds and images optimized to soothe a baby. This generation reflects the user's categorized emotional information. For example, if the user is fatigued, calming music and relaxation videos will be generated.

[0806] Step 8:

[0807] The server transmits the generated audio and video to the device. This transfer is also performed over a secure communication protocol.

[0808] Step 9:

[0809] The device plays back received sounds and images towards the baby. Playback is performed using a dedicated audio system and display, encouraging the baby to respond.

[0810] Step 10:

[0811] The user observes the baby's reaction during and after playback and evaluates its effectiveness. The evaluation results are entered into the device as feedback.

[0812] Step 11:

[0813] The device sends user-entered feedback to the server. This feedback is used to further improve the generation algorithm and contribute to improving the quality of future deliverables.

[0814] (Example 2)

[0815] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0816] In childcare, it is necessary to comprehensively consider both the baby's cries and emotional state, as well as the caregiver's emotional state. However, conventional techniques only address the baby's condition, making it difficult to alleviate the caregiver's mental burden. Therefore, there is a need for a system that allows for appropriate intervention considering the conditions of both the baby and the caregiver.

[0817] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0818] In this invention, the server includes acoustic data collection means for collecting acoustic data, information processing means for analyzing the acoustic data and extracting features, and video analysis means for collecting user video data and analyzing emotions. This makes it possible to generate appropriate sounds and videos that take into account the state of both the baby and the caregiver.

[0819] "Sound acquisition means" refers to devices or systems for acquiring baby cries and other sound information.

[0820] "Information processing means" refers to a device or system that has the function of analyzing acquired acoustic data and extracting characteristic information from it.

[0821] The "recording section" is a database that stores past acoustic data and feature quantities, and holds information that should be cross-referenced.

[0822] A "matching means" is a system that has the function of comparing extracted feature quantities with data stored in the recording unit and evaluating the match or similarity.

[0823] "Video analysis means" refers to a system that detects the user's facial expressions and movements to determine their emotional state.

[0824] "Generation means" refers to technologies and systems for creating appropriate audio and video content based on analysis results.

[0825] "Means of expression" refers to a device or apparatus for providing generated audio and video content to users or babies.

[0826] "Noise reduction" is the process of removing unwanted background noise and other sounds from audio data to obtain clear audio information.

[0827] "Standardization" is the process of adjusting the volume and quality of audio data to a certain standard.

[0828] "Evaluation information" refers to feedback provided by users, including their reactions to the generated content and information on areas for improvement.

[0829] "Improving the generation method" refers to the process of optimizing the audio and video generation algorithms and systems based on evaluation information.

[0830] This invention is a system for optimizing the childcare environment, providing appropriate sound and visuals by analyzing the baby's cries and the caregiver's emotional state. Its main components include a terminal, a server, and a user.

[0831] The device is equipped with a microphone to collect the baby's cries and a camera to record the caregiver's facial expressions. This device is designed to acquire and process crying and video data in real time. The acquired data is encrypted and transmitted to a server via the internet.

[0832] The server has an information processing device that extracts feature quantities from the acoustic data and compares them with the database in the recording unit. Furthermore, the server analyzes the caregiver's video data and evaluates their emotional state using an emotion engine. This allows the baby's condition and the caregiver's emotions to be considered simultaneously, and an AI model is used to create optimal acoustic and video content.

[0833] The generated content is sent from the server to the device and played near the baby. This aims to soothe the baby and reduce stress for caregivers by providing appropriate media.

[0834] For example, if a baby starts crying in the middle of the night, the system sends crying data to a server and also analyzes the caregiver's stress level. As a result, if the baby is hungry and the caregiver is exhausted, calming music and videos encouraging feeding will be played on the device.

[0835] Examples of prompts for a generative AI model include the following:

[0836] "Please tell me how to develop an algorithm that automatically provides appropriate responses based on a baby's crying."

[0837] "Please provide information on methods for creating content that takes into account the emotional state of caregivers."

[0838] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0839] Step 1:

[0840] The device uses a microphone to collect acoustic data of the baby crying. This acoustic data is acquired as a raw audio signal. The collected acoustic data is temporarily stored in the device's internal memory and undergoes preprocessing such as audio cleanup and noise reduction. The output is noise-reduced, volume-normalized acoustic data.

[0841] Step 2:

[0842] The device uses its camera to collect user facial expression data. This yields still images or video data from which the user's current emotional state can be understood. The facial expression data is classified into basic emotion labels (e.g., joy, surprise, sadness) using a facial expression recognition algorithm. The output of this process is tag information indicating the user's emotional state.

[0843] Step 3:

[0844] The device sends pre-processed acoustic data and user emotion state tags to the server. The data is securely transferred to the server via the internet. When this data arrives at the server as input, the server begins the process of extracting features from the acoustic data. The output is stored on the server as a feature vector.

[0845] Step 4:

[0846] The server matches the extracted feature vectors with the recording unit's database. A machine learning algorithm is used for this matching to identify the closest crying pattern from past cases. The output of this matching process is data indicating the type of crying, and is labeled with classifications such as "hungry" or "uncomfortable."
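
The matching in Step 4 can be sketched as a nearest-neighbor lookup. The specification says only that a machine learning algorithm identifies the closest past crying pattern, so the scaled Euclidean distance used here is a substitute chosen for brevity. The stored cases, the feature order (pitch in Hz, intensity, burst count), and the scale factors are hypothetical values.

```python
import math

# Hypothetical past cases: (feature vector, label). Values are illustrative only.
PAST_CASES = [
    ((450.0, 0.80, 6.0), "hungry"),
    ((470.0, 0.85, 7.0), "hungry"),
    ((380.0, 0.40, 2.0), "uncomfortable"),
    ((360.0, 0.35, 3.0), "uncomfortable"),
]


def scaled_distance(a, b, scales):
    """Euclidean distance after dividing each feature difference by its scale."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)))


def classify_cry(features, cases=PAST_CASES, scales=(100.0, 0.5, 5.0)):
    """Return (label, distance) of the stored case closest to the input features."""
    best_vector, best_label = min(
        cases, key=lambda case: scaled_distance(features, case[0], scales)
    )
    return best_label, scaled_distance(features, best_vector, scales)
```

The per-feature scales keep pitch, which is measured in hundreds of hertz, from dominating intensity, which lies between 0 and 1. The returned distance could be used to reject matches that are too far from any stored case.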

[0847] Step 5:

[0848] The server integrates classification labels for audio data with the user's emotional state and uses a generative AI model to generate appropriate audio and video content. For example, if a baby's crying indicates "hunger" and the user is "fatigued," the server will generate calming music and a video encouraging breastfeeding. The output consists of playable audio and video files.
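
One way to picture the integration in Step 5 is as assembling a prompt from the cry classification label and the caregiver's emotion tag. The sketch below is an assumption about how such a prompt might be built; the lookup table, the default hint, and the function name are hypothetical, and no call to an actual generative AI model is shown.

```python
# Hypothetical mapping from (cry label, caregiver emotion) to a content hint.
# Only the first entry reflects the example given in the text.
CONTENT_HINTS = {
    ("hungry", "fatigued"): "calming music and a video encouraging feeding",
}

# Fallback used when no specific combination is registered (an assumption).
DEFAULT_HINT = "relaxing music and a soothing video"


def build_prompt(cry_label, caregiver_emotion):
    """Combine both analysis results into a single text prompt."""
    hint = CONTENT_HINTS.get((cry_label, caregiver_emotion), DEFAULT_HINT)
    return (
        f"The baby's cry is classified as '{cry_label}' and the caregiver "
        f"appears '{caregiver_emotion}'. Please generate {hint}."
    )
```

For example, `build_prompt("hungry", "fatigued")` produces a prompt requesting calming music and a video encouraging feeding, while unregistered combinations fall back to the default hint.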

[0849] Step 6:

[0850] The server sends the generated audio and video files to the device. The device receives these files and plays the audio and video near the baby. The user can review the content and adjust the volume as needed during playback.

[0851] Step 7:

[0852] The user observes the baby's reaction to the provided content and provides feedback through the device's interface. This input data is sent back to the server and used to improve the algorithm of the generative AI model. The feedback thus contributes to improving the accuracy of future content generation.
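
The feedback loop in Step 7 is described only at a high level. As one possible reading, the sketch below accumulates user ratings per combination of cry label, caregiver emotion, and content, and then prefers the content with the highest mean rating. The class, its method names, the numeric rating scale, and the content identifiers are all hypothetical; the specification does not state how the feedback actually updates the model.

```python
from collections import defaultdict


class FeedbackStore:
    """Accumulates ratings keyed by (cry label, caregiver emotion, content id)."""

    def __init__(self):
        # Each key maps to [sum of ratings, number of ratings].
        self._totals = defaultdict(lambda: [0.0, 0])

    def record(self, cry_label, emotion, content_id, rating):
        entry = self._totals[(cry_label, emotion, content_id)]
        entry[0] += rating
        entry[1] += 1

    def mean_rating(self, cry_label, emotion, content_id):
        key = (cry_label, emotion, content_id)
        if key not in self._totals:
            return None  # no feedback recorded for this combination
        total, count = self._totals[key]
        return total / count

    def best_content(self, cry_label, emotion):
        """Return the content id with the highest mean rating, or None."""
        candidates = {
            key[2]: total / count
            for key, (total, count) in self._totals.items()
            if key[:2] == (cry_label, emotion)
        }
        return max(candidates, key=candidates.get) if candidates else None
```

A store like this could bias which content is generated or selected the next time the same situation is detected.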

[0853] (Application Example 2)

[0854] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0855] Conventional systems for dealing with crying babies only respond to the baby's voice and have the drawback of not being able to take into account the emotional state of the caregiver at the time. As a result, even if appropriate sounds and images are provided based on the baby's condition, they may not adequately address the reduction of the caregiver's stress. This problem needs to be solved, and the entire childcare environment needs to be improved.

[0856] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0857] In this invention, the server includes a recognition means for recognizing the emotional state of the caregiver, an adjustment means for adjusting the generated acoustic and visual information based on the emotional information obtained by the recognition means, and a sound acquisition means for collecting the baby's voice information. This makes it possible to provide appropriate acoustic and visual information that takes into account both the baby's state and the caregiver's emotions.

[0858] A "sound acquisition device" is a mechanism used to collect the sounds of a baby crying, and is a device that has the function of accurately capturing sound data.

[0859] An "information processing means" is a mechanism that analyzes collected audio information and extracts feature information; it is a device that extracts necessary features from audio data and obtains information useful for subsequent processing.

[0860] A "comparison means" is a mechanism for comparing extracted feature information with a recording medium to determine whether there is a match or a difference.

[0861] A "generation means" is a mechanism for creating appropriate acoustic and visual information based on the matching results, and is a device that creates content to improve the user experience.

[0862] A "playback mechanism" is a device for presenting generated audio and visual information to the user.

[0863] A "recognition device" is a mechanism designed to read the emotional state of a caregiver, and it is a device that analyzes facial expressions and voice tone to determine their psychological state.

[0864] A "regulation mechanism" is a system for optimizing generated acoustic and visual information based on the emotional information of the caregiver.

[0865] This invention provides an implementation system that offers optimal acoustic and visual information based on the conditions of both the baby and the caregiver. This section describes the hardware and software responsible for the system's operation.

[0866] To collect the baby's cries, the server receives audio data acquired by the terminal. As the sound acquisition means, the terminal is equipped with a highly sensitive microphone that accurately captures the baby's voice while suppressing ambient noise and other sounds.

[0867] The acquired audio data is sent to a server and analyzed using information processing tools. This process utilizes Google Cloud Speech-to-Text, advanced speech recognition software, to extract characteristic audio information. The information processing tools also include functions for suppressing background noise and standardizing the audio.

[0868] The server uses the extracted feature information to compare it with data stored on the recording medium. The comparison means analyzes the patterns of the baby's cries and determines the appropriate response.

[0869] Furthermore, the server uses recognition means to analyze video data transmitted from the terminal in order to recognize the emotional state of the caregiver. IBM Watson Tone Analyzer is used for this emotion analysis. The emotional information obtained by the recognition means is used to customize the acoustic and visual information generated by the adjustment means.

[0870] The Adobe Premiere Pro API is used to generate audio and visual information, creating content tailored to the conditions of the baby and the caregiver. The generated content is transmitted to the device and output by the playback means to the area around the baby.

[0871] As a concrete example, when a baby starts crying, the system uses a microphone to collect the sound and sends it to a server for analysis. Simultaneously, a camera captures the caregiver's facial expressions and analyzes their emotional state. Based on the analysis results, music that helps the baby relax and videos that reduce the caregiver's stress are generated.

[0872] An example of a prompt message is: "The baby is crying. Please immediately generate relaxing music and videos, and provide them, adjusting them according to the caregiver's stress level."

[0873] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0874] Step 1:

[0875] The device uses a microphone to capture the baby's cries. It receives ambient sound data as input and uses noise reduction technology to clearly extract the baby's cries. This processing yields audio data in a specific frequency band.

[0876] Step 2:

[0877] The device sends the acquired audio data to the server. The server uses Google Cloud Speech-to-Text to analyze the audio data and extract feature information. This generates specific data such as the pattern and volume of the voice, and the duration of the crying.

[0878] Step 3:

[0879] The server compares the extracted feature information with the database on the recording medium. Using the feature information as input, it determines the corresponding action by comparing it with past crying patterns. This results in obtaining appropriate music and video pattern data.

[0880] Step 4:

[0881] The device uses a camera to capture the caregiver's facial expressions in real time and sends the video data to a server. The server uses IBM Watson Tone Analyzer to analyze the emotional state. It receives facial expression data as input and outputs information about the user's emotional state (e.g., stress level and fatigue level).

[0882] Step 5:

[0883] The server integrates voice and emotion analysis results and generates customized audio and visual information using the Adobe Premiere Pro API. Based on prompts, it utilizes a generative AI model to create relaxation music and video content suitable for babies and caregivers.

[0884] Step 6:

[0885] The device receives the generated audio and visual information and outputs it through its speaker and display. This provides soothing music and images around the baby, allowing the user to experience psychological relaxation.

[0886] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0887] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0888] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0889] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0890] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0891] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0892] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).

[0893] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
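
The idea that pleasure and discomfort follow from how far a balance is from its ideal can be made concrete with a small sketch. The linear mapping, the clamping to the range from -1 to 1, the tolerance parameter, and the function names below are assumptions for illustration; the text does not specify any formula.

```python
def balance_valence(current, ideal, tolerance):
    """Map one balance to a valence in [-1, 1].

    Returns 1.0 at the ideal value, 0.0 when the deviation equals the
    tolerance, and is clamped at -1.0 for larger deviations.
    """
    deviation = abs(current - ideal) / tolerance
    return max(-1.0, 1.0 - deviation)


def overall_valence(balances):
    """Average the valence over several (current, ideal, tolerance) tuples."""
    scores = [balance_valence(c, i, t) for c, i, t in balances]
    return sum(scores) / len(scores) if scores else 0.0
```

For a robot, one tuple might describe battery level and another posture: values near 1 would correspond to "pleasure" and values near -1 to "displeasure".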

[0894] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0895] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
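
Once the neural network has produced an emotion value for each emotion on the map, determining the user's emotion can be as simple as picking the largest value. The sketch below assumes the network output is available as a dictionary of emotion names to values. The function names, the tolerance, and the sample values are hypothetical, and the network itself is not modeled.

```python
def dominant_emotion(emotion_values):
    """Return the emotion name with the highest emotion value."""
    return max(emotion_values, key=emotion_values.get)


def similar_emotions(emotion_values, tolerance=0.1):
    """Return, sorted by name, the emotions whose value is within
    tolerance of the highest value."""
    top = emotion_values[dominant_emotion(emotion_values)]
    return sorted(
        name for name, value in emotion_values.items() if top - value <= tolerance
    )
```

With values such as `{"reassured": 0.82, "calm": 0.80, "confident": 0.78, "anxious": 0.05}`, the dominant emotion is "reassured", and the three neighboring emotions are reported together because their values are close, which mirrors the behavior described for emotions located near one another on the map.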

[0896] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0897] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0898] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0899] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0900] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0901] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0902] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0903] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0904] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0905] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of those aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements may be added, or elements may be replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable implementation have been omitted from the descriptions and illustrations presented above.

[0906] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted as being incorporated by reference.

[0907] The following is further disclosed regarding the embodiments described above.

[0908] (Claim 1)

[0909] A voice input means for collecting baby voice data,

[0910] A data processing means for extracting feature quantities for analyzing the aforementioned audio data,

[0911] A matching means for performing a match with a database based on the aforementioned feature quantities,

[0912] A generation means for generating appropriate sound and video based on the aforementioned matching results,

[0913] A playback means for outputting the generated sound and video,

[0914] A system that includes this.

[0915] (Claim 2)

[0916] The system according to claim 1, wherein the data processing means further comprises a function for noise reduction and speech normalization.

[0917] (Claim 3)

[0918] The system according to claim 1, wherein the generation means further comprises a function to improve the generation algorithm based on feedback information obtained from the user.

[0919] "Example 1"

[0920] (Claim 1)

[0921] A means of acquiring sound for collecting sound,

[0922] Information processing means for analyzing the aforementioned sound and extracting its features,

[0923] A means for corresponding with stored information based on the aforementioned characteristics,

[0924] Information generation means for generating information based on the aforementioned correspondence results,

[0925] Information presentation means for outputting the generated information,

[0926] A system that includes this.

[0927] (Claim 2)

[0928] The system according to claim 1, wherein the information processing means further comprises a function to remove noise from sound and shape the sound.

[0929] (Claim 3)

[0930] The system according to claim 1, wherein the information generation means further comprises a function to improve the generation technology based on evaluation information obtained from the user.

[0931] "Application Example 1"

[0932] (Claim 1)

[0933] An acoustic input means for collecting baby's voice information,

[0934] Information processing means for extracting attributes for analyzing the aforementioned audio information,

[0935] A matching means for performing a match with an information recording medium based on the aforementioned attributes,

[0936] A generation means for generating appropriate audio and video based on the aforementioned matching results,

[0937] A presentation means for outputting the generated audio and video,

[0938] An automated machine that includes control mechanisms for managing the entire system.

[0939] (Claim 2)

[0940] The automatic machine according to claim 1, wherein the information processing means further comprises a function for noise reduction and speech standardization.

[0941] (Claim 3)

[0942] The automatic machine according to claim 1, wherein the generation means further comprises a function to improve the generation procedure based on evaluation information obtained from the user.

[0943] "Example 2 of combining an emotion engine"

[0944] (Claim 1)

[0945] Acoustic acquisition means for collecting acoustic data,

[0946] Information processing means for analyzing the aforementioned acoustic data and extracting feature quantities,

[0947] A matching means for performing a comparison with the recording unit based on the aforementioned feature quantities,

[0948] A video analysis means for collecting user video data and analyzing emotions,

[0949] A generation means for generating appropriate sound and video based on the aforementioned matching results and user emotion analysis results,

[0950] A means for outputting the generated sound and image,

[0951] A system that includes this.

[0952] (Claim 2)

[0953] The system according to claim 1, wherein the information processing means further comprises a function for noise reduction and acoustic standardization.

[0954] (Claim 3)

[0955] The system according to claim 1, wherein the generation means further comprises a function to improve the generation method based on evaluation information obtained from the user.
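In Example 2 the generation step depends on two inputs: the matching result and the result of analyzing the user's emotions. The sketch below shows one way such a combination could look. It is an illustration only; the category labels, the content names, the `"stressed"` emotion label, the volume values, and the names `OutputPlan`, `BASE_CONTENT`, and `plan_output` are assumptions that do not come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputPlan:
    audio: str
    video: str
    volume: float  # 0.0 (silent) to 1.0 (full)


# Hypothetical mapping from a matched category to base (audio, video) content.
BASE_CONTENT = {
    "hungry": ("soft_humming", "slow_mobile"),
    "sleepy": ("white_noise", "dim_stars"),
}

DEFAULT_CONTENT = ("lullaby", "calm_colors")


def plan_output(matched_label: str, user_emotion: str) -> OutputPlan:
    """Choose content from the match, then adjust it for the user's emotion."""
    audio, video = BASE_CONTENT.get(matched_label, DEFAULT_CONTENT)
    volume = 0.6
    if user_emotion == "stressed":
        # Assumed policy: present the same content more quietly.
        volume = 0.4
    return OutputPlan(audio, video, volume)


if __name__ == "__main__":
    print(plan_output("sleepy", "stressed"))
```

Keeping the emotion adjustment separate from content selection means either input can change without touching the other's logic.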

[0956] "Application example 2 when combining with an emotional engine"

[0957] (Claim 1)

[0958] A sound information acquisition means for collecting a baby's voice information,

[0959] Information processing means for extracting feature information for analyzing the aforementioned audio information,

[0960] A matching means for performing a comparison with a recording medium based on the aforementioned characteristic information,

[0961] A generation means for generating appropriate acoustic and visual information based on the aforementioned matching results,

[0962] A playback means for outputting the generated sound and visual information,

[0963] A recognition means for recognizing the emotional state of a caregiver,

[0964] An adjustment means for adjusting the generated sound and visual information based on the emotional information obtained by the recognition means,

[0965] A system comprising the above.

[0966] (Claim 2)

[0967] The system according to claim 1, wherein the information processing means further comprises a function for suppressing external sounds and standardizing speech.

[0968] (Claim 3)

[0969] The system according to claim 1, wherein the generation means further comprises a function to improve the generation procedure based on evaluation information obtained from the user.

[Explanation of Symbols]

[0970]
10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots

Claims

1. A system comprising: a voice input means for collecting baby voice data; a data processing means for extracting feature quantities for analyzing the audio data; a matching means for performing matching against a database based on the feature quantities; a generation means for generating appropriate sound and video based on the matching results; and a playback means for outputting the generated sound and video.

2. The system according to claim 1, wherein the data processing means further comprises a function for noise reduction and speech normalization.

3. The system according to claim 1, wherein the generation means further comprises a function to improve the generation algorithm based on feedback information obtained from the user.
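Claim 3 states that the generation algorithm is improved using feedback from the user but does not say how. One simple reading, sketched here purely as an assumption, is to keep a running average of user ratings for each candidate output and prefer the highest-rated one. The class `FeedbackSelector`, its methods, and the 0.0 to 1.0 rating scale are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


class FeedbackSelector:
    """Prefer the candidate content with the highest mean user rating."""

    def __init__(self, candidates: Iterable[str]) -> None:
        self.candidates = list(candidates)
        self.totals = defaultdict(float)  # sum of ratings per content
        self.counts = defaultdict(int)    # number of ratings per content

    def record(self, content: str, rating: float) -> None:
        """Store one user rating (assumed scale: 0.0 = unhelpful, 1.0 = helpful)."""
        self.totals[content] += rating
        self.counts[content] += 1

    def mean(self, content: str) -> float:
        """Mean rating so far; unrated content scores 0.0."""
        n = self.counts[content]
        return self.totals[content] / n if n else 0.0

    def best(self) -> str:
        """Return the top-rated candidate; ties go to the earliest candidate."""
        return max(self.candidates, key=self.mean)


if __name__ == "__main__":
    selector = FeedbackSelector(["lullaby", "white_noise"])
    selector.record("white_noise", 1.0)
    print(selector.best())
```

This greedy rule never revisits poorly rated content; a practical system would likely add some exploration, which is outside what the claim specifies.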