system

A system for analyzing baby cries to provide timely and accurate childcare support by collecting, processing, and notifying caregivers of appropriate actions, addressing the challenge of determining the cause of crying and reducing caregiver stress.

JP2026100696A · Pending · Publication Date: 2026-06-19 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Caregivers, especially first-time caregivers, struggle to determine the cause of a baby's crying accurately, leading to increased stress and potential discomfort for both the baby and themselves due to the lack of effective technologies for analyzing crying sounds and providing appropriate countermeasures.

Method used

A system utilizing a computing device to collect and analyze baby cries, extract features, and generate appropriate countermeasures, notifying caregivers through a central device for faster and more accurate childcare support.

Benefits of technology

The system provides prompt and accurate childcare support by analyzing baby cries, reducing caregiver stress and improving the health and comfort of the baby through timely and appropriate responses.

✦ Generated by Eureka AI based on patent content.


Abstract

A system is provided. [Solution] The system includes: a computing device that collects a baby's sounds; a central device that analyzes the collected audio and extracts features; a central device that predicts the baby's condition based on the extracted features and generates countermeasures; and a computing device that notifies the caregiver of the generated countermeasures.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] It is difficult for caregivers to determine the cause of a baby's crying. First-time caregivers in particular often do not know how to respond appropriately. This problem increases the stress of childcare and may also affect the health and comfort of the baby. Conventional methods lack technologies to accurately analyze a baby's crying sound and quickly present appropriate countermeasures.

Means for Solving the Problems

[0005] This invention solves the above problem by providing a system that uses a computing device to collect a baby's cries and a central device to analyze the sounds and extract features. Furthermore, the central device estimates the baby's condition based on the extracted features and generates appropriate countermeasures. Based on this, the computing device notifies the caregiver of specific countermeasures, enabling faster and more accurate childcare support.

[0006] A "computing device" is an electronic device used for collecting, processing, and providing information to users.

[0007] A "central system" is a centrally managed digital computer system that aggregates data from multiple computing devices and performs analysis and decision-making.

[0008] "Sound analysis" is the process of analyzing the characteristics of digital sound data and extracting specific patterns or information.

[0009] "Feature extraction" is the process of identifying specific parameters or attributes from collected audio data.

[0010] "State estimation" is the process of estimating the current state and conditions of an object based on the data obtained.

[0011] "Countermeasure generation" is the process of determining specific actions and response methods to be taken based on the analysis results.

[0012] A "notification" is the act of informing a user of specific information visually or audibly.

Brief Description of the Drawings

[0013]

[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.

[Figure 2] This is a conceptual diagram showing an example of the essential functions of a data processing device and a smart device according to the first embodiment.

[Figure 3] It is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.

[Figure 4] It is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.

[Figure 5] It is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.

[Figure 6] It is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.

[Figure 7] It is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.

[Figure 8] It is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.

[Figure 9] It shows an emotion map to which a plurality of emotions are mapped.

[Figure 10] It shows an emotion map to which a plurality of emotions are mapped.

[Figure 11] It is a sequence diagram showing the processing flow of the data processing system in Example 1.

[Figure 12] It is a sequence diagram showing the processing flow of the data processing system in Application Example 1.

[Figure 13] It is a sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.

[Figure 14] It is a sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

Embodiments for Carrying Out the Invention

[0014] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0015] First, the terms used in the following description will be explained.

[0016] In the following embodiments, the labeled processor (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0017] In the following embodiments, the labeled RAM (Random Access Memory) is a memory in which information is temporarily stored and is used as a work memory by the processor.

[0018] In the following embodiments, the labeled storage is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0019] In the following embodiments, the labeled communication I/F (Interface) is an interface including a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0020] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."

[0021] [First Embodiment]

[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0030] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0034] This invention is a system that suggests appropriate responses to caregivers based on the sounds of a baby crying. This system is realized by collecting audio via a smart device and analyzing that information on a server.

[0035] First, the device monitors ambient sounds through its built-in microphone and begins recording when it detects a baby crying. This recorded data is then transmitted to a server via a secure network.

[0036] Next, the server analyzes the received audio data. This involves using audio processing algorithms to extract features such as frequency, amplitude, and duration of the sound. This analysis identifies crying patterns and classifies the baby's condition (hungry, sleepy, stressed, unwell, etc.).

[0037] The server then uses its pre-programmed learning database to predict the most appropriate course of action for the baby's condition. For example, if the baby is determined to be hungry, it will generate specific advice such as, "It's time to breastfeed."

[0038] The server then sends this solution to the device, and the user is notified. The device displays this as a pop-up message or push notification, providing the information in a way that is easy for caregivers to check and act upon.

[0039] Finally, users can provide feedback to the system about the actions they actually took. This allows the server to learn from the new data and further improve the accuracy of future analyses.

[0040] As a concrete example, consider a scenario where a baby starts crying. At this point, the device records the audio, and the server notifies the user of the analysis result, stating, "The baby's diaper may be wet." The user checks the diaper, and if it is indeed wet, provides feedback, which updates the database. This collaboration improves the quality of service in subsequent instances.
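The feature extraction mentioned above (frequency, amplitude, and duration) can be illustrated with a minimal sketch. The disclosure does not name a particular algorithm, so the zero-crossing frequency estimate, the RMS amplitude measure, the function name extract_features, and the synthetic 450 Hz test tone are all assumptions chosen for illustration.

```python
import math


def extract_features(samples, sample_rate):
    """Return duration (s), RMS amplitude, and a zero-crossing frequency estimate (Hz)."""
    if not samples:
        raise ValueError("samples must not be empty")
    duration = len(samples) / sample_rate
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Count sign changes between neighbouring samples. A pure tone crosses
    # zero twice per cycle, so frequency is roughly crossings / (2 * duration).
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    frequency = crossings / (2 * duration)
    return {
        "duration_s": duration,
        "rms_amplitude": rms,
        "frequency_hz": frequency,
    }


# Synthetic stand-in for a recorded cry (assumption): 0.5 s of a 450 Hz tone at 8 kHz.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 450 * n / SAMPLE_RATE) for n in range(4000)]
features = extract_features(tone, SAMPLE_RATE)
print(features)
```

A zero-crossing estimate is only reasonable for roughly tonal signals; a spectral method would likely be preferred for real crying sounds, which contain many harmonics.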

[0041] The following describes the processing flow.

[0042] Step 1:

[0043] The device constantly monitors ambient noise and starts recording when it detects a baby's crying exceeding a certain decibel level.

[0044] Step 2:

[0045] The device compresses the recorded audio data and sends it to the server using a secure protocol.

[0046] Step 3:

[0047] The server decompresses the received audio data and performs signal analysis. It extracts features such as frequency and volume to analyze the crying patterns.

[0048] Step 4:

[0049] The server applies crying patterns to a machine learning model and compares them with past training data to infer the baby's condition (hunger, sleepiness, wet diaper, health problems, etc.).

[0050] Step 5:

[0051] The server generates advice on countermeasures based on the inferred state. This includes specific action points and suggestions for childcare products.

[0052] Step 6:

[0053] The server sends the generated advice back to the terminal.

[0054] Step 7:

[0055] The device notifies the user of any advice it receives. Notifications are delivered via screen display, audio output, vibration, etc.

[0056] Step 8:

[0057] The user reviews the advice and then takes specific action with the baby based on it.

[0058] Step 9:

[0059] The user inputs the results of the actions they have taken into the terminal and provides feedback to the system.

[0060] Step 10:

[0061] The server receives user feedback and updates the database to improve the accuracy of future analyses.
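Steps 1 to 3 above (level-triggered recording, compression on the terminal, decompression on the server) might be sketched as follows. This is a simplified illustration under stated assumptions: the level is measured in dB relative to full scale, the threshold of -20 dB is hypothetical, zlib stands in for whatever compression is actually used, and the secure transport itself is omitted.

```python
import math
import struct
import zlib

THRESHOLD_DBFS = -20.0  # hypothetical trigger level, relative to full scale


def level_dbfs(frame):
    """RMS level of a frame of floats in [-1, 1], in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def should_record(frame, threshold=THRESHOLD_DBFS):
    """Step 1: start recording once the frame level exceeds the threshold."""
    return level_dbfs(frame) > threshold


def pack_for_upload(frame):
    """Step 2: quantize to 16-bit PCM and compress before sending."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in frame]
    return zlib.compress(struct.pack(f"<{len(ints)}h", *ints))


def unpack_on_server(payload):
    """Step 3: decompress and restore float samples for analysis."""
    pcm = zlib.decompress(payload)
    return [v / 32767 for v in struct.unpack(f"<{len(pcm) // 2}h", pcm)]


# Synthetic frames (assumption): one near-silent, one loud.
quiet = [0.001 * math.sin(n / 5) for n in range(800)]
loud = [0.6 * math.sin(n / 5) for n in range(800)]
for name, frame in (("quiet", quiet), ("loud", loud)):
    if should_record(frame):
        restored = unpack_on_server(pack_for_upload(frame))
        print(name, "recorded,", len(restored), "samples restored")
    else:
        print(name, "ignored")
```

A level threshold alone cannot distinguish crying from other loud sounds; in the described system the server-side analysis is what classifies the recording.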

[0062] (Example 1)

[0063] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0064] In modern times, caregivers are required to quickly and accurately determine the cause of a baby's crying, but the technology to achieve this is still not sufficiently developed. This is especially burdensome for caregivers with little experience in distinguishing between newborn cries, so there is a need for a system that can more accurately identify the cause of crying and suggest appropriate responses.

[0065] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0066] In this invention, the server includes means for using a device that collects acoustic data and detects specific sound patterns; means for using a processing device that analyzes the acoustic data and extracts features such as frequency, amplitude, and duration; and means for inferring the cause of the sound by comparing the features with previously stored data. This enables caregivers to obtain quick and accurate countermeasures.

[0067] "Acoustic data" refers to data that electronically records sound information from the environment and is used to analyze specific sound patterns.

[0068] A "device" is a general term for computing equipment used to collect acoustic data and detect specific sound patterns.

[0069] A "processing device" is a computer system that analyzes received acoustic data and extracts features such as frequency, amplitude, and duration.

[0070] "Accumulated data" refers to audio data and its analysis results that have been collected in the past and stored within the system, and is used for comparison with new audio data.

[0071] A "generative AI model" is an artificial intelligence model that utilizes machine learning techniques to perform speech analysis and has the ability to continuously improve the accuracy of pattern recognition.

[0072] A "specific sound pattern" is a set of sound features that occur under specific circumstances, such as a baby crying, and is detected by the device based on set conditions.

[0073] This invention is a system that suggests appropriate responses to caregivers based on a baby's cries, and functions effectively through the collection, analysis, and notification of acoustic data. The system's components are a terminal, a server, and a user.

[0074] The device monitors ambient sounds using a high-sensitivity microphone built into a device such as a smartphone or tablet. When a baby's crying is detected, the sound is recorded and stored as audio data. The recorded data is encrypted for security purposes and sent to the server via a secure communication protocol (e.g., HTTPS).

[0075] After receiving acoustic data, the server analyzes the data using an audio processing algorithm. This algorithm extracts features such as frequency, amplitude, and duration from the audio. The features obtained from the analysis are compared with stored data, and a generative AI model is used to identify specific patterns in a baby's crying. This allows for the estimation of causes such as hunger, lack of sleep, stress, or poor health.

[0076] For example, if the server determines that "the baby is complaining of diaper discomfort," it sends that information to the device. The device then notifies the caregiver of this prediction and displays it as a message including specific action instructions. This allows the caregiver to quickly take care of the baby.

[0077] After users have taken action, they can provide feedback on the results to the system. This feedback information is sent to the server and used to train the generative AI model, leading to improvements in the accuracy of future analyses.

[0078] An example of a prompt might be, "Please identify the reason why the baby started crying and provide specific solutions." Based on this prompt, the system generates appropriate advice for the caregiver.
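As one possible illustration of how such a prompt could be assembled, the sketch below appends extracted features to the example instruction. The disclosure does not specify the prompt layout, so the function name build_prompt, the feature names, and the bullet formatting are hypothetical.

```python
def build_prompt(features):
    """Combine the example instruction with extracted features (layout is hypothetical)."""
    instruction = (
        "Please identify the reason why the baby started crying "
        "and provide specific solutions."
    )
    # Sort by feature name so the prompt text is stable between calls.
    lines = [f"- {name}: {value}" for name, value in sorted(features.items())]
    return instruction + "\n\nExtracted features:\n" + "\n".join(lines)


# Placeholder feature values for illustration only.
prompt = build_prompt(
    {"frequency_hz": 450, "rms_amplitude": 0.35, "duration_s": 2.0}
)
print(prompt)
```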

[0079] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0080] Step 1:

[0081] The device constantly monitors ambient sounds using its built-in microphone. Recording begins immediately upon detection of a specific sound pattern, such as a baby crying. The input is real-time acoustic data, and the output is recorded audio data. Specifically, noise filtering is applied to reduce background noise and clearly capture the baby's crying.

[0082] Step 2:

[0083] The terminal sends recorded audio data to the server using a secure communication protocol (such as HTTPS). The input is the recorded audio file, and the output is the transmission of encrypted audio data. The specific operation includes a process of data compression to improve the efficiency of the transfer.

[0084] Step 3:

[0085] The server analyzes the received audio data using an audio processing algorithm. This analysis extracts features such as frequency, amplitude, and duration from the audio. The input is the audio data sent to the server, and the output is the analyzed feature data. Specifically, the process involves using audio spectral analysis to quantify the features.

[0086] Step 4:

[0087] The server compares the feature data obtained from the analysis with the stored data and uses a generative AI model to infer the cause of the crying. The input is the feature data and the stored database, and the output is the category of the inferred cause. Specifically, the machine learning model is applied recursively, and pattern matching is performed.

[0088] Step 5:

[0089] The server generates a response plan for the caregiver based on the suspected cause and sends it to the terminal. The input is the category of the cause, and the output is a message containing specific action instructions for the caregiver. Specifically, the server selects the optimal action from the database.

[0090] Step 6:

[0091] The device notifies the caregiver of the received instructions. The input is a message from the server, and the output is a pop-up or push notification on the device. Specific actions include the visual display of messages on the user interface.

[0092] Step 7:

[0093] The user provides care for the baby based on the instructions given and feeds the results back to the system. The input is the actions taken by the caregiver and their results, and the output is the data sent to the server as feedback information. Specifically, the caregiver provides feedback by selecting options and entering comments through the app.
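Steps 4 and 5 above (comparing feature data with stored data, then selecting an action) can be sketched in simplified form. The disclosure uses a generative AI model for this inference; the nearest-neighbor comparison below is a plain stand-in, not the disclosed method. The stored feature values and scale factors are invented placeholders, and the names STORED, SCALES, infer_cause, and respond are hypothetical. The messages reuse examples given elsewhere in this description.

```python
import math

# Hypothetical accumulated data: (frequency_hz, rms_amplitude, duration_s) -> cause.
# The numbers are placeholders for illustration, not measured values.
STORED = [
    ((350.0, 0.30, 1.0), "hunger"),
    ((300.0, 0.15, 2.5), "sleepiness"),
    ((520.0, 0.60, 0.8), "wet diaper"),
]
# Per-feature scales so that no single unit dominates the distance.
SCALES = (100.0, 0.1, 1.0)

ACTIONS = {
    "hunger": "It's time to breastfeed.",
    "sleepiness": "The baby may be getting sleepy.",
    "wet diaper": "The baby's diaper may be wet.",
}


def infer_cause(features):
    """Step 4 stand-in: return the cause of the closest stored feature vector."""
    def distance(stored):
        return math.sqrt(
            sum(((f - s) / k) ** 2 for f, s, k in zip(features, stored, SCALES))
        )
    return min(STORED, key=lambda item: distance(item[0]))[1]


def respond(features):
    """Step 5: pair the inferred cause with its message for the caregiver."""
    cause = infer_cause(features)
    return cause, ACTIONS[cause]


print(respond((500.0, 0.55, 0.9)))
```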

[0094] (Application Example 1)

[0095] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0096] Accurately interpreting a baby's cries and promptly providing appropriate advice to caregivers is crucial in reducing the burden of childcare. However, conventional childcare support systems have limitations in voice analysis and lack the ability to accurately predict various conditions of a baby. Furthermore, there is room for improvement in methods that shorten the time it takes for caregivers to take action and support specific childcare actions. To address these challenges, a system is needed that provides more accurate voice analysis and rapid, accurate childcare support based on that analysis.

[0097] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0098] In this invention, the server includes an information processing means for collecting the baby's voice along with environmental information, an information processing device for analyzing the voice and extracting features, and an information processing device for inferring the baby's condition based on the features and generating countermeasures. This makes it possible to classify the diverse conditions of the baby with high accuracy and to quickly provide specific action guidelines to caregivers.

[0099] An "information processing means" is a device for collecting audio and environmental information and processing it as data.

[0100] An "information processing device" is a device that extracts and analyzes features from collected audio data.

[0101] A "machine" is a device that operates automatically based on notified countermeasures to assist with childcare.

[0102] A "past learning data recording device" is a device that stores previously collected data and learning results, and compares them with new data.

[0103] "Action instructions" are specific guidance that elicits particular actions and indicates how caregivers or machines should respond.

[0104] To implement this invention, it is necessary to construct a system that collects a baby's voice and provides appropriate advice to the caregiver. This system consists of an information processing means, an information processing device, a machine, and a device for recording past learning data.

[0105] The terminal is installed as a home robot and uses a high-sensitivity microphone to constantly monitor ambient sounds. When it detects a baby crying, it records the audio data and sends it to a server. The server uses speech processing software such as Google® Cloud Speech-to-Text and IBM Watson® to analyze the transmitted audio data and extract features such as frequency, amplitude, and duration. By comparing the extracted features with past learning data from a recording device, the system identifies the baby's condition and generates the optimal response accordingly.

[0106] The generated countermeasures are notified to a home robot, which acts as a terminal, and the robot provides information to the caregiver using voice and a display. For example, if the baby starts crying in the middle of the night, the caregiver will receive a notification saying, "The baby may be getting sleepy." Furthermore, the robot will take actions such as playing a gentle lullaby to soothe the baby.

[0107] Thus, the present invention aims to provide prompt and accurate childcare support based on the sounds of a baby crying, thereby reducing the burden on childcare providers.

[0108] An example of a prompt message for the generative AI model would be: "I have recorded my baby crying. Please send this data to the server and generate optimal parenting advice based on the analysis results."

[0109] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0110] Step 1:

[0111] The device constantly monitors ambient sounds using its built-in high-sensitivity microphone, and when it detects a baby crying, it records the sound. This recorded data becomes the input. The recorded audio data is processed as a digital signal and prepared for transmission to the server.

[0112] Step 2:

[0113] The server receives audio data sent from the terminal. The input audio data is analyzed using Google Cloud Speech-to-Text or IBM Watson. Data processing is performed to extract features such as the frequency characteristics, amplitude, and duration of the audio, and the crying pattern is identified based on the results.

[0114] Step 3:

[0115] The server predicts the baby's state (hunger, sleepiness, stress, etc.) by comparing the extracted audio features with past training data recorded by a machine learning algorithm. The data calculation used here is pattern matching using a machine learning algorithm. The predicted state is obtained as the output.

[0116] Step 4:

[0117] The server generates appropriate countermeasures based on the estimated state of the baby. This generation process refers to a pre-entered database of childcare advice and outputs the most suitable message. For example, it might generate specific advice such as, "The baby seems sleepy."

[0118] Step 5:

[0119] Notification information is sent to the device. The device receives the message and provides the information to the caregiver through a display or audio output device. The information is displayed using an intuitive interface to make it easier for the caregiver to check the appropriate course of action.

[0120] Step 6:

[0121] Users (caregivers) take specific caregiving actions based on the advice they receive. For example, they might try to get their baby to sleep or breastfeed. They can provide feedback on the actions they take, which helps improve the accuracy of future analyses.
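Steps 4 and 5 of this flow (looking up a countermeasure in an advice database and delivering it through the robot) might be sketched as below. The database contents, the fallback message, the action identifier play_lullaby, and the names Countermeasure, ADVICE_DB, generate_countermeasure, and dispatch are assumptions for illustration; only the example messages and the lullaby behavior come from this description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Countermeasure:
    message: str                         # shown and spoken to the caregiver
    robot_action: Optional[str] = None   # action the robot performs itself, if any


# Hypothetical advice database keyed by the estimated state.
ADVICE_DB = {
    "sleepiness": Countermeasure("The baby may be getting sleepy.", "play_lullaby"),
    "hunger": Countermeasure("It's time to breastfeed."),
}
# Hypothetical fallback used when the state is not in the database.
FALLBACK = Countermeasure("Please check on the baby.")


def generate_countermeasure(state):
    """Step 4: select the stored advice for the estimated state."""
    return ADVICE_DB.get(state, FALLBACK)


def dispatch(countermeasure):
    """Step 5: list the outputs the robot would produce, in order."""
    outputs = [
        ("display", countermeasure.message),
        ("voice", countermeasure.message),
    ]
    if countermeasure.robot_action:
        outputs.append(("action", countermeasure.robot_action))
    return outputs


for output in dispatch(generate_countermeasure("sleepiness")):
    print(output)
```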

[0122] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0123] This invention is a system that aims to provide optimal childcare support by considering not only the baby's cries but also the caregiver's emotional state. This system provides comprehensive support to caregivers by integrating voice analysis, emotion recognition, generation of appropriate countermeasures, and notification thereof.

[0124] The device collects the baby's cries using a microphone as usual and sends the recorded data to a server. At the same time, the device's camera captures the user's facial expressions and provides this information to the emotion engine.

[0125] The server processes the baby's crying data using a voice analysis algorithm to extract features. This result is then compared with a machine learning model to estimate the baby's state. Additionally, based on the user's facial expressions captured by the camera, an emotion engine analyzes the user's emotions and determines their state (e.g., stress, fatigue, calmness).

[0126] Based on this analysis, the server creates a response optimized for the baby's condition and the user's emotions. For example, if the baby is hungry and the user is tired, it provides simple, quick action plans and a list of recommended baby products.

[0127] The server then sends the generated notification to the device, which provides the user with visual and auditory notifications. The notification content is designed to be easy to understand and act upon, taking into account the user's situation, with the aim of reducing parenting stress.

[0128] As a concrete example, consider a scenario where a baby starts crying at night. The device detects the crying and sends it to the server, while simultaneously capturing the user's facial expression with a camera. If the server determines the baby is "hungry" and the user is "tired," a specific and simple message such as "Feeding is needed. We recommend preparing formula milk" is generated and sent as a notification from the device. This allows caregivers to take appropriate action quickly, reducing their burden.
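The integration described in this scenario can be sketched as a simple lookup that combines the baby's estimated state with the user's estimated emotion. This is a minimal illustration, not the disclosed implementation: the table structure, the state and emotion labels, the fallback message, and the names BASE_ADVICE, EMOTION_ADDENDUM, and compose_response are hypothetical. The "hungry" and "tired" message is the one given in the example above.

```python
# Base advice per estimated baby state.
BASE_ADVICE = {
    "hungry": "Feeding is needed.",
    "sleepy": "The baby seems sleepy.",
}

# Hypothetical additions keyed first by user emotion, then by baby state.
EMOTION_ADDENDUM = {
    "tired": {"hungry": "We recommend preparing formula milk."},
}


def compose_response(baby_state, user_emotion):
    """Combine base advice with an emotion-specific addition when one exists."""
    base = BASE_ADVICE.get(baby_state, "Please check on the baby.")
    extra = EMOTION_ADDENDUM.get(user_emotion, {}).get(baby_state)
    return f"{base} {extra}" if extra else base


print(compose_response("hungry", "tired"))
print(compose_response("hungry", "calm"))
```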

[0129] The following describes the processing flow.

[0130] Step 1:

[0131] The device constantly monitors the surrounding sounds using a microphone and records the baby's cries. Recording is configured to start when the device detects sounds exceeding a certain volume or frequency.

[0132] Step 2:

[0133] The device uses its camera to capture the user's facial expressions. It applies facial recognition technology and collects facial expression data in real time.

[0134] Step 3:

[0135] The device formats the recorded audio and facial expression data appropriately and sends it to the server using a secure protocol.

[0136] Step 4:

[0137] The server receives the audio data and uses an audio analysis algorithm to extract the characteristics of the crying sounds. This analysis evaluates data such as sound patterns, length, and intervals.
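One way to obtain the "length" and "intervals" mentioned in Step 4 is to split a loudness envelope into cry bursts and measure them. The sketch below assumes an envelope (one loudness value per frame) has already been computed; the function names, the threshold, and the sample envelope are illustrative and not part of the disclosure.

```python
def cry_bursts(envelope, frame_seconds, threshold):
    """Return (start, end) times of consecutive frames whose level is at or above threshold."""
    bursts = []
    start = None
    for i, level in enumerate(envelope):
        if level >= threshold and start is None:
            start = i  # a burst begins
        elif level < threshold and start is not None:
            bursts.append((start * frame_seconds, i * frame_seconds))
            start = None  # the burst has ended
    if start is not None:  # envelope ended while a burst was still active
        bursts.append((start * frame_seconds, len(envelope) * frame_seconds))
    return bursts

def burst_stats(bursts):
    """Return burst durations and the silent intervals between consecutive bursts."""
    durations = [end - start for start, end in bursts]
    intervals = [bursts[i + 1][0] - bursts[i][1] for i in range(len(bursts) - 1)]
    return durations, intervals

# Made-up envelope with two bursts, one value per 0.5 s frame.
envelope = [0.0, 0.8, 0.9, 0.7, 0.1, 0.0, 0.6, 0.9, 0.0]
bursts = cry_bursts(envelope, frame_seconds=0.5, threshold=0.5)
durations, intervals = burst_stats(bursts)
print(bursts, durations, intervals)
```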

[0138] Step 5:

[0139] The server uses a pre-trained machine learning model to analyze the characteristics of the baby's cries and predict the baby's condition, classifying it into a specific state such as hunger, sleepiness, or illness.

[0140] Step 6:

[0141] The server simultaneously analyzes facial expression data using an emotion engine to determine the user's emotional state (e.g., tired, stressed). It focuses on analyzing the movement of facial muscles and eye movements.

[0142] Step 7:

[0143] The server integrates the baby's estimated state with the user's emotional state to generate the optimal response. This response includes specific action instructions tailored to the baby's state and support measures that take the user's emotional state into consideration.
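The integration in Step 7 can be pictured as combining two independent lookups: one keyed by the baby's state and one keyed by the user's emotion. The sketch below is an assumption about how this might be structured. Apart from the hunger message quoted earlier in the description, the table entries are placeholders, and a real system might generate the text with a model instead of a table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Response:
    action: str   # what to do for the baby
    support: str  # encouragement adapted to the user's emotion

# Hypothetical tables; only the hunger message comes from the description.
ACTIONS = {
    "hunger": "Feeding is needed. We recommend preparing formula milk.",
    "sleepiness": "(placeholder) action text for a sleepy baby",
}
SUPPORT = {
    "tired": "(placeholder) short, low-effort steps for a tired user",
    "stressed": "(placeholder) reassurance for a stressed user",
}

def integrate(baby_state, user_emotion):
    """Combine the estimated baby state and user emotion into one response."""
    action = ACTIONS.get(baby_state, "(placeholder) generic check-on-the-baby text")
    support = SUPPORT.get(user_emotion, "")
    return Response(action=action, support=support)

print(integrate("hunger", "tired"))
```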

[0144] Step 8:

[0145] The server generates and delivers countermeasures to the terminal. These include not only action instructions but also encouragement and advice that takes into account the user's psychological state.

[0146] Step 9:

[0147] The device sends notifications to the user. These notifications are displayed visually as pop-up messages, and audio and vibration feedback can also be configured.

[0148] Step 10:

[0149] The user checks the notification and takes action based on the suggested response. At this time, the user can input the results of their actions as feedback into their device.

[0150] Step 11:

[0151] The server receives feedback from users and updates the database based on that feedback. This makes it possible to improve the accuracy of future analyses.
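As an illustration of Step 11, feedback could be stored in a table and summarized per predicted state, so that states that are often predicted wrongly become visible. The schema and function names below are hypothetical, and an in-memory SQLite database stands in for the server's database.

```python
import sqlite3

def open_feedback_db():
    """Create an in-memory database with a hypothetical feedback table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feedback (predicted_state TEXT NOT NULL, was_correct INTEGER NOT NULL)"
    )
    return conn

def record_feedback(conn, predicted_state, was_correct):
    """Store one piece of user feedback about a prediction."""
    conn.execute("INSERT INTO feedback VALUES (?, ?)", (predicted_state, int(was_correct)))
    conn.commit()

def accuracy_by_state(conn):
    """Return the fraction of correct predictions for each predicted state."""
    rows = conn.execute(
        "SELECT predicted_state, AVG(was_correct) FROM feedback GROUP BY predicted_state"
    )
    return {state: accuracy for state, accuracy in rows}

conn = open_feedback_db()
record_feedback(conn, "hunger", True)
record_feedback(conn, "hunger", False)
record_feedback(conn, "sleepiness", True)
stats = accuracy_by_state(conn)
print(stats)
```

How the accumulated feedback is then used to retrain or adjust the model is not specified in the description and is outside this sketch.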

[0152] (Example 2)

[0153] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0154] In childcare, quickly determining the cause of a baby's crying and appropriately communicating the necessary response to the caregiver is a challenging task. Furthermore, simply offering general solutions without considering the caregiver's emotional state can increase their burden. This invention aims to provide optimal support by considering not only the baby's condition but also the caregiver's emotional state.

[0155] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0156] In this invention, the server includes means for analyzing audio and extracting features, means for recognizing the emotional state of the caregiver, and means for integrating the aforementioned audio features and the caregiver's emotional state to generate an optimal response. This makes it possible to provide optimal childcare support based on the state of the child being cared for and the emotions of the caregiver.

[0157] An "electronic device" is a device used to collect digital audio data and to record the crying of a child.

[0158] A "data processing device" is a computer device used to analyze collected audio data and extract its characteristics.

[0159] A "central control unit" is a device that has the function of generating optimal childcare responses based on voice characteristics and the emotional state of the caregiver.

[0160] A "display device" is a device that notifies the caregiver of the generated countermeasures visually or audibly.

[0161] "Target of childcare" refers to babies and infants who are the target of support in this invention.

[0162] "The emotional state of a caregiver" refers to the emotional conditions that caregivers experience during childcare, such as stress, fatigue, and relaxation.

[0163] "Optimal response measures" refer to specific and practical instructions and recommendations for childcare that are generated based on the condition of the child being cared for and the emotional state of the caregiver.

[0164] "Training data" refers to information accumulated from past analysis results of speech and emotions, and is a dataset used as a standard or reference in current analysis.

[0165] This invention is a comprehensive system for supporting childcare, providing optimal childcare support by analyzing the baby's cries and the caregiver's emotional state. The system mainly consists of terminals, a server, and the caregiver.

[0166] The terminal is an electronic device equipped with a microphone and a camera, which are used to record the baby's cries and capture the caregiver's facial expressions. The recorded audio data and captured image data are transmitted to the server in real time.

[0167] The server performs the core data processing for audio analysis and emotion recognition. It applies audio analysis algorithms to the audio data, and it processes image data received from the camera using emotion recognition software to determine the emotional state of the caregiver. The accuracy of the analysis is improved by comparing the audio characteristics and the caregiver's emotional state with past training data.

[0168] Past training data is accumulated by machine learning algorithms and forms the basis for the system to understand voice and facial expression patterns. Based on this, the server uses a generative AI model to create optimal responses. These responses are then communicated to caregivers as visual or audio messages. The notifications are specific and simple to enable caregivers to respond quickly.

[0169] To give a specific example, if a baby starts crying at night, the terminal quickly detects the crying and simultaneously captures the caregiver's facial expression. For instance, if the system determines that the baby is "hungry" and the caregiver is "tired," it generates a message such as, "The baby needs to be fed. Please prepare the formula." This allows caregivers to efficiently resolve childcare challenges.

[0170] Examples of input prompts for the generative AI model include specific instructions such as, "The baby is crying. Based on the audio data and the user's emotions, please provide appropriate childcare solutions." This allows the system to provide accurate childcare support based on a combination of data.
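A prompt of this kind could be assembled from the analysis results before it is passed to the generative AI model. In the sketch below, the first sentence of the template is the example prompt quoted above; the two appended fields and the function name are assumptions made for illustration, and the call to the model itself is omitted.

```python
# The opening instruction is the example prompt from the description;
# the two labelled fields are hypothetical additions.
PROMPT_TEMPLATE = (
    "The baby is crying. Based on the audio data and the user's emotions, "
    "please provide appropriate childcare solutions.\n"
    "Estimated baby state: {baby_state}\n"
    "Caregiver emotion: {caregiver_emotion}"
)

def build_prompt(baby_state, caregiver_emotion):
    """Fill the template with the results of audio and emotion analysis."""
    return PROMPT_TEMPLATE.format(
        baby_state=baby_state, caregiver_emotion=caregiver_emotion
    )

print(build_prompt("hungry", "tired"))
```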

[0171] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0172] Step 1:

[0173] The device uses a microphone to collect the baby's cries in real time. The cries, as input, are converted into digital audio data and stored on the device. Simultaneously, a camera is used to capture the caregiver's facial expressions and record them as image data. This allows for the simultaneous collection of audio and image data.

[0174] Step 2:

[0175] The device transmits the collected audio and image data to the server via the internet. During this transmission process, the data is encrypted to ensure data security. The output is the transmitted audio and image dataset.

[0176] Step 3:

[0177] The server feeds the received audio data to a speech analysis algorithm. Data processing is performed to extract features such as pitch, intensity, and rhythm from the input audio data. The resulting feature data is then input to a machine learning model to classify the baby's state into categories such as "hungry," "sleepy," and "uncomfortable."
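Of the features named in Step 3, pitch is the least obvious to compute. One common time-domain approach is to pick the lag that maximizes the autocorrelation of the signal. The sketch below uses that approach in plain Python. The disclosure does not say which pitch estimator is used, and the search range of 200 to 800 Hz is an assumption chosen for illustration.

```python
import math

def estimate_pitch_hz(samples, sample_rate, min_hz=200.0, max_hz=800.0):
    """Return the pitch whose lag maximizes the autocorrelation within [min_hz, max_hz]."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic 400 Hz tone (0.1 s at 8 kHz) standing in for a recorded cry.
tone = [math.sin(2 * math.pi * 400 * n / 8000) for n in range(800)]
pitch = estimate_pitch_hz(tone, sample_rate=8000)
print(pitch)
```

Real cries are far less regular than a pure tone, so a production system would likely work on short windows and smooth the estimates over time.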

[0178] Step 4:

[0179] The server analyzes the emotional state of the caregiver using image data obtained from the camera. Emotion recognition software evaluates the movement of facial muscles and facial patterns, and determines emotions such as "stress" and "fatigue" from the input images. This process yields emotional state as feature data.

[0180] Step 5:

[0181] The server integrates the obtained voice characteristics with the caregiver's emotional state and generates the optimal response using a generative AI model. Here, voice data and emotional data are processed as input and turned into concrete actions and recommendations. The output is an action plan optimized for both the baby and the caregiver.

[0182] Step 6:

[0183] The server sends the generated countermeasures to the terminal. The terminal receives this information and notifies the caregiver using the display screen or speaker. The notification includes visual messages and audio guidance to deliver quick and clear instructions to the caregiver.

[0184] Through these steps, the system can process the input data and provide parents with rational and practical solutions to difficult situations in childcare.

[0185] (Application Example 2)

[0186] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0187] In today's childcare environment, where it is difficult to respond appropriately to a baby's crying, the challenge is to alleviate the mental and physical burden on caregivers and provide prompt and effective childcare support. Furthermore, because the emotional state of the caregiver themselves is not taken into consideration, the current support is not truly meaningful for them.

[0188] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0189] In this invention, the server includes acoustic detection means for collecting the baby's voice, visual information acquisition means for capturing the caregiver's facial expressions and analyzing their emotions, and a central processing unit for estimating the baby's and caregiver's conditions based on the aforementioned characteristics and the caregiver's emotions, and for generating countermeasures. This enables comprehensive childcare support that takes into account both the baby's condition and the caregiver's emotions.

[0190] "Acoustic detection means" refers to a sensor device used to collect the sound of a baby crying.

[0191] A "processing device" is a device used to analyze collected audio data and extract its characteristics.

[0192] A "visual information acquisition device" is a device that captures the facial expressions of a caregiver and analyzes their emotional state.

[0193] A "central processing unit" is a computing device that estimates the baby's condition and the caregiver's condition based on audio and visual data, and generates the optimal course of action.

[0194] An "output device" is a device that provides notifications to caregivers visually or audibly.

[0195] This invention relates to a system that analyzes a baby's voice and the caregiver's emotions to provide optimal childcare support. The system comprises an acoustic detection means, a visual information acquisition means, a central processing unit, and an output device.

[0196] First, the acoustic detection means collects the baby's cries. This hardware includes an audio input device such as a high-sensitivity microphone. For visual information acquisition, a camera is used to capture the caregiver's facial expressions. Specific examples include a standard USB camera or a camera built into a smartphone.

[0197] The server analyzes the collected audio data using a processing unit and extracts its features. This process utilizes an audio analysis algorithm to analyze the patterns of the baby's cries. Next, the facial expression data of the caregiver, obtained using visual information acquisition means, is analyzed by emotion recognition software. Specific software used includes "EmotionEngine," among others.

[0198] Based on the analysis results, the central processing unit infers the baby's condition and the caregiver's emotions, and generates the optimal course of action based on this information. This involves matching the data against a database that includes machine learning models. The caregiver is notified of the generated course of action visually or audibly through an output device. Specific examples include push notifications and voice guidance via smartphones or smart glasses.

[0199] This invention makes it possible to provide accurate childcare support that takes into account the baby's crying and the caregiver's condition, thereby reducing the burden on the caregiver.

[0200] Example of a prompt:

[0201] "Enter a Japanese sentence. For example, 'The baby has started crying. The mother appears tired.' Please generate the best parenting advice for this sentence."

[0202] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0203] Step 1:

[0204] The user activates the acoustic detection device near the baby. The acoustic detection device collects the baby's cries through a microphone. The input is real-time audio data, which is converted into digital data and sent to the processing unit as an audio signal. The output is digital audio data to be analyzed.

[0205] Step 2:

[0206] The server analyzes the audio data acquired using a processing unit. It receives digital audio data as input and applies an audio analysis algorithm. By extracting features from the data and analyzing the patterns of the baby's cries, it infers the baby's state (e.g., hungry, sleepy, etc.). The output is information indicating the baby's state.

[0207] Step 3:

[0208] The user activates a visual information acquisition device and captures their facial expressions with a camera. The input is real-time video data, which is sent to the server as a digital image. The output is video data ready for emotion analysis.

[0209] Step 4:

[0210] The server analyzes the user's video data using emotion recognition software. It receives digital video data as input and applies an emotion analysis algorithm. It determines the user's emotional state (e.g., stress, fatigue) and generates information indicating the emotional state as output.

[0211] Step 5:

[0212] The server integrates baby state information obtained from voice analysis and user emotional state information obtained from emotion analysis in the central processing unit. The input consists of both the baby's state and the user's emotional state. This information is compared with a machine learning model in the database to generate the optimal response. The output is a message containing specific childcare support measures to notify the user.

[0213] Step 6:

[0214] The server notifies the user of generated childcare support messages visually or audibly through an output device. The input is the generated message, and the output is the information the user receives. Specific examples include push notifications and voice guidance via smartphones and smart glasses.

[0215] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0216] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0217] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0218] [Second Embodiment]

[0219] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0220] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0221] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0222] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0223] The microphone 238 receives instructions and other input from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0224] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0225] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0226] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0227] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0228] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0229] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0230] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0231] This invention is a system that suggests appropriate responses to caregivers based on the sounds of a baby crying. This system is realized by collecting audio via a smart device and analyzing that information on a server.

[0232] First, the device monitors ambient sounds through its built-in microphone and begins recording when it detects a baby crying. This recorded data is then transmitted to a server via a secure network.

[0233] Next, the server analyzes the received audio data. This involves using audio processing algorithms to extract features such as frequency, amplitude, and duration of the sound. This analysis identifies crying patterns and classifies the baby's condition (hungry, sleepy, stressed, unwell, etc.).

[0234] The server then uses its pre-programmed learning database to predict the most appropriate course of action for the baby's condition. For example, if the baby is determined to be hungry, it will generate specific advice such as, "It's time to breastfeed."

[0235] The server then sends this solution to the device, and the user is notified. The device displays this as a pop-up message or push notification, providing the information in a way that is easy for caregivers to check and act upon.

[0236] Finally, users can provide feedback to the system about the actions they actually took. This allows the server to learn from the new data and further improve the accuracy of future analyses.

[0237] As a concrete example, consider a scenario where a baby starts crying. The terminal records the audio, and the server analyzes it and notifies the user of the result, for example, "The baby's diaper may be wet." The user checks the diaper, and if it is indeed wet, provides feedback, which updates the database. This feedback loop improves the quality of service in subsequent instances.

[0238] The following describes the processing flow.

[0239] Step 1:

[0240] The device constantly monitors ambient noise and starts recording when it detects a baby's crying exceeding a certain decibel level.

[0241] Step 2:

[0242] The device compresses the recorded audio data and sends it to the server using a secure protocol.

[0243] Step 3:

[0244] The server decompresses the received audio data and performs signal analysis. It extracts features such as frequency and volume to analyze the crying patterns.
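The compress-then-decompress round trip of Steps 2 and 3 could look like the sketch below. It uses the general-purpose lossless `zlib` module from the Python standard library purely as a stand-in: the disclosure does not name a compression scheme, a real system would more likely use an audio codec, and the secure transport is not shown.

```python
import zlib

def compress_audio(pcm_bytes):
    """Terminal side: losslessly compress raw audio bytes before sending."""
    return zlib.compress(pcm_bytes, level=6)

def decompress_audio(payload):
    """Server side: restore the original audio bytes."""
    return zlib.decompress(payload)

# Synthetic byte string standing in for recorded PCM audio.
pcm = bytes(1000) + bytes(range(256)) * 4
payload = compress_audio(pcm)
restored = decompress_audio(payload)
print(len(pcm), len(payload), restored == pcm)
```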

[0245] Step 4:

[0246] The server applies crying patterns to a machine learning model and compares them with past training data to infer the baby's condition (hunger, sleepiness, wet diaper, health problems, etc.).
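A very simple way to "compare with past training data" is a nearest-centroid classifier: average the feature vectors of past examples per label, then assign a new cry to the label with the closest average. The sketch below illustrates only that idea. The disclosure does not specify the model, and the two-dimensional feature vectors are arbitrary toy values without physical meaning.

```python
import math

def centroids(labelled_examples):
    """Average the feature vectors of past examples for each label."""
    sums, counts = {}, {}
    for label, features in labelled_examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(features, centres):
    """Return the label whose centroid is closest (Euclidean distance) to the features."""
    return min(centres, key=lambda label: math.dist(features, centres[label]))

# Toy "past training data"; the values are arbitrary.
past = [
    ("hunger", [1.0, 0.0]),
    ("hunger", [0.8, 0.2]),
    ("sleepiness", [0.0, 1.0]),
    ("sleepiness", [0.2, 0.8]),
]
centres = centroids(past)
print(classify([0.7, 0.3], centres))
```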

[0247] Step 5:

[0248] The server generates advice on countermeasures based on the inferred state. This includes specific action points and suggestions for childcare products.

[0249] Step 6:

[0250] The server sends the generated advice back to the terminal.

[0251] Step 7:

[0252] The device notifies the user of any advice it receives. Notifications are delivered via screen display, audio output, vibration, etc.

[0253] Step 8:

[0254] The user reviews the advice and takes specific action with the baby based on it.

[0255] Step 9:

[0256] The user inputs the results of the actions they have taken into the terminal and provides feedback to the system.

[0257] Step 10:

[0258] The server receives user feedback and updates the database to improve the accuracy of future analyses.

[0259] (Example 1)

[0260] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0261] In modern times, caregivers are required to quickly and accurately determine the cause of a baby's crying, but the technology to achieve this is still not sufficiently developed. This is especially burdensome for caregivers with little experience in distinguishing between newborn cries, so there is a need for a system that can more accurately identify the cause of crying and suggest appropriate responses.

[0262] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0263] In this invention, the server includes means for using a device that collects acoustic data and detects specific sound patterns; means for using a processing device that analyzes the acoustic data and extracts features such as frequency, amplitude, and duration; and means for inferring the cause of the sound by comparing the features with previously stored data. This enables caregivers to obtain quick and accurate countermeasures.

[0264] "Acoustic data" refers to data that electronically records sound information from the environment and is used to analyze specific sound patterns.

[0265] A "device" is a general term for computing equipment used to collect acoustic data and detect specific sound patterns.

[0266] A "processing device" is a computer system that analyzes received acoustic data and extracts features such as frequency, amplitude, and duration.

[0267] "Accumulated data" refers to audio data and its analysis results that have been collected in the past and stored within the system, and is used for comparison with new audio data.

[0268] A "generative AI model" is an artificial intelligence model that utilizes machine learning techniques to perform speech analysis and has the ability to continuously improve the accuracy of pattern recognition.

[0269] A "specific sound pattern" is a set of sound features that occur under specific circumstances, such as a baby crying, and is detected by the device based on set conditions.

[0270] This invention is a system that suggests appropriate responses to caregivers based on a baby's cries, and functions effectively through the collection, analysis, and notification of acoustic data. The system's components are a terminal, a server, and a user.

[0271] The device monitors ambient sounds using a high-sensitivity microphone built into a device such as a smartphone or tablet. When a baby's crying is detected, the sound is recorded and stored as audio data. The recorded data is encrypted for security purposes and sent to the server via a secure communication protocol (e.g., HTTPS).

[0272] After receiving acoustic data, the server analyzes the data using an audio processing algorithm. This algorithm extracts features such as frequency, amplitude, and duration from the audio. The features obtained from the analysis are compared with stored data, and a generative AI model is used to identify specific patterns in a baby's crying. This allows for the estimation of causes such as hunger, lack of sleep, stress, or poor health.

[0273] For example, if the server determines that "the baby is complaining of diaper discomfort," it sends that information to the device. The device then notifies the caregiver of this prediction and displays it as a message including specific action instructions. This allows the caregiver to quickly take care of the baby.

[0274] After users have taken action, they can provide feedback on the results to the system. This feedback information is sent to the server and used to train the generative AI model, leading to improvements in the accuracy of future analyses.

[0275] An example of a prompt might be, "Please identify the reason why the baby started crying and provide specific solutions." Based on this prompt, the system generates appropriate advice for the caregiver.

[0276] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0277] Step 1:

[0278] The device constantly monitors ambient sounds using its built-in microphone. Recording begins immediately upon detection of a specific sound pattern, such as a baby crying. The input is real-time acoustic data, and the output is recorded audio data. Specifically, noise filtering is applied to reduce background noise and clearly capture the baby's crying.
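The description does not say what kind of noise filtering is applied in Step 1. As one possible stand-in, the sketch below uses a first-order high-pass filter, which suppresses slow background components such as low-frequency hum while passing faster-changing sound. The cutoff frequency and all names are assumptions.

```python
import math

def high_pass(samples, sample_rate, cutoff_hz):
    """First-order high-pass filter: attenuates content below roughly cutoff_hz."""
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for n in range(1, len(samples)):
        # Each output follows changes in the input and lets steady levels decay.
        out.append(alpha * (out[-1] + samples[n] - samples[n - 1]))
    return out

# A constant offset (slowest possible "noise") and a rapidly alternating signal.
steady = high_pass([1.0] * 200, sample_rate=8000, cutoff_hz=100.0)
fast = high_pass([1.0 if n % 2 == 0 else -1.0 for n in range(200)], 8000, 100.0)
print(abs(steady[-1]) < 1e-3, abs(fast[-1]) > 0.9)
```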

[0279] Step 2:

[0280] The terminal sends recorded audio data to the server using a secure communication protocol (such as HTTPS). The input is the recorded audio file, and the output is the transmission of encrypted audio data. Specifically, the data is compressed to improve transfer efficiency.

[0281] Step 3:

[0282] The server analyzes the received voice data using a voice processing algorithm. In this analysis, features such as frequency, amplitude, and duration are extracted from the voice. The input is the voice data sent to the server, and the output is the analyzed feature data. Specifically, the features are quantified using spectrum analysis of the voice.
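To make the spectrum analysis in Step 3 concrete, the sketch below computes a magnitude spectrum with a direct discrete Fourier transform and reports the strongest frequency. It is written in plain Python for readability and is far too slow for real recordings, where an FFT library would normally be used. The function names and the test tone are illustrative assumptions.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Return DFT magnitudes for bins 0 .. N/2 (direct O(N^2) computation)."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def dominant_frequency_hz(samples, sample_rate):
    """Return the frequency of the strongest non-DC bin."""
    spectrum = magnitude_spectrum(samples)
    peak_bin = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * sample_rate / len(samples)

# Synthetic 500 Hz tone: 256 samples at 8 kHz, so the tone falls exactly on bin 16.
tone = [math.sin(2 * math.pi * 500 * n / 8000) for n in range(256)]
peak = dominant_frequency_hz(tone, sample_rate=8000)
print(peak)
```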

[0283] Step 4:

[0284] The server compares the feature data obtained from the analysis with the accumulated data and uses the generative AI model to infer the cause of the crying sound. The input is the feature data and the accumulated database, and the output is the category of the inferred cause. As a specific operation, a machine learning model is recursively applied for pattern matching.

[0285] Step 5:

[0286] Based on the inferred cause, the server generates countermeasures for the caregiver and sends them to the terminal. The input is the category of the cause, and the output is a message containing specific action instructions for the caregiver. Specifically, the optimal action is selected from the database.

[0287] Step 6:

[0288] The terminal notifies the caregiver of the received action instructions. The input is the message from the server, and the output is a pop-up notification or push notification on the terminal. Specific operations include the visual display of messages on the user interface.

[0289] Step 7:

[0290] The user provides care for the baby based on the instructions given and feeds the results back to the system. The input is the actions taken by the caregiver and their results, and the output is the data sent to the server as feedback information. Specifically, the caregiver provides feedback by selecting options and entering comments through the app.

[0291] (Application Example 1)

[0292] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0293] Accurately interpreting a baby's cries and promptly providing appropriate advice to caregivers is crucial in reducing the burden of childcare. However, conventional childcare support systems have limitations in voice analysis and lack the ability to accurately predict various conditions of a baby. Furthermore, there is room for improvement in methods that shorten the time it takes for caregivers to take action and support specific childcare actions. To address these challenges, a system is needed that provides more accurate voice analysis and rapid, accurate childcare support based on that analysis.

[0294] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0295] In this invention, the server includes an information processing means for collecting the baby's voice along with environmental information, an information processing device for analyzing the voice and extracting features, and an information processing device for inferring the baby's condition based on the features and generating countermeasures. This makes it possible to classify the diverse conditions of the baby with high accuracy and to quickly provide specific action guidelines to caregivers.

[0296] An "information processing means" is a device for collecting audio and environmental information and processing it as data.

[0297] An "information processing device" is a device that extracts and analyzes features from collected audio data.

[0298] A "machine" is a device that operates automatically based on notified countermeasures to assist with childcare.

[0299] A "past learning data recording device" is a device that stores previously collected data and learning results, and compares them with new data.

[0300] "Action instructions" are specific guidance that elicits particular actions and indicates how caregivers or machines should respond.

[0301] To implement this invention, it is necessary to construct a system that collects a baby's voice and provides appropriate advice to the caregiver. This system consists of an information processing means, an information processing device, a machine, and a device for recording past learning data.

[0302] The device is installed as a home robot and uses a high-sensitivity microphone to constantly monitor ambient sounds. When it detects a baby crying, it records the audio data and sends it to a server. The server uses speech processing software such as Google Cloud Speech-to-Text or IBM Watson to analyze the transmitted audio data and extract features such as frequency, amplitude, and duration. By comparing the extracted features with past training data from a recording device, the system identifies the baby's condition and generates the optimal response accordingly.

[0303] The generated countermeasures are notified to a home robot, which acts as a terminal, and the robot provides information to the caregiver using voice and a display. For example, if the baby starts crying in the middle of the night, the caregiver will receive a notification saying, "The baby may be getting sleepy." Furthermore, the robot will take actions such as playing a gentle lullaby to soothe the baby.

[0304] Thus, the present invention aims to provide quick and accurate childcare support based on the crying sound of a baby and reduce the burden on the caregiver.

[0305] As an example of a prompt sentence for the generative AI model, a sentence such as "I recorded the crying sound of a baby. Please send this data to the server and generate optimal childcare advice based on the analysis results." is used.

[0306] The flow of the specific processing in Application Example 1 will be described using Figure 12.

[0307] Step 1:

[0308] The terminal constantly monitors the ambient sound using a built-in high-sensitivity microphone. When it detects the crying sound of a baby, it records the voice. This recorded data serves as the input. The recorded voice data is processed as a digital signal and prepared for transmission to the server.

[0309] Step 2:

[0310] The server receives the voice data transmitted from the terminal. The input voice data is analyzed using Google Cloud Speech-to-Text or IBM Watson. Data processing is performed to extract features such as the frequency characteristics, amplitude, and duration of the voice, and based on the results, the pattern of the crying sound is identified.
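As a non-limiting illustration of the feature extraction in Step 2, the following sketch computes duration, amplitude, and a rough frequency estimate from raw samples. It does not use the speech services named above; the function name `extract_features`, the zero-crossing frequency estimate, and the synthetic test tone are assumptions introduced here for illustration only.

```python
import math


def extract_features(samples, sample_rate):
    """Return duration (s), RMS amplitude, and a rough frequency estimate (Hz)."""
    if not samples:
        raise ValueError("samples must not be empty")
    duration = len(samples) / sample_rate
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Count sign changes; a pure tone changes sign twice per cycle.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    frequency = crossings / (2 * duration)
    return {
        "duration_s": duration,
        "rms_amplitude": rms,
        "frequency_hz": frequency,
    }


# Synthetic 440 Hz tone (1 second at 8 kHz) standing in for a recorded cry.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
features = extract_features(tone, rate)
```

A zero-crossing estimate is only meaningful for clean, near-periodic signals; a practical system would more likely use a spectral method.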

[0311] Step 3:

[0312] The server estimates the state of the baby (hunger, sleepiness, stress, etc.) by comparing the extracted voice features with the data stored in the past learning data recording device. The computation used here is pattern matching by a machine learning algorithm. The estimated state is obtained as the output.
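The pattern matching in Step 3 could, for example, take the form of a nearest-neighbor comparison against stored records. The sketch below is one possible reading and not the disclosed algorithm; the stored records, the feature order, and the per-feature scales are hypothetical values chosen for illustration.

```python
import math

# Hypothetical past records: (frequency_hz, rms_amplitude, duration_s) -> state.
PAST_RECORDS = [
    ((450.0, 0.60, 1.2), "hunger"),
    ((350.0, 0.30, 2.5), "sleepiness"),
    ((600.0, 0.85, 0.8), "stress"),
]

# Assumed per-feature scales so that no single feature dominates the distance.
SCALES = (100.0, 0.1, 1.0)


def distance(a, b):
    """Scaled Euclidean distance between two feature vectors."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, SCALES)))


def estimate_state(features, records=PAST_RECORDS):
    """Return the state label of the closest stored record."""
    _, label = min(records, key=lambda record: distance(features, record[0]))
    return label


state = estimate_state((460.0, 0.55, 1.0))
```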

[0313] Step 4:

[0314] The server generates appropriate countermeasures based on the estimated state of the baby. This generation process refers to a pre-entered database of childcare advice and outputs the most suitable message. For example, it might generate specific advice such as, "The baby seems sleepy."
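The database lookup in Step 4 can be sketched as a simple mapping from estimated state to message. Apart from "The baby seems sleepy.", which appears above, the entries and the fallback text are hypothetical.

```python
# Hypothetical pre-entered advice database keyed by estimated state.
ADVICE_DB = {
    "sleepiness": "The baby seems sleepy.",
    "hunger": "The baby may be hungry.",
    "stress": "The baby may be uncomfortable.",
}

# Assumed fallback for states that have no entry.
DEFAULT_ADVICE = "The cause is unclear. Please check on the baby."


def generate_advice(state):
    """Return the stored message for a state, or the fallback message."""
    return ADVICE_DB.get(state, DEFAULT_ADVICE)


message = generate_advice("sleepiness")
```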

[0315] Step 5:

[0316] Notification information is sent to the device. The device receives the message and provides the information to the caregiver through a display or audio output device. The information is displayed using an intuitive interface to make it easier for the caregiver to check the appropriate course of action.

[0317] Step 6:

[0318] Users (caregivers) take specific caregiving actions based on the advice they receive. For example, they might try to get their baby to sleep or breastfeed. They can provide feedback on the actions they take, which helps improve the accuracy of future analyses.

[0319] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0320] This invention is a system that aims to provide optimal childcare support by considering not only the baby's cries but also the caregiver's emotional state. This system provides comprehensive support to caregivers by integrating voice analysis, emotion recognition, generation of appropriate countermeasures, and notification thereof.

[0321] As described above, the device collects the baby's cries using a microphone and sends the recorded data to a server. At the same time, a camera on the device captures the user's facial expressions and provides this information to the emotion engine.

[0322] The server processes the baby's crying data using a voice analysis algorithm to extract features. These features are then input to a machine learning model to estimate the baby's state. Additionally, based on the user's facial expressions captured by the camera, an emotion engine analyzes the user's emotions and determines their state (e.g., stress, fatigue, calmness).

[0323] Based on this analysis, the server creates a response optimized for the baby's condition and the user's emotions. For example, if the baby is hungry and the user is tired, it provides simple, quick action plans and a list of recommended baby products.

[0324] The server then sends the generated notification to the device, which provides the user with visual and auditory notifications. The notification content is designed to be easy to understand and act upon, taking into account the user's situation, with the aim of reducing parenting stress.

[0325] As a concrete example, consider a scenario where a baby starts crying at night. The device detects the crying and sends it to the server, while simultaneously capturing the user's facial expression with a camera. If the server determines the baby is "hungry" and the user is "tired," a specific and simple message such as "Feeding is needed. We recommend preparing formula milk" is generated and sent as a notification from the device. This allows caregivers to take appropriate action quickly, reducing their burden.

[0326] The following describes the processing flow.

[0327] Step 1:

[0328] The device records the baby's cries using a microphone and constantly monitors the surrounding sounds. This process is configured to start recording when it detects sounds exceeding a certain volume or frequency.

[0329] Step 2:

[0330] The device uses its camera to capture the user's facial expressions. It applies facial recognition technology and collects facial expression data in real time.

[0331] Step 3:

[0332] The device formats the recorded audio and facial expression data appropriately and sends it to the server using a secure protocol.

[0333] Step 4:

[0334] The server receives the audio data and uses an audio analysis algorithm to extract the characteristics of the crying sounds. This analysis evaluates data such as sound patterns, length, and intervals.

[0335] Step 5:

[0336] The server uses a pre-trained machine learning model to analyze the characteristics of the baby's cries and predict its condition. This is then categorized into specific states such as hunger, sleepiness, or illness.

[0337] Step 6:

[0338] The server simultaneously analyzes facial expression data using an emotion engine to determine the user's emotional state (e.g., tired, stressed). It focuses on analyzing the movement of facial muscles and eye movements.

[0339] Step 7:

[0340] The server integrates the baby's estimated state with the user's emotional state to generate the optimal response. This proposal includes specific action instructions tailored to the baby's state and support measures that take the user's emotional state into consideration.
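One way to picture the integration in Step 7 is a response built from two independent lookups, one keyed by the baby's state and one by the user's emotional state. This is a sketch only; the `Response` type and the support notes are hypothetical, and the feeding message is taken from the concrete example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    instruction: str   # action instruction tailored to the baby's state
    support_note: str  # wording adapted to the user's emotional state


INSTRUCTIONS = {
    "hunger": "Feeding is needed. We recommend preparing formula milk.",
    "sleepiness": "The baby may be getting sleepy.",
}

# Hypothetical support notes keyed by the user's emotional state.
SUPPORT_NOTES = {
    "tired": "Keep it simple and take one step at a time.",
    "stressed": "Take a breath first. This is manageable.",
}


def build_response(baby_state, user_emotion):
    """Combine both estimates into one response, with neutral fallbacks."""
    instruction = INSTRUCTIONS.get(baby_state, "Please check on the baby.")
    support_note = SUPPORT_NOTES.get(user_emotion, "")
    return Response(instruction, support_note)


response = build_response("hunger", "tired")
```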

[0341] Step 8:

[0342] The server generates and delivers countermeasures to the terminal. These include not only action instructions but also encouragement and advice that takes into account the user's psychological state.

[0343] Step 9:

[0344] The device sends notifications to the user. These notifications are displayed visually as pop-up messages, and audio and vibration feedback can also be configured.

[0345] Step 10:

[0346] The user checks the notification and takes action based on the suggested response. At this time, the user can input the results of their actions as feedback into their device.

[0347] Step 11:

[0348] The server receives feedback from users and updates the database based on that feedback. This makes it possible to improve the accuracy of future analyses.
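The database update in Step 11 might, under one set of assumptions, amount to appending a labeled sample whenever the feedback yields a reliable label. The record layout and the `apply_feedback` function below are hypothetical.

```python
def apply_feedback(records, features, predicted_state, confirmed, actual_state=None):
    """Return a new record list extended by one labeled sample, if any.

    If the user confirmed the prediction, the predicted state is used as
    the label; otherwise the state reported by the user is used. Feedback
    that provides neither leaves the records unchanged.
    """
    label = predicted_state if confirmed else actual_state
    if label is None:
        return list(records)
    return list(records) + [(features, label)]


records = [((450.0, 0.60, 1.2), "hunger")]
records = apply_feedback(
    records, (380.0, 0.35, 2.0), "hunger", confirmed=False, actual_state="sleepiness"
)
```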

[0349] (Example 2)

[0350] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0351] In childcare, quickly determining the cause of a baby's crying and appropriately communicating the necessary response to the caregiver is a challenging task. Furthermore, simply offering general solutions without considering the caregiver's emotional state can increase their burden. This invention aims to provide optimal support by considering not only the baby's condition but also the caregiver's emotional state.

[0352] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0353] In this invention, the server includes means for analyzing speech and extracting features, means for recognizing the emotional state of the caregiver, and means for integrating the aforementioned speech features and the caregiver's emotional state to generate an optimal response. This makes it possible to provide optimal childcare support based on the state of the person being cared for and the emotions of the caregiver.

[0354] An "electronic device" is a device used to collect digital audio data and to record the crying of a child.

[0355] A "data processing device" is a computer device used to analyze collected audio data and extract its characteristics.

[0356] A "central control unit" is a device that has the function of generating optimal childcare responses based on voice characteristics and the emotional state of the caregiver.

[0357] A "display device" is a device that notifies the caregiver of the generated countermeasures visually or audibly.

[0358] "Target of childcare" refers to babies and infants who are the target of support in this invention.

[0359] "The emotional state of a caregiver" refers to the emotional conditions that caregivers experience during childcare, such as stress, fatigue, and relaxation.

[0360] "Optimal response measures" refer to specific and practical instructions and recommendations for childcare that are generated based on the condition of the person being cared for and the emotional state of the caregiver.

[0361] "Training data" refers to information accumulated from past analysis results of speech and emotions, and is a dataset used as a standard or reference in current analysis.

[0362] This invention is a comprehensive system for supporting childcare, providing optimal childcare support by analyzing the baby's cries and the caregiver's emotional state. The system mainly consists of terminals, a server, and the caregiver.

[0363] The device is an electronic device equipped with a microphone and a camera, which is used to record the baby's cries and capture the caregiver's facial expressions. The recorded audio data and captured image data are transmitted to a server in real time.

[0364] The server performs key data processing for speech analysis and emotion recognition. This analysis is carried out by applying speech analysis algorithms to the speech data. It also processes image data received from the camera using emotion recognition software to determine the emotional state of the caregiver. The accuracy of the analysis is improved by comparing the speech characteristics and the caregiver's emotional state with past training data.

[0365] Past training data is accumulated by machine learning algorithms and forms the basis for the system to understand voice and facial expression patterns. Based on this, the server uses a generative AI model to create optimal responses. These responses are then communicated to caregivers as visual or audio messages. The notifications are specific and simple to enable caregivers to respond quickly.

[0366] To give a specific example, if a baby starts crying at night, the device quickly detects the crying and simultaneously assesses the caregiver's facial expression. For instance, if the device determines that the baby is "hungry" and the caregiver is "tired," it generates a message such as, "The baby needs to be fed. Please prepare the formula." This allows caregivers to efficiently resolve childcare challenges.

[0367] Examples of input prompts for the generative AI model include specific instructions such as, "The baby is crying. Based on the audio data and the user's emotions, please provide appropriate childcare solutions." This allows the system to provide accurate childcare support based on a combination of data.
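As an illustration of how such a prompt might be assembled from the analysis results, the following sketch appends the estimated states and features to the instruction quoted above. The field labels and the `build_prompt` function are assumptions, and no call to any generative AI service is shown.

```python
BASE_INSTRUCTION = (
    "The baby is crying. Based on the audio data and the user's emotions, "
    "please provide appropriate childcare solutions."
)


def build_prompt(baby_state, user_emotion, features):
    """Return the instruction followed by the analysis results as plain text."""
    feature_text = ", ".join(f"{key}={value}" for key, value in sorted(features.items()))
    return "\n".join(
        [
            BASE_INSTRUCTION,
            f"Estimated baby state: {baby_state}",
            f"User emotion: {user_emotion}",
            f"Audio features: {feature_text}",
        ]
    )


prompt = build_prompt("hungry", "tired", {"pitch_hz": 450, "intensity": 0.6})
```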

[0368] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0369] Step 1:

[0370] The device uses a microphone to collect the baby's cries in real time. The cries, as input, are converted into digital audio data and stored on the device. Simultaneously, a camera is used to capture the caregiver's facial expressions and record them as image data. This allows for the simultaneous collection of audio and image data.

[0371] Step 2:

[0372] The device transmits the collected audio and image data to the server via the internet. During this transmission process, the data is encrypted to ensure data security. The output is the transmitted audio and image dataset.

[0373] Step 3:

[0374] The server feeds the received audio data to a speech analysis algorithm. Data processing is performed to extract features such as pitch, intensity, and rhythm from the input audio data. This output feature data is then input to a machine learning model to classify the baby's state into categories such as "hungry," "sleepy," and "uncomfortable."

[0375] Step 4:

[0376] The server analyzes the emotional state of the caregiver using image data obtained from the camera. Emotion recognition software evaluates the movement of facial muscles and facial patterns, and determines emotions such as "stress" and "fatigue" from the input images. This process yields emotional state as feature data.

[0377] Step 5:

[0378] The server integrates the obtained voice characteristics with the caregiver's emotional state and generates the optimal response using a generative AI model. Here, voice data and emotional data are processed as input, and the necessary actions and recommendations are concretized. The output is an action plan optimized for both the baby and the caregiver.

[0379] Step 6:

[0380] The server sends the generated countermeasures to the terminal. The terminal receives this information and notifies the caregiver using the display screen or speaker. The notification includes visual messages and audio guidance to deliver quick and clear instructions to the caregiver.

[0381] Through these steps, the system can process the input data and provide parents with rational and practical solutions to difficult situations in childcare.

[0382] (Application Example 2)

[0383] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0384] In today's childcare environment, it is difficult to respond appropriately to a baby's crying, and the challenge is to alleviate the mental and physical burden on caregivers while providing prompt and effective childcare support. Furthermore, because existing support does not take the caregiver's own emotional state into consideration, it is not truly meaningful for the caregiver.

[0385] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0386] In this invention, the server includes acoustic detection means for collecting the baby's voice, visual information acquisition means for capturing the caregiver's facial expressions and analyzing their emotions, and a central processing unit for estimating the baby's and caregiver's conditions based on the characteristics extracted from the voice and the caregiver's emotions, and for generating countermeasures. This enables comprehensive childcare support that takes into account both the baby's condition and the caregiver's emotions.

[0387] "Acoustic detection means" refers to a sensor device used to collect the sound of a baby crying.

[0388] A "processing device" is a device used to analyze collected audio data and extract its characteristics.

[0389] A "visual information acquisition device" is a device that captures the facial expressions of a caregiver and analyzes their emotional state.

[0390] A "central processing unit" is a computing device that estimates the baby's condition and the caregiver's condition based on audio and visual data, and generates the optimal course of action.

[0391] An "output device" is a device that provides notifications to caregivers visually or audibly.

[0392] This invention relates to a system that analyzes a baby's voice and the caregiver's emotions to provide optimal childcare support. The system comprises an acoustic detection means, a visual information acquisition means, a central processing unit, and an output device.

[0393] First, an acoustic detection system collects the baby's cries. This hardware includes an audio input device such as a high-sensitivity microphone. For visual information acquisition, a camera is used to capture the caregiver's facial expressions. Specific examples include a standard USB camera or a camera built into a smartphone.

[0394] The server analyzes the collected audio data using a processing unit and extracts its features. This process utilizes an audio analysis algorithm to analyze the patterns of the baby's cries. Next, the facial expression data of the caregiver, obtained using visual information acquisition means, is analyzed by emotion recognition software. Specific software used includes "EmotionEngine," among others.

[0395] Based on the analysis results, the central processing unit infers the baby's condition and the caregiver's emotions, and generates the optimal course of action based on this information. This involves matching the data with a database that includes machine learning models. The generated course of action is notified to the caregiver visually or audibly through an output device. Specific examples include push notifications and voice guidance via smartphones or smart glasses.

[0396] This invention makes it possible to provide accurate childcare support that takes into account the baby's crying and the caregiver's condition, thereby reducing the burden on the caregiver.

[0397] Example of a prompt:

[0398] "Enter a Japanese sentence. For example, 'The baby has started crying. The mother appears tired.' Please generate the best parenting advice for this sentence."

[0399] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0400] Step 1:

[0401] The user activates the acoustic detection device near the baby. The acoustic detection device collects the baby's cries through a microphone. The input is real-time audio data, which is converted into digital data and sent to the processing unit as an audio signal. The output is digital audio data to be analyzed.

[0402] Step 2:

[0403] The server analyzes the audio data acquired using a processing unit. It receives digital audio data as input and applies an audio analysis algorithm. By extracting features from the data and analyzing the patterns of the baby's cries, it infers the baby's state (e.g., hungry, sleepy, etc.). The output is information indicating the baby's state.

[0404] Step 3:

[0405] The user activates a visual information acquisition device and captures their facial expressions with a camera. The input is real-time video data, which is sent to the server as a digital image. The output is video data ready for emotion analysis.

[0406] Step 4:

[0407] The server analyzes the user's video data using emotion recognition software. It receives digital video data as input and applies an emotion analysis algorithm. It determines the user's emotional state (e.g., stress, fatigue) and generates information indicating the emotional state as output.

[0408] Step 5:

[0409] The server integrates baby state information obtained from voice analysis and user emotional state information obtained from emotion analysis in the central processing unit. The input consists of both the baby's state and the user's emotional state. This information is compared with a machine learning model in the database to generate the optimal response. The output is a message containing specific childcare support measures to notify the user.

[0410] Step 6:

[0411] The server notifies the user of generated childcare support messages visually or audibly through an output device. The input is the generated message, and the output is the information the user receives. Specific examples include push notifications and voice guidance via smartphones and smart glasses.

[0412] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0413] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, and inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0414] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0415] [Third Embodiment]

[0416] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0417] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0418] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0419] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0420] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0421] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0422] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0423] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0424] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0425] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0426] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0427] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0428] This invention is a system that suggests appropriate responses to caregivers based on the sounds of a baby crying. This system is realized by collecting audio via a smart device and analyzing that information on a server.

[0429] First, the device monitors ambient sounds through its built-in microphone and begins recording when it detects a baby crying. This recorded data is then transmitted to a server via a secure network.

[0430] Next, the server analyzes the received audio data. This involves using audio processing algorithms to extract features such as frequency, amplitude, and duration of the sound. This analysis identifies crying patterns and classifies the baby's condition (hungry, sleepy, stressed, unwell, etc.).

[0431] The server then uses its pre-programmed learning database to predict the most appropriate course of action for the baby's condition. For example, if the baby is determined to be hungry, it will generate specific advice such as, "It's time to breastfeed."

[0432] The server then sends this solution to the device, and the user is notified. The device displays this as a pop-up message or push notification, providing the information in a way that is easy for caregivers to check and act upon.

[0433] Finally, users can provide feedback to the system about the actions they actually took. This allows the server to learn from the new data and further improve the accuracy of future analyses.

[0434] As a concrete example, consider a scenario where a baby starts crying. At this point, the device records the audio, and the server notifies the user of the analysis result, stating, "The baby's diaper may be wet." The user checks the diaper, and if it is indeed wet, provides feedback, which updates the database. This collaboration improves the quality of service in subsequent instances.

[0435] The following describes the processing flow.

[0436] Step 1:

[0437] The device constantly monitors ambient noise and starts recording when it detects a baby's crying exceeding a certain decibel level.
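The level-based trigger in Step 1 can be sketched as follows. The description speaks of a decibel level without fixing a reference; this sketch assumes samples normalized to the range -1.0 to 1.0 and measures level in dB relative to full scale (dBFS). The threshold value and function names are hypothetical, and the sketch checks only level, not whether the sound is actually a cry.

```python
import math


def level_dbfs(frame):
    """Return the RMS level of one frame in dBFS (samples in -1.0..1.0)."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return -math.inf if rms == 0 else 20 * math.log10(rms)


def first_trigger(frames, threshold_dbfs=-20.0):
    """Return the index of the first frame above the threshold, or None."""
    for index, frame in enumerate(frames):
        if level_dbfs(frame) > threshold_dbfs:
            return index
    return None


quiet = [0.01] * 160  # about -40 dBFS
loud = [0.5] * 160    # about -6 dBFS
start = first_trigger([quiet, quiet, loud])
```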

[0438] Step 2:

[0439] The device compresses the recorded audio data and sends it to the server using a secure protocol.

[0440] Step 3:

[0441] The server decompresses the received audio data and performs signal analysis. It extracts features such as frequency and volume to analyze the crying patterns.
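The compression in Step 2 and the decompression in Step 3 can be illustrated with a lossless round trip. The description does not specify a codec; this sketch assumes 16-bit little-endian PCM samples and general-purpose zlib compression purely for illustration.

```python
import struct
import zlib


def pack_audio(samples):
    """Serialize 16-bit samples as little-endian PCM and compress them."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return zlib.compress(raw)


def unpack_audio(blob):
    """Decompress a blob from pack_audio and return the sample list."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


samples = [0, 1000, -1000, 32767, -32768] * 200
blob = pack_audio(samples)       # terminal side (Step 2)
restored = unpack_audio(blob)    # server side (Step 3)
```

Because zlib is lossless, the server recovers exactly the samples the terminal recorded; a dedicated audio codec would typically compress real recordings further.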

[0442] Step 4:

[0443] The server applies crying patterns to a machine learning model and compares them with past training data to infer the baby's condition (hunger, sleepiness, wet diaper, health problems, etc.).

[0444] Step 5:

[0445] The server generates advice on countermeasures based on the inferred state. This includes specific action points and suggestions for childcare products.

[0446] Step 6:

[0447] The server sends the generated advice back to the terminal.

[0448] Step 7:

[0449] The device notifies the user of any advice it receives. Notifications are delivered via screen display, audio output, vibration, etc.

[0450] Step 8:

[0451] The user reviews the advice and, based on it, takes specific action to care for the baby.

[0452] Step 9:

[0453] The user inputs the results of the actions they have taken into the terminal and provides feedback to the system.

[0454] Step 10:

[0455] The server receives user feedback and updates the database to improve the accuracy of future analyses.

[0456] (Example 1)

[0457] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."

[0458] In modern times, caregivers are required to quickly and accurately determine the cause of a baby's crying, but the technology to achieve this is still not sufficiently developed. This is especially burdensome for caregivers with little experience in distinguishing between newborn cries, so there is a need for a system that can more accurately identify the cause of crying and suggest appropriate responses.

[0459] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0460] In this invention, the server includes means for using a device that collects acoustic data and detects specific sound patterns; means for using a processing device that analyzes the acoustic data and extracts features such as frequency, amplitude, and duration; and means for inferring the cause of the sound by comparing the features with previously stored data. This enables caregivers to obtain quick and accurate countermeasures.

[0461] "Acoustic data" refers to data that electronically records sound information from the environment and is used to analyze specific sound patterns.

[0462] A "device" is a general term for computing equipment used to collect acoustic data and detect specific sound patterns.

[0463] A "processing device" is a computer system that analyzes received acoustic data and extracts features such as frequency, amplitude, and duration.

[0464] "Accumulated data" refers to audio data and its analysis results that have been collected in the past and stored within the system, and is used for comparison with new audio data.

[0465] A "generative AI model" is an artificial intelligence model that utilizes machine learning techniques to perform speech analysis and has the ability to continuously improve the accuracy of pattern recognition.

[0466] A "specific sound pattern" is a set of sound features that occur under specific circumstances, such as a baby crying, and is detected by the device based on set conditions.

[0467] This invention is a system that suggests appropriate responses to caregivers based on a baby's cries, and functions effectively through the collection, analysis, and notification of acoustic data. The system's components are a terminal, a server, and a user.

[0468] The device monitors ambient sounds using a high-sensitivity microphone built into a device such as a smartphone or tablet. When a baby's crying is detected, the sound is recorded and stored as audio data. The recorded data is encrypted for security purposes and sent to the server via a secure communication protocol (e.g., HTTPS).

[0469] After receiving acoustic data, the server analyzes the data using an audio processing algorithm. This algorithm extracts features such as frequency, amplitude, and duration from the audio. The features obtained from the analysis are compared with stored data, and a generative AI model is used to identify specific patterns in a baby's crying. This allows for the estimation of causes such as hunger, lack of sleep, stress, or poor health.
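
As a rough illustration of the feature extraction described above, the following sketch derives the three named features (frequency, amplitude, and duration) from raw samples. It is an assumption-laden simplification: frequency is estimated from zero crossings rather than from a full spectral analysis, amplitude is taken as the RMS value, and the `CryFeatures` and `extract_features` names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CryFeatures:
    frequency_hz: float  # rough estimate from the zero-crossing count
    amplitude: float     # RMS amplitude, on the same scale as the samples
    duration_s: float    # length of the recording in seconds


def extract_features(samples: Sequence[float], sample_rate: int) -> CryFeatures:
    """Extract simple frequency, amplitude, and duration features."""
    n = len(samples)
    if n == 0 or sample_rate <= 0:
        raise ValueError("need at least one sample and a positive sample rate")
    duration = n / sample_rate
    rms = math.sqrt(sum(s * s for s in samples) / n)
    # Each full cycle of a tone crosses zero twice.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    frequency = crossings / (2.0 * duration)
    return CryFeatures(frequency, rms, duration)
```

The zero-crossing estimate is only reasonable for signals dominated by one tone; a production system would more plausibly use a spectral method, as Step 3 of the flow in Example 1 suggests.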

[0470] For example, if the server determines that "the baby is complaining of diaper discomfort," it sends that information to the device. The device then notifies the caregiver of this prediction and displays it as a message including specific action instructions. This allows the caregiver to quickly take care of the baby.

[0471] After users have taken action, they can provide feedback on the results to the system. This feedback information is sent to the server and used to train the generative AI model, leading to improvements in the accuracy of future analyses.
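
One way the feedback loop described above might be recorded on the server is sketched below. The source does not specify a storage format or a training procedure, so the `FeedbackLog` class, its fields, and the idea of tracking how often each predicted cause was confirmed are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


class FeedbackLog:
    """Keeps caregiver feedback so predictions can be evaluated and retrained."""

    def __init__(self) -> None:
        # predicted cause -> [times confirmed, times predicted]
        self._counts = defaultdict(lambda: [0, 0])
        # (features, cause reported by the caregiver) pairs for later retraining
        self.labeled_examples: List[Tuple[tuple, str]] = []

    def record(self, features: tuple, predicted: str, actual: str) -> None:
        stats = self._counts[predicted]
        stats[1] += 1
        if predicted == actual:
            stats[0] += 1
        self.labeled_examples.append((features, actual))

    def precision(self, cause: str) -> Optional[float]:
        """Fraction of predictions of `cause` that the caregiver confirmed."""
        confirmed, total = self._counts.get(cause, (0, 0))
        return confirmed / total if total else None
```

For example, after `log.record((450.0, 0.6, 1.2), "hunger", "wet diaper")` the pair is kept as a corrected training example, and `log.precision("hunger")` reflects the miss.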

[0472] An example of a prompt might be, "Please identify the reason why the baby started crying and provide specific solutions." Based on this prompt, the system generates appropriate advice for the caregiver.

[0473] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0474] Step 1:

[0475] The device constantly monitors ambient sounds using its built-in microphone. Recording begins immediately upon detection of a specific sound pattern, such as a baby crying. The input is real-time acoustic data, and the output is recorded audio data. Specifically, noise filtering is applied to reduce background noise and clearly capture the baby's crying.

[0476] Step 2:

[0477] The terminal sends recorded audio data to the server using a secure communication protocol (such as HTTPS). The input is the recorded audio file, and the output is the transmission of encrypted audio data. The specific operation includes a process of data compression to improve the efficiency of the transfer.

[0478] Step 3:

[0479] The server analyzes the received audio data using an audio processing algorithm. This analysis extracts features such as frequency, amplitude, and duration from the audio. The input is the audio data sent to the server, and the output is the analyzed feature data. Specifically, the process involves using audio spectral analysis to quantify the features.

[0480] Step 4:

[0481] The server compares the feature data obtained from the analysis with the stored data and uses a generative AI model to infer the cause of the crying. The input is the feature data and the stored database, and the output is the category of the inferred cause. Specifically, the machine learning model is applied recursively, and pattern matching is performed.

[0482] Step 5:

[0483] The server generates a response plan for the caregiver based on the suspected cause and sends it to the terminal. The input is the category of the cause, and the output is a message containing specific action instructions for the caregiver. Specifically, the server selects the optimal action from the database.

[0484] Step 6:

[0485] The device notifies the caregiver of the received instructions. Input is a message from the server, and output is a pop-up or push notification on the device. Specific actions include the visual display of messages on the user interface.

[0486] Step 7:

[0487] The user provides care for the baby based on the instructions given and feeds the results back to the system. The input is the actions taken by the caregiver and their results, and the output is the data sent to the server as feedback information. Specifically, the caregiver provides feedback by selecting options and entering comments through the app.
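
The compression mentioned in Step 2 of this flow, and the matching decompression on the server, could look roughly like the sketch below. It assumes raw PCM bytes, uses general-purpose lossless `zlib` compression (a real terminal might use an audio codec instead), and wraps the result in a hypothetical JSON envelope with a SHA-256 checksum. Encryption is not shown; the text delegates it to the secure transport such as HTTPS.

```python
import base64
import hashlib
import json
import zlib
from typing import Tuple


def pack_recording(pcm_bytes: bytes, sample_rate: int) -> bytes:
    """Compress a recording and wrap it in a JSON body for upload."""
    compressed = zlib.compress(pcm_bytes, 6)
    envelope = {
        "sample_rate": sample_rate,
        # Checksum of the uncompressed audio, verified after decompression.
        "sha256": hashlib.sha256(pcm_bytes).hexdigest(),
        "audio_zlib_b64": base64.b64encode(compressed).decode("ascii"),
    }
    return json.dumps(envelope).encode("utf-8")


def unpack_recording(body: bytes) -> Tuple[bytes, int]:
    """Server side: decode, decompress, and verify the uploaded recording."""
    envelope = json.loads(body.decode("utf-8"))
    pcm = zlib.decompress(base64.b64decode(envelope["audio_zlib_b64"]))
    if hashlib.sha256(pcm).hexdigest() != envelope["sha256"]:
        raise ValueError("audio payload failed integrity check")
    return pcm, envelope["sample_rate"]
```

The envelope field names are illustrative only; any format that lets the server recover the samples and the sample rate would serve the same purpose.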

[0488] (Application Example 1)

[0489] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0490] Accurately interpreting a baby's cries and promptly providing appropriate advice to caregivers is crucial in reducing the burden of childcare. However, conventional childcare support systems have limitations in voice analysis and lack the ability to accurately predict various conditions of a baby. Furthermore, there is room for improvement in methods that shorten the time it takes for caregivers to take action and support specific childcare actions. To address these challenges, a system is needed that provides more accurate voice analysis and rapid, accurate childcare support based on that analysis.

[0491] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0492] In this invention, the server includes an information processing means for collecting the baby's voice along with environmental information, an information processing device for analyzing the voice and extracting features, and an information processing device for inferring the baby's condition based on the features and generating countermeasures. This makes it possible to classify the diverse conditions of the baby with high accuracy and to quickly provide specific action guidelines to caregivers.

[0493] An "information processing means" is a device for collecting audio and environmental information and processing it as data.

[0494] An "information processing device" is a device that extracts and analyzes features from collected audio data.

[0495] A "machine" is a device that operates automatically based on notified countermeasures to assist with childcare.

[0496] A "past learning data recording device" is a device that stores previously collected data and learning results, and compares them with new data.

[0497] "Action instructions" are specific guidance that elicits particular actions and indicates how caregivers or machines should respond.

[0498] To implement this invention, it is necessary to construct a system that collects a baby's voice and provides appropriate advice to the caregiver. This system consists of an information processing means, an information processing device, a machine, and a device for recording past learning data.

[0499] The device is installed as a home robot and uses a high-sensitivity microphone to constantly monitor ambient sounds. When it detects a baby crying, it records the audio data and sends it to a server. The server uses speech processing software such as Google Cloud Speech-to-Text or IBM Watson to analyze the transmitted audio data and extract features such as frequency, amplitude, and duration. By comparing the extracted features with past training data from a recording device, the system identifies the baby's condition and generates the optimal response accordingly.

[0500] The generated countermeasures are sent to a home robot, which acts as the terminal, and the robot provides the information to the caregiver using voice and a display. For example, if the baby starts crying in the middle of the night, the caregiver will receive a notification saying, "The baby may be getting sleepy." Furthermore, the robot will take actions such as playing a gentle lullaby to soothe the baby.

[0501] Thus, the present invention aims to provide prompt and accurate childcare support based on the sounds of a baby crying, thereby reducing the burden on childcare providers.

[0502] An example of a prompt message for the generative AI model would be: "I have recorded my baby crying. Please send this data to the server and generate optimal parenting advice based on the analysis results."

[0503] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0504] Step 1:

[0505] The device constantly monitors ambient sounds using its built-in high-sensitivity microphone, and when it detects a baby crying, it records the sound. This recorded data becomes the input. The recorded audio data is processed as a digital signal and prepared for transmission to the server.

[0506] Step 2:

[0507] The server receives audio data sent from the terminal. The input audio data is analyzed using Google Cloud Speech-to-Text or IBM Watson. Data processing is performed to extract features such as the frequency characteristics, amplitude, and duration of the audio, and the crying pattern is identified based on the results.

[0508] Step 3:

[0509] The server predicts the baby's state (hunger, sleepiness, stress, etc.) by using a machine learning algorithm to compare the extracted audio features with recorded past training data. The computation used here is pattern matching by the machine learning algorithm. The output is the predicted state.

[0510] Step 4:

[0511] The server generates appropriate countermeasures based on the estimated state of the baby. This generation process refers to a pre-entered database of childcare advice and outputs the most suitable message. For example, it might generate specific advice such as, "The baby seems sleepy."

[0512] Step 5:

[0513] Notification information is sent to the device. The device receives the message and provides the information to the caregiver through a display or audio output device. The information is displayed using an intuitive interface to make it easier for the caregiver to check the appropriate course of action.

[0514] Step 6:

[0515] Users (caregivers) take specific caregiving actions based on the advice they receive. For example, they might try to get their baby to sleep or breastfeed. They can provide feedback on the actions they take, which helps improve the accuracy of future analyses.
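
Steps 3 and 4 of this flow (comparing extracted features with past data, then looking up advice) can be illustrated with a deliberately simple stand-in. The sketch uses a nearest-neighbor comparison, which is a substitute chosen for brevity and not the machine learning algorithm of the disclosure. The stored examples, the scaling constants, and all advice messages other than "The baby seems sleepy." are hypothetical.

```python
import math
from typing import List, Tuple

Features = Tuple[float, float, float]  # (frequency_hz, amplitude, duration_s)

# Hypothetical past data: features paired with a confirmed state.
PAST_DATA: List[Tuple[Features, str]] = [
    ((450.0, 0.60, 1.2), "hunger"),
    ((470.0, 0.65, 1.0), "hunger"),
    ((350.0, 0.30, 2.5), "sleepiness"),
    ((340.0, 0.25, 2.8), "sleepiness"),
    ((520.0, 0.80, 0.6), "stress"),
]

# Hypothetical advice database keyed by state.
ADVICE = {
    "hunger": "The baby may be hungry. Try feeding.",
    "sleepiness": "The baby seems sleepy.",
    "stress": "The baby may be stressed. Try holding and soothing.",
}

_RANGES = (500.0, 1.0, 3.0)  # assumed magnitudes used to put features on one scale


def _scale(features: Features) -> tuple:
    return tuple(value / span for value, span in zip(features, _RANGES))


def predict_state(features: Features, past_data=PAST_DATA) -> str:
    """Return the state of the closest stored example (1-nearest neighbor)."""
    target = _scale(features)
    nearest = min(past_data, key=lambda item: math.dist(_scale(item[0]), target))
    return nearest[1]


def generate_advice(features: Features) -> Tuple[str, str]:
    """Return the predicted state and the advice message stored for it."""
    state = predict_state(features)
    return state, ADVICE.get(state, "Please check on the baby.")
```

Scaling matters here because frequency in hertz would otherwise dominate the distance; the constants above are placeholders, not tuned values.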

[0516] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0517] This invention is a system that aims to provide optimal childcare support by considering not only the baby's cries but also the caregiver's emotional state. This system provides comprehensive support to caregivers by integrating voice analysis, emotion recognition, generation of appropriate countermeasures, and notification thereof.

[0518] The device collects the baby's cries using a microphone as usual and sends the recorded data to a server. At the same time, a camera on the device captures the user's facial expressions and provides them to the emotion engine.

[0519] The server processes the baby's crying data using a voice analysis algorithm to extract features. This result is then compared with a machine learning model to estimate the baby's state. Additionally, based on the user's facial expressions captured by the camera, an emotion engine analyzes the user's emotions and determines their state (e.g., stress, fatigue, calmness).

[0520] Based on this analysis, the server creates a response optimized for the baby's condition and the user's emotions. For example, if the baby is hungry and the user is tired, it provides simple, quick action plans and a list of recommended baby products.

[0521] The server then sends the generated notification to the device, which provides the user with visual and auditory notifications. The notification content is designed to be easy to understand and act upon, taking into account the user's situation, with the aim of reducing parenting stress.

[0522] As a concrete example, consider a scenario where a baby starts crying at night. The device detects the crying and sends it to the server, while simultaneously capturing the user's facial expression with a camera. If the server determines the baby is "hungry" and the user is "tired," a specific and simple message such as "Feeding is needed. We recommend preparing formula milk" is generated and sent as a notification from the device. This allows caregivers to take appropriate action quickly, reducing their burden.

[0523] The following describes the processing flow.

[0524] Step 1:

[0525] The device constantly monitors the surrounding sounds using a microphone and records the baby's cries. It is configured to start recording when it detects sounds exceeding a certain volume or frequency.

[0526] Step 2:

[0527] The device uses its camera to capture the user's facial expressions. It applies facial recognition technology and collects facial expression data in real time.

[0528] Step 3:

[0529] The device formats the recorded audio and facial expression data appropriately and sends it to the server using a secure protocol.

[0530] Step 4:

[0531] The server receives the audio data and uses an audio analysis algorithm to extract the characteristics of the crying sounds. This analysis evaluates data such as sound patterns, length, and intervals.

[0532] Step 5:

[0533] The server uses a pre-trained machine learning model to analyze the characteristics of the baby's cries and predict the baby's condition. The condition is then categorized into specific states such as hunger, sleepiness, or illness.

[0534] Step 6:

[0535] The server simultaneously analyzes facial expression data using an emotion engine to determine the user's emotional state (e.g., tired, stressed). It focuses on analyzing the movement of facial muscles and eye movements.

[0536] Step 7:

[0537] The server integrates the baby's estimated state with the user's emotional state to generate the optimal response. This response includes specific action instructions tailored to the baby's state and support measures that take the user's emotional state into consideration.

[0538] Step 8:

[0539] The server generates and delivers countermeasures to the terminal. These include not only action instructions but also encouragement and advice that takes into account the user's psychological state.

[0540] Step 9:

[0541] The device sends notifications to the user. These notifications are displayed visually as pop-up messages, and audio and vibration feedback can also be configured.

[0542] Step 10:

[0543] The user checks the notification and takes action based on the suggested response. At this time, the user can input the results of their actions as feedback into their device.

[0544] Step 11:

[0545] The server receives feedback from users and updates the database based on that feedback. This makes it possible to improve the accuracy of future analyses.
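
The integration in Step 7 of this flow, where the baby's estimated state and the user's emotional state jointly shape the response, might be sketched as below. The rule that a tired or stressed user receives a brief message with product suggestions follows the example given in the text; the `Response` type, the state and emotion labels, and the fallback message are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    action: str             # what the caregiver is asked to do
    tone: str               # how the message should be worded
    suggest_products: bool  # whether to attach recommended baby products


# Hypothetical action table; the hunger message follows the example in the text.
BASE_ACTIONS = {
    "hunger": "Feeding is needed. We recommend preparing formula milk.",
    "sleepiness": "The baby seems sleepy. Try settling the baby to sleep.",
    "illness": "The baby may be unwell. Check the temperature.",
}


def build_response(baby_state: str, user_emotion: str) -> Response:
    """Combine the baby's state and the user's emotion into one response."""
    action = BASE_ACTIONS.get(
        baby_state, "Check on the baby and observe for a few minutes."
    )
    if user_emotion in ("tired", "stressed"):
        # Keep it short and supportive, and add product suggestions.
        return Response(action, tone="brief and encouraging", suggest_products=True)
    return Response(action, tone="detailed", suggest_products=False)
```

A table-driven rule like this is only a placeholder for whatever model the server actually uses to generate the response.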

[0546] (Example 2)

[0547] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0548] In childcare, quickly determining the cause of a baby's crying and appropriately communicating the necessary response to the caregiver is a challenging task. Furthermore, simply offering general solutions without considering the caregiver's emotional state can increase their burden. This invention aims to provide optimal support by considering not only the baby's condition but also the caregiver's emotional state.

[0549] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0550] In this invention, the server includes means for analyzing speech and extracting features, means for recognizing the emotional state of the caregiver, and means for integrating the aforementioned speech features and the caregiver's emotional state to generate an optimal response. This makes it possible to provide optimal childcare support based on the state of the person being cared for and the emotions of the caregiver.

[0551] An "electronic device" is a device used to collect digital audio data and to record the crying of a child.

[0552] A "data processing device" is a computer device used to analyze collected audio data and extract its characteristics.

[0553] A "central control unit" is a device that has the function of generating optimal childcare responses based on voice characteristics and the emotional state of the caregiver.

[0554] A "display device" is a device that notifies the caregiver of the generated countermeasures visually or audibly.

[0555] "Target of childcare" refers to babies and infants who are the target of support in this invention.

[0556] "The emotional state of a caregiver" refers to the emotional conditions that caregivers experience during childcare, such as stress, fatigue, and relaxation.

[0557] "Optimal response measures" refer to specific and practical instructions and recommendations for childcare that are generated based on the condition of the person being cared for and the emotional state of the caregiver.

[0558] "Training data" refers to information accumulated from past analysis results of speech and emotions, and is a dataset used as a standard or reference in current analysis.

[0559] This invention is a comprehensive system for supporting childcare, providing optimal childcare support by analyzing the baby's cries and the caregiver's emotional state. The system mainly consists of terminals, a server, and the caregiver.

[0560] The device is an electronic device equipped with a microphone and a camera, which is used to record the baby's cries and capture the caregiver's facial expressions. The recorded audio data and captured image data are transmitted to a server in real time.

[0561] The server performs key data processing for speech analysis and emotion recognition. This analysis is carried out by applying speech analysis algorithms to the speech data. It also processes image data received from the camera using emotion recognition software to determine the emotional state of the caregiver. The accuracy of the analysis is improved by comparing the speech characteristics and the caregiver's emotional state with past training data.

[0562] Past training data is accumulated by machine learning algorithms and forms the basis for the system to understand voice and facial expression patterns. Based on this, the server uses a generative AI model to create optimal responses. These responses are then communicated to caregivers as visual or audio messages. The notifications are specific and simple to enable caregivers to respond quickly.

[0563] To give a specific example, if a baby starts crying at night, the device quickly detects the crying and simultaneously assesses the caregiver's facial expression. For instance, if the device determines that the baby is "hungry" and the caregiver is "tired," it generates a message such as, "The baby needs to be fed. Please prepare the formula." This allows caregivers to efficiently resolve childcare challenges.

[0564] Examples of input prompts for the generative AI model include specific instructions such as, "The baby is crying. Based on the audio data and the user's emotions, please provide appropriate childcare solutions." This allows the system to provide accurate childcare support based on a combination of data.
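
To make the combination of data in such a prompt concrete, the following sketch assembles a text prompt from extracted audio features and the recognized caregiver emotion. The instruction sentence is the one quoted above; the layout of the remaining lines, the feature names, and the `build_prompt` function are assumptions, since the source does not specify how inference data is attached to the prompt.

```python
from typing import Mapping


def build_prompt(features: Mapping[str, float], caregiver_emotion: str) -> str:
    """Assemble a prompt for the generative AI model from analysis results."""
    # Sort feature names so the same inputs always give the same prompt.
    feature_lines = "\n".join(
        f"- {name}: {value}" for name, value in sorted(features.items())
    )
    return (
        "The baby is crying. Based on the audio data and the user's emotions, "
        "please provide appropriate childcare solutions.\n\n"
        "Audio features:\n"
        f"{feature_lines}\n\n"
        f"Caregiver emotion: {caregiver_emotion}\n"
    )
```

Calling `build_prompt({"pitch_hz": 450, "intensity": 0.6}, "tired")` yields the quoted instruction followed by the two feature lines and the emotion label.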

[0565] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0566] Step 1:

[0567] The device uses a microphone to collect the baby's cries in real time. The cries, as input, are converted into digital audio data and stored on the device. Simultaneously, a camera is used to capture the caregiver's facial expressions and record them as image data. This allows for the simultaneous collection of audio and image data.

[0568] Step 2:

[0569] The device transmits the collected audio and image data to the server via the internet. During this transmission process, the data is encrypted to ensure data security. The output is the transmitted audio and image dataset.

[0570] Step 3:

[0571] The server feeds the received audio data to a speech analysis algorithm. Data processing is performed to extract features such as pitch, intensity, and rhythm from the input audio data. This output feature data is then compared with a machine learning model to classify the baby's state into categories such as "hungry," "sleepy," and "uncomfortable."

[0572] Step 4:

[0573] The server analyzes the emotional state of the caregiver using image data obtained from the camera. Emotion recognition software evaluates the movement of facial muscles and facial patterns, and determines emotions such as "stress" and "fatigue" from the input images. This process yields emotional state as feature data.

[0574] Step 5:

[0575] The server integrates the obtained voice characteristics with the caregiver's emotional state and generates the optimal response using a generative AI model. Here, voice data and emotional data are processed as input, and the necessary actions and recommendations are concretized. The output is an action plan optimized for both the baby and the caregiver.

[0576] Step 6:

[0577] The server sends the generated countermeasures to the terminal. The terminal receives this information and notifies the caregiver using the display screen or speaker. The notification includes visual messages and audio guidance to deliver quick and clear instructions to the caregiver.

[0578] Through these steps, the system can process the input data and provide parents with rational and practical solutions to difficult situations in childcare.
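
Of the features named in Step 3 of this flow, rhythm is the least self-explanatory. One plausible reading, sketched below under stated assumptions, treats rhythm as the pattern of cry bursts and the pauses between them. The code assumes that per-frame energy values are already available and that a fixed energy threshold separates crying from silence; `find_bursts` and `rhythm_features` are hypothetical names.

```python
from typing import Dict, List, Sequence, Tuple


def find_bursts(
    frame_energies: Sequence[float], threshold: float, frame_s: float
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for each run of frames above the threshold."""
    bursts = []
    start = None
    for i, energy in enumerate(frame_energies):
        if energy > threshold and start is None:
            start = i  # a burst begins
        elif energy <= threshold and start is not None:
            bursts.append((start * frame_s, i * frame_s))
            start = None  # the burst ended on the previous frame
    if start is not None:
        # The recording ended while a burst was still in progress.
        bursts.append((start * frame_s, len(frame_energies) * frame_s))
    return bursts


def rhythm_features(bursts: Sequence[Tuple[float, float]]) -> Dict[str, float]:
    """Summarize burst count, mean burst length, and mean pause length."""
    lengths = [end - start for start, end in bursts]
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(bursts, bursts[1:])]

    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else 0.0

    return {
        "burst_count": len(bursts),
        "mean_burst_s": mean(lengths),
        "mean_gap_s": mean(gaps),
    }
```

These summary values could then sit alongside pitch and intensity in the feature data passed to the classifier.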

[0579] (Application Example 2)

[0580] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0581] In today's childcare environment, where it is difficult to respond appropriately to a baby's crying, the challenge is to alleviate the mental and physical burden on caregivers and provide prompt and effective childcare support. Furthermore, because the emotional state of the caregiver themselves is not taken into consideration, the current support is not truly meaningful for them.

[0582] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0583] In this invention, the server includes acoustic detection means for collecting the baby's voice, visual information acquisition means for capturing the caregiver's facial expressions and analyzing their emotions, and a central processing unit for estimating the baby's and caregiver's conditions based on the aforementioned characteristics and the caregiver's emotions, and for generating countermeasures. This enables comprehensive childcare support that takes into account both the baby's condition and the caregiver's emotions.

[0584] "Acoustic detection means" refers to a sensor device used to collect the sound of a baby crying.

[0585] A "processing device" is a device used to analyze collected audio data and extract its characteristics.

[0586] A "visual information acquisition device" is a device that captures the facial expressions of a caregiver and analyzes their emotional state.

[0587] A "central processing unit" is a computing device that estimates the baby's condition and the caregiver's condition based on audio and visual data, and generates the optimal course of action.

[0588] An "output device" is a device that provides notifications to caregivers visually or audibly.

[0589] This invention relates to a system that analyzes a baby's voice and the caregiver's emotions to provide optimal childcare support. The system comprises an acoustic detection means, a visual information acquisition means, a central processing unit, and an output device.

[0590] First, an acoustic detection system collects the baby's cries. This hardware includes an audio input device such as a high-sensitivity microphone. For visual information acquisition, a camera is used to capture the caregiver's facial expressions. Specific examples include a standard USB camera or a camera built into a smartphone.

[0591] The server analyzes the collected audio data using a processing unit and extracts its features. This process utilizes an audio analysis algorithm to analyze the patterns of the baby's cries. Next, the facial expression data of the caregiver, obtained using visual information acquisition means, is analyzed by emotion recognition software. Specific software used includes "EmotionEngine," among others.

[0592] Based on the analysis results, the central processing unit infers the baby's condition and the caregiver's emotions, and generates the optimal course of action based on this information. This involves matching the data with a database that includes machine learning models. The generated course of action is notified to the caregiver visually or audibly through an output device. Specific examples include push notifications and voice guidance via smartphones or smart glasses.

[0593] This invention makes it possible to provide accurate childcare support that takes into account the baby's crying and the caregiver's condition, thereby reducing the burden on the caregiver.

[0594] Example of a prompt:

[0595] "Enter a Japanese sentence. For example, 'The baby has started crying. The mother appears tired.' Please generate the best parenting advice for this sentence."

[0596] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0597] Step 1:

[0598] The user activates the acoustic detection device near the baby. The acoustic detection device collects the baby's cries through a microphone. The input is real-time audio data, which is converted into digital data and sent to the processing unit as an audio signal. The output is digital audio data to be analyzed.

[0599] Step 2:

[0600] The server analyzes the audio data acquired using a processing unit. It receives digital audio data as input and applies an audio analysis algorithm. By extracting features from the data and analyzing the patterns of the baby's cries, it infers the baby's state (e.g., hungry, sleepy, etc.). The output is information indicating the baby's state.

[0601] Step 3:

[0602] The user activates a visual information acquisition device and captures their facial expressions with a camera. The input is real-time video data, which is sent to the server as a digital image. The output is video data ready for emotion analysis.

[0603] Step 4:

[0604] The server analyzes the user's video data using emotion recognition software. It receives digital video data as input and applies an emotion analysis algorithm. It determines the user's emotional state (e.g., stress, fatigue) and generates information indicating the emotional state as output.

[0605] Step 5:

[0606] The server integrates baby state information obtained from voice analysis and user emotional state information obtained from emotion analysis in the central processing unit. The input consists of both the baby's state and the user's emotional state. This information is compared with a machine learning model in the database to generate the optimal response. The output is a message containing specific childcare support measures to notify the user.

[0607] Step 6:

[0608] The server notifies the user of generated childcare support messages visually or audibly through an output device. The input is the generated message, and the output is the information the user receives. Specific examples include push notifications and voice guidance via smartphones and smart glasses.
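
The visual or audible delivery in Step 6 of this flow can be sketched as a small dispatcher that routes one message to whichever output channels are available. The channel names and the `make_dispatcher` function are hypothetical, and the handlers here are plain callables standing in for real push-notification or voice-guidance interfaces.

```python
from typing import Callable, Dict, List, Sequence


def make_dispatcher(
    channels: Dict[str, Callable[[str], None]]
) -> Callable[[str, Sequence[str]], List[str]]:
    """Build a notify function over the output channels a device supports."""

    def notify(message: str, preferred: Sequence[str]) -> List[str]:
        delivered = []
        for name in preferred:
            handler = channels.get(name)
            if handler is not None:  # skip channels this device lacks
                handler(message)
                delivered.append(name)
        return delivered

    return notify
```

For instance, a smartphone might register `"push"` and `"voice"` handlers; requesting `["push", "vibration", "voice"]` then delivers through the two registered channels and reports which ones were used.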

[0609] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0610] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions and inference data, such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0611] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the headset-type terminal 314.

[0612] [Fourth Embodiment]

[0613] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0614] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0615] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0616] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0617] The microphone 238 receives instructions and the like from the user 20 by picking up the voice uttered by the user 20. The microphone 238 captures the voice, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio in accordance with instructions from the processor 46.

[0618] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0619] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0620] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0621] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0622] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0623] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0624] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0625] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0626] This invention is a system that suggests appropriate responses to caregivers based on the sounds of a baby crying. This system is realized by collecting audio via a smart device and analyzing that information on a server.

[0627] First, the device monitors ambient sounds through its built-in microphone and begins recording when it detects a baby crying. This recorded data is then transmitted to a server via a secure network.

[0628] Next, the server analyzes the received audio data. This involves using audio processing algorithms to extract features such as frequency, amplitude, and duration of the sound. This analysis identifies crying patterns and classifies the baby's condition (hungry, sleepy, stressed, unwell, etc.).
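
As a non-limiting illustration of the feature extraction described above, the following minimal sketch computes the duration, the RMS amplitude, and a rough frequency estimate from mono samples. The names `CryFeatures` and `extract_features`, the 16 kHz sample rate, and the zero-crossing frequency estimator are assumptions made for illustration; the disclosure does not specify the audio processing algorithm.

```python
import math
from dataclasses import dataclass


@dataclass
class CryFeatures:
    duration_s: float     # length of the clip in seconds
    rms_amplitude: float  # root-mean-square amplitude of the samples
    frequency_hz: float   # rough frequency estimate from zero crossings


def extract_features(samples, sample_rate=16000):
    """Extract simple features from mono samples normalized to [-1.0, 1.0]."""
    if not samples:
        raise ValueError("samples must not be empty")
    duration_s = len(samples) / sample_rate
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Count sign changes; a pure tone crosses zero twice per cycle.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    frequency_hz = crossings / (2 * duration_s)
    return CryFeatures(duration_s, rms, frequency_hz)


if __name__ == "__main__":
    rate = 16000
    # Synthetic 1-second, 450 Hz tone standing in for a recorded cry.
    tone = [0.5 * math.sin(2 * math.pi * 450 * n / rate) for n in range(rate)]
    print(extract_features(tone, rate))
```

A zero-crossing estimate is only reliable for clean, roughly periodic sound. A practical implementation would more likely use spectral analysis.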

[0629] The server then uses its pre-programmed learning database to predict the most appropriate course of action for the baby's condition. For example, if the baby is determined to be hungry, it will generate specific advice such as, "It's time to breastfeed."

[0630] The server then sends this solution to the device, and the user is notified. The device displays this as a pop-up message or push notification, providing the information in a way that is easy for caregivers to check and act upon.

[0631] Finally, users can provide feedback to the system about the actions they actually took. This allows the server to learn from the new data and further improve the accuracy of future analyses.

[0632] As a concrete example, consider a scenario where a baby starts crying. At this point, the device records the audio, and the server notifies the user of the analysis result, stating, "The baby's diaper may be wet." The user checks the diaper, and if it is indeed wet, provides feedback, which updates the database. This collaboration improves the quality of service in subsequent instances.
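
The feedback loop in this example can be sketched as follows, purely as an illustration. The class `FeedbackStore` is a hypothetical in-memory stand-in for the server-side database; it records whether each predicted cause was confirmed by the user and reports a per-cause confirmation rate. How the server actually learns from feedback is not specified in the disclosure.

```python
from collections import defaultdict


class FeedbackStore:
    """Hypothetical in-memory stand-in for the server-side database."""

    def __init__(self):
        # predicted cause -> [confirmed count, total count]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, predicted_cause, confirmed):
        """Store one piece of user feedback for a predicted cause."""
        entry = self._counts[predicted_cause]
        entry[1] += 1
        if confirmed:
            entry[0] += 1

    def confirmation_rate(self, predicted_cause):
        """Return the fraction of confirmed predictions, or None if no feedback exists."""
        confirmed, total = self._counts.get(predicted_cause, (0, 0))
        return confirmed / total if total else None


if __name__ == "__main__":
    store = FeedbackStore()
    store.record("wet_diaper", confirmed=True)   # the diaper was indeed wet
    store.record("wet_diaper", confirmed=False)  # the diaper was dry this time
    print(store.confirmation_rate("wet_diaper"))
```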

[0633] The following describes the processing flow.

[0634] Step 1:

[0635] The device constantly monitors ambient noise and starts recording when it detects a baby's crying exceeding a certain decibel level.

[0636] Step 2:

[0637] The device compresses the recorded audio data and sends it to the server using a secure protocol.

[0638] Step 3:

[0639] The server decompresses the received audio data and performs signal analysis. It extracts features such as frequency and volume to analyze the crying patterns.

[0640] Step 4:

[0641] The server applies crying patterns to a machine learning model and compares them with past training data to infer the baby's condition (hunger, sleepiness, wet diaper, health problems, etc.).

[0642] Step 5:

[0643] The server generates advice on countermeasures based on the inferred state. This includes specific action points and suggestions for childcare products.

[0644] Step 6:

[0645] The server sends the generated advice back to the terminal.

[0646] Step 7:

[0647] The device notifies the user of any advice it receives. Notifications are delivered via screen display, audio output, vibration, etc.

[0648] Step 8:

[0649] The user reviews the advice and takes specific action for the baby based on it.

[0650] Step 9:

[0651] The user inputs the results of the actions they have taken into the terminal and provides feedback to the system.

[0652] Step 10:

[0653] The server receives user feedback and updates the database to improve the accuracy of future analyses.
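
Steps 1 to 3 of the above flow can be sketched as follows, as a minimal example under stated assumptions. The threshold of -20 dBFS, the 16-bit PCM sample format, and the function names are assumptions for illustration; the disclosure states only "a certain decibel level" and does not name a compression scheme (zlib is used here as a stand-in). The loudness check does not by itself distinguish a baby's cry from other loud sounds, and the secure transport of Step 2 is omitted.

```python
import math
import struct
import zlib

THRESHOLD_DBFS = -20.0  # assumed trigger level, not taken from the disclosure


def level_dbfs(frame):
    """Return the RMS level of a frame of 16-bit PCM samples in dBFS."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / 32768)


def should_record(frame, threshold=THRESHOLD_DBFS):
    """Step 1: start recording when the level exceeds the threshold."""
    return level_dbfs(frame) > threshold


def pack_for_upload(frame):
    """Step 2: serialize samples as little-endian int16 and compress them."""
    raw = struct.pack(f"<{len(frame)}h", *frame)
    return zlib.compress(raw)


def unpack_on_server(payload):
    """Step 3: decompress the payload and recover the samples."""
    raw = zlib.decompress(payload)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


if __name__ == "__main__":
    loud = [int(16000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(1600)]
    if should_record(loud):
        payload = pack_for_upload(loud)
        print(len(payload), unpack_on_server(payload) == loud)
```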

[0654] (Example 1)

[0655] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0656] In modern times, caregivers are required to quickly and accurately determine the cause of a baby's crying, but the technology to achieve this is still not sufficiently developed. This is especially burdensome for caregivers with little experience in distinguishing between newborn cries, so there is a need for a system that can more accurately identify the cause of crying and suggest appropriate responses.

[0657] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0658] In this invention, the server includes means for using a device that collects acoustic data and detects specific sound patterns; means for using a processing device that analyzes the acoustic data and extracts features such as frequency, amplitude, and duration; and means for inferring the cause of the sound by comparing the features with previously stored data. This enables caregivers to obtain quick and accurate countermeasures.

[0659] "Acoustic data" refers to data that electronically records sound information from the environment and is used to analyze specific sound patterns.

[0660] A "device" is a general term for computing equipment used to collect acoustic data and detect specific sound patterns.

[0661] A "processing device" is a computer system that analyzes received acoustic data and extracts features such as frequency, amplitude, and duration.

[0662] "Accumulated data" refers to audio data and its analysis results that have been collected in the past and stored within the system, and is used for comparison with new audio data.

[0663] A "generative AI model" is an artificial intelligence model that utilizes machine learning techniques to perform speech analysis and has the ability to continuously improve the accuracy of pattern recognition.

[0664] A "specific sound pattern" is a set of sound features that occur under specific circumstances, such as a baby crying, and is detected by the device based on set conditions.

[0665] This invention is a system that suggests appropriate responses to caregivers based on a baby's cries, and functions effectively through the collection, analysis, and notification of acoustic data. The system's components are a terminal, a server, and a user.

[0666] The device monitors ambient sounds using a high-sensitivity microphone built into a device such as a smartphone or tablet. When a baby's crying is detected, the sound is recorded and stored as audio data. The recorded data is encrypted for security purposes and sent to the server via a secure communication protocol (e.g., HTTPS).

[0667] After receiving acoustic data, the server analyzes the data using an audio processing algorithm. This algorithm extracts features such as frequency, amplitude, and duration from the audio. The features obtained from the analysis are compared with stored data, and a generative AI model is used to identify specific patterns in a baby's crying. This allows for the estimation of causes such as hunger, lack of sleep, stress, or poor health.

[0668] For example, if the server determines that "the baby is complaining of diaper discomfort," it sends that information to the device. The device then notifies the caregiver of this prediction and displays it as a message including specific action instructions. This allows the caregiver to quickly take care of the baby.

[0669] After users have taken action, they can provide feedback on the results to the system. This feedback information is sent to the server and used to train the generative AI model, which improves the accuracy of future analyses.

[0670] An example of a prompt might be, "Please identify the reason why the baby started crying and provide specific solutions." Based on this prompt, the system generates appropriate advice for the caregiver.
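
One possible way to combine such a prompt with the extracted features is sketched below. The function `build_prompt` and the feature names are hypothetical; the disclosure does not specify how feature data is presented to the generative AI model.

```python
DEFAULT_INSTRUCTION = (
    "Please identify the reason why the baby started crying "
    "and provide specific solutions."
)


def build_prompt(features, instruction=DEFAULT_INSTRUCTION):
    """Append the extracted features to the instruction as a plain-text list."""
    lines = [instruction, "", "Extracted features:"]
    for name, value in sorted(features.items()):
        lines.append(f"- {name}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt({"frequency_hz": 450, "duration_s": 2.5}))
```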

[0671] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0672] Step 1:

[0673] The device constantly monitors ambient sounds using its built-in microphone. Recording begins immediately upon detection of a specific sound pattern, such as a baby crying. The input is real-time acoustic data, and the output is recorded audio data. Specifically, noise filtering is applied to reduce background noise and clearly capture the baby's crying.

[0674] Step 2:

[0675] The terminal sends recorded audio data to the server using a secure communication protocol (such as HTTPS). The input is the recorded audio file, and the output is the transmission of encrypted audio data. The specific operation includes a process of data compression to improve the efficiency of the transfer.

[0676] Step 3:

[0677] The server analyzes the received audio data using an audio processing algorithm. This analysis extracts features such as frequency, amplitude, and duration from the audio. The input is the audio data sent to the server, and the output is the analyzed feature data. Specifically, the process involves using audio spectral analysis to quantify the features.

[0678] Step 4:

[0679] The server compares the feature data obtained from the analysis with the stored data and uses a generative AI model to infer the cause of the crying. The input is the feature data and the stored database, and the output is the category of the inferred cause. Specifically, the machine learning model is applied recursively, and pattern matching is performed.

[0680] Step 5:

[0681] The server generates a response plan for the caregiver based on the suspected cause and sends it to the terminal. The input is the category of the cause, and the output is a message containing specific action instructions for the caregiver. Specifically, the server selects the optimal action from the database.

[0682] Step 6:

[0683] The device notifies the caregiver of the received instructions. Input is a message from the server, and output is a pop-up or push notification on the device. Specific actions include the visual display of messages on the user interface.

[0684] Step 7:

[0685] The user provides care for the baby based on the instructions given and feeds the results back to the system. The input is the actions taken by the caregiver and their results, and the output is the data sent to the server as feedback information. Specifically, the caregiver provides feedback by selecting options and entering comments through the app.
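
The comparison with stored data in Step 4 can be illustrated with a simple nearest-neighbor lookup. This is a stand-in technique chosen for brevity, not the generative AI model described in the disclosure. The entries in `ACCUMULATED` and the values in `SCALES` are invented placeholders, not measurements.

```python
import math

# Placeholder accumulated data: (frequency_hz, rms_amplitude, duration_s) -> cause.
# These values are invented for illustration only.
ACCUMULATED = [
    ((450.0, 0.60, 1.2), "hunger"),
    ((350.0, 0.30, 2.5), "sleepiness"),
    ((550.0, 0.80, 0.6), "discomfort"),
]

# Assumed per-feature scales so that no single feature dominates the distance.
SCALES = (100.0, 0.1, 1.0)


def infer_cause(features, accumulated=ACCUMULATED, scales=SCALES):
    """Return the cause of the stored entry closest to the given features."""

    def distance(a, b):
        return math.sqrt(
            sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales))
        )

    _, cause = min(accumulated, key=lambda item: distance(features, item[0]))
    return cause


if __name__ == "__main__":
    print(infer_cause((440.0, 0.55, 1.0)))
```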

[0686] (Application Example 1)

[0687] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0688] Accurately interpreting a baby's cries and promptly providing appropriate advice to caregivers is crucial in reducing the burden of childcare. However, conventional childcare support systems have limitations in voice analysis and lack the ability to accurately predict the various conditions of a baby. Furthermore, there is room for improvement in methods that shorten the time it takes for caregivers to take action and that support specific childcare actions. To address these challenges, a system is needed that provides more accurate voice analysis and rapid, accurate childcare support based on that analysis.

[0689] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0690] In this invention, the server includes an information processing means for collecting the baby's voice along with environmental information, an information processing device for analyzing the voice and extracting features, and an information processing device for inferring the baby's condition based on the features and generating countermeasures. This makes it possible to classify the diverse conditions of the baby with high accuracy and to quickly provide specific action guidelines to caregivers.

[0691] An "information processing means" is a device for collecting audio and environmental information and processing it as data.

[0692] An "information processing device" is a device that extracts and analyzes features from collected audio data.

[0693] A "machine" is a device that operates automatically based on notified countermeasures to assist with childcare.

[0694] A "past learning data recording device" is a device that stores previously collected data and learning results, and compares them with new data.

[0695] "Action instructions" are specific guidance that elicits particular actions and indicates how caregivers or machines should respond.

[0696] To implement this invention, it is necessary to construct a system that collects a baby's voice and provides appropriate advice to the caregiver. This system consists of an information processing means, an information processing device, a machine, and a device for recording past learning data.

[0697] The device is installed as a home robot and uses a high-sensitivity microphone to constantly monitor ambient sounds. When it detects a baby crying, it records the audio data and sends it to a server. The server uses speech processing software such as Google Cloud Speech-to-Text or IBM Watson to analyze the transmitted audio data and extract features such as frequency, amplitude, and duration. By comparing the extracted features with past training data from a recording device, the system identifies the baby's condition and generates the optimal response accordingly.

[0698] The generated countermeasures are notified to a home robot, which acts as a terminal, and the robot provides information to the caregiver using voice and a display. For example, if the baby starts crying in the middle of the night, the caregiver will receive a notification saying, "The baby may be getting sleepy." Furthermore, the robot will take actions such as playing a gentle lullaby to soothe the baby.

[0699] Thus, the present invention aims to provide prompt and accurate childcare support based on the sounds of a baby crying, thereby reducing the burden on childcare providers.

[0700] An example of a prompt message for the generative AI model would be: "I have recorded my baby crying. Please send this data to the server and generate optimal parenting advice based on the analysis results."

[0701] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0702] Step 1:

[0703] The device constantly monitors ambient sounds using its built-in high-sensitivity microphone, and when it detects a baby crying, it records the sound. This recorded data becomes the input. The recorded audio data is processed as a digital signal and prepared for transmission to the server.

[0704] Step 2:

[0705] The server receives audio data sent from the terminal. The input audio data is analyzed using Google Cloud Speech-to-Text or IBM Watson. Data processing is performed to extract features such as the frequency characteristics, amplitude, and duration of the audio, and the crying pattern is identified based on the results.

[0706] Step 3:

[0707] The server predicts the baby's state (hunger, sleepiness, stress, etc.) by comparing the extracted audio features with past training data recorded by a machine learning algorithm. The data calculation used here is pattern matching using a machine learning algorithm. The predicted state is obtained as the output.

[0708] Step 4:

[0709] The server generates appropriate countermeasures based on the estimated state of the baby. This generation process refers to a pre-entered database of childcare advice and outputs the most suitable message. For example, it might generate specific advice such as, "The baby seems sleepy."

[0710] Step 5:

[0711] Notification information is sent to the device. The device receives the message and provides the information to the caregiver through a display or audio output device. The information is displayed using an intuitive interface to make it easier for the caregiver to check the appropriate course of action.

[0712] Step 6:

[0713] Users (caregivers) take specific caregiving actions based on the advice they receive. For example, they might try to get their baby to sleep or breastfeed. They can provide feedback on the actions they take, which helps improve the accuracy of future analyses.
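
The database lookup in Step 4 can be sketched as a simple mapping from the estimated state to a message. The state keys, the name `generate_advice`, and the fallback message are assumptions for illustration; the messages for known states are the example messages given in this description.

```python
# State keys are hypothetical; the messages are examples from the description.
ADVICE_DB = {
    "hunger": "It's time to breastfeed.",
    "sleepiness": "The baby seems sleepy.",
    "wet_diaper": "The baby's diaper may be wet.",
}

# Assumed fallback for states that have no entry in the database.
FALLBACK = "The cause could not be determined. Please check on the baby."


def generate_advice(state):
    """Return the stored advice for a state, or the fallback message."""
    return ADVICE_DB.get(state, FALLBACK)


if __name__ == "__main__":
    print(generate_advice("sleepiness"))
```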

[0714] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0715] This invention is a system that aims to provide optimal childcare support by considering not only the baby's cries but also the caregiver's emotional state. This system provides comprehensive support to caregivers by integrating voice analysis, emotion recognition, generation of appropriate countermeasures, and notification thereof.

[0716] The device collects the baby's cries using a microphone as described above and sends the recorded data to a server. At the same time, a camera on the device captures the user's facial expressions and provides this information to the emotion engine.

[0717] The server processes the baby's crying data using a voice analysis algorithm to extract features. These features are then passed to a machine learning model to estimate the baby's state. Additionally, based on the user's facial expressions captured by the camera, an emotion engine analyzes the user's emotions and determines their state (e.g., stress, fatigue, calmness).

[0718] Based on this analysis, the server creates a response optimized for the baby's condition and the user's emotions. For example, if the baby is hungry and the user is tired, it provides simple, quick action plans and a list of recommended baby products.

[0719] The server then sends the generated notification to the device, which provides the user with visual and auditory notifications. The notification content is designed to be easy to understand and act upon, taking into account the user's situation, with the aim of reducing parenting stress.

[0720] As a concrete example, consider a scenario where a baby starts crying at night. The device detects the crying and sends it to the server, while simultaneously capturing the user's facial expression with a camera. If the server determines the baby is "hungry" and the user is "tired," a specific and simple message such as "Feeding is needed. We recommend preparing formula milk" is generated and sent as a notification from the device. This allows caregivers to take appropriate action quickly, reducing their burden.

[0721] The following describes the processing flow.

[0722] Step 1:

[0723] The device constantly monitors the surrounding sounds using a microphone and records the baby's cries. It is configured to start recording when it detects sounds exceeding a certain volume or frequency.

[0724] Step 2:

[0725] The device uses its camera to capture the user's facial expressions. It applies facial recognition technology and collects facial expression data in real time.

[0726] Step 3:

[0727] The device formats the recorded audio and facial expression data appropriately and sends it to the server using a secure protocol.

[0728] Step 4:

[0729] The server receives the audio data and uses an audio analysis algorithm to extract the characteristics of the crying sounds. This analysis evaluates data such as sound patterns, length, and intervals.

[0730] Step 5:

[0731] The server uses a pre-trained machine learning model to analyze the characteristics of the baby's cries and predict its condition. This is then categorized into specific states such as hunger, sleepiness, or illness.

[0732] Step 6:

[0733] The server simultaneously analyzes facial expression data using an emotion engine to determine the user's emotional state (e.g., tired, stressed). It focuses on analyzing the movement of facial muscles and eye movements.

[0734] Step 7:

[0735] The server integrates the baby's estimated state with the user's emotional state to generate the optimal response. This proposal includes specific action instructions tailored to the baby's state and support measures that take the user's emotional state into consideration.

[0736] Step 8:

[0737] The server generates and delivers countermeasures to the terminal. These include not only action instructions but also encouragement and advice that takes into account the user's psychological state.

[0738] Step 9:

[0739] The device sends notifications to the user. These notifications are displayed visually as pop-up messages, and audio and vibration feedback can also be configured.

[0740] Step 10:

[0741] The user checks the notification and takes action based on the suggested response. At this time, the user can input the results of their actions as feedback into their device.

[0742] Step 11:

[0743] The server receives feedback from users and updates the database based on that feedback. This makes it possible to improve the accuracy of future analyses.
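
The integration of the baby's estimated state with the user's emotional state (Steps 7 and 8) can be sketched as follows. The state and emotion keys, the `Response` type, the default action, and the encouragement texts are hypothetical and serve only to show the structure; the action message for a hungry baby follows the example given above.

```python
from dataclasses import dataclass

BASE_ACTIONS = {
    "hunger": "Feeding is needed. We recommend preparing formula milk.",
    "sleepiness": "The baby may be getting sleepy.",
}

# Invented encouragement texts keyed by the user's estimated emotion.
ENCOURAGEMENT = {
    "tired": "One short step at a time is enough.",
    "stressed": "You are doing well. Take a slow breath first.",
}


@dataclass(frozen=True)
class Response:
    action: str         # action instruction tailored to the baby's state
    encouragement: str  # supportive text tailored to the user's emotion


def integrate(baby_state, user_emotion):
    """Combine the baby's state and the user's emotion into one response."""
    action = BASE_ACTIONS.get(baby_state, "Please check on the baby.")
    encouragement = ENCOURAGEMENT.get(user_emotion, "")
    return Response(action, encouragement)


if __name__ == "__main__":
    print(integrate("hunger", "tired"))
```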

[0744] (Example 2)

[0745] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0746] In childcare, quickly determining the cause of a baby's crying and appropriately communicating the necessary response to the caregiver is a challenging task. Furthermore, simply offering general solutions without considering the caregiver's emotional state can increase their burden. This invention aims to provide optimal support by considering not only the baby's condition but also the caregiver's emotional state.

[0747] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0748] In this invention, the server includes means for analyzing speech and extracting features, means for recognizing the emotional state of the caregiver, and means for integrating the aforementioned speech features and the caregiver's emotional state to generate an optimal response. This makes it possible to provide optimal childcare support based on the state of the person being cared for and the emotions of the caregiver.

[0749] An "electronic device" is a device used to collect digital audio data and to record the crying of a child.

[0750] A "data processing device" is a computer device used to analyze collected audio data and extract its characteristics.

[0751] A "central control unit" is a device that has the function of generating optimal childcare responses based on voice characteristics and the emotional state of the caregiver.

[0752] A "display device" is a device that notifies the caregiver of the generated countermeasures visually or audibly.

[0753] "Target of childcare" refers to babies and infants who are the target of support in this invention.

[0754] "The emotional state of a caregiver" refers to the emotional conditions that caregivers experience during childcare, such as stress, fatigue, and relaxation.

[0755] "Optimal response measures" refer to specific and practical instructions and recommendations for childcare that are generated based on the condition of the person being cared for and the emotional state of the caregiver.

[0756] "Training data" refers to information accumulated from past analysis results of speech and emotions, and is a dataset used as a standard or reference in current analysis.

[0757] This invention is a comprehensive system for supporting childcare, providing optimal childcare support by analyzing the baby's cries and the caregiver's emotional state. The system mainly consists of terminals, a server, and the caregiver.

[0758] The device is an electronic device equipped with a microphone and a camera, which is used to record the baby's cries and capture the caregiver's facial expressions. The recorded audio data and captured image data are transmitted to a server in real time.

[0759] The server performs key data processing for speech analysis and emotion recognition. This analysis is carried out by applying speech analysis algorithms to the speech data. It also processes image data received from the camera using emotion recognition software to determine the emotional state of the caregiver. The accuracy of the analysis is improved by comparing the speech characteristics and the caregiver's emotional state with past training data.

[0760] Past training data is accumulated by machine learning algorithms and forms the basis for the system to understand voice and facial expression patterns. Based on this, the server uses a generative AI model to create optimal responses. These responses are then communicated to caregivers as visual or audio messages. The notifications are specific and simple to enable caregivers to respond quickly.

[0761] To give a specific example, if a baby starts crying at night, the device quickly detects the crying and simultaneously assesses the caregiver's facial expression. For instance, if the device determines that the baby is "hungry" and the caregiver is "tired," it generates a message such as, "The baby needs to be fed. Please prepare the formula." This allows caregivers to efficiently resolve childcare challenges.

[0762] Examples of input prompts for the generative AI model include specific instructions such as, "The baby is crying. Based on the audio data and the user's emotions, please provide appropriate childcare solutions." This allows the system to provide accurate childcare support based on a combination of data.

[0763] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0764] Step 1:

[0765] The device uses a microphone to collect the baby's cries in real time. The cries, as input, are converted into digital audio data and stored on the device. Simultaneously, a camera is used to capture the caregiver's facial expressions and record them as image data. This allows for the simultaneous collection of audio and image data.

[0766] Step 2:

[0767] The device transmits the collected audio and image data to the server via the internet. During this transmission process, the data is encrypted to ensure data security. The output is the transmitted audio and image dataset.

[0768] Step 3:

[0769] The server feeds the received audio data to a speech analysis algorithm. Data processing is performed to extract features such as pitch, intensity, and rhythm from the input audio data. The resulting feature data is then fed to a machine learning model to classify the baby's state into categories such as "hungry," "sleepy," and "uncomfortable."

[0770] Step 4:

[0771] The server analyzes the emotional state of the caregiver using image data obtained from the camera. Emotion recognition software evaluates the movement of facial muscles and facial patterns, and determines emotions such as "stress" and "fatigue" from the input images. This process yields emotional state as feature data.

[0772] Step 5:

[0773] The server integrates the obtained voice characteristics with the caregiver's emotional state and generates the optimal response using a generative AI model. Here, voice data and emotional data are processed as input, and the necessary actions and recommendations are concretized. The output is an action plan optimized for both the baby and the caregiver.

[0774] Step 6:

[0775] The server sends the generated countermeasures to the terminal. The terminal receives this information and notifies the caregiver using the display screen or speaker. The notification includes visual messages and audio guidance to deliver quick and clear instructions to the caregiver.

[0776] Through these steps, the system can process the input data and provide parents with rational and practical solutions to difficult situations in childcare.
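
The joint transmission of audio and image data in Step 2 can be sketched as a single JSON document, as one possible arrangement. The field names and the functions `build_payload` and `parse_payload` are hypothetical. Encryption is not shown here and is assumed to be handled by the transport layer.

```python
import base64
import json


def build_payload(audio: bytes, image: bytes, device_id: str) -> str:
    """Terminal side: bundle audio and image data into one JSON document."""
    return json.dumps({
        "device_id": device_id,
        "audio_b64": base64.b64encode(audio).decode("ascii"),
        "image_b64": base64.b64encode(image).decode("ascii"),
    })


def parse_payload(payload: str):
    """Server side: recover the raw audio bytes, image bytes, and device ID."""
    doc = json.loads(payload)
    return (
        base64.b64decode(doc["audio_b64"]),
        base64.b64decode(doc["image_b64"]),
        doc["device_id"],
    )


if __name__ == "__main__":
    payload = build_payload(b"\x00\x01audio", b"\xff\xd8image", "terminal-1")
    print(parse_payload(payload))
```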

[0777] (Application Example 2)

[0778] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0779] In today's childcare environment, where it is difficult to respond appropriately to a baby's crying, the challenge is to alleviate the mental and physical burden on caregivers and provide prompt and effective childcare support. Furthermore, because the emotional state of the caregiver themselves is not taken into consideration, the current support is not truly meaningful for them.

[0780] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0781] In this invention, the server includes acoustic detection means for collecting the baby's voice, visual information acquisition means for capturing the caregiver's facial expressions and analyzing their emotions, and a central processing unit for estimating the baby's and caregiver's conditions based on the aforementioned characteristics and the caregiver's emotions, and for generating countermeasures. This enables comprehensive childcare support that takes into account both the baby's condition and the caregiver's emotions.

[0782] "Acoustic detection means" refers to a sensor device used to collect the sound of a baby crying.

[0783] A "processing device" is a device used to analyze collected audio data and extract its characteristics.

[0784] A "visual information acquisition device" is a device that captures the facial expressions of a caregiver and analyzes their emotional state.

[0785] A "central processing unit" is a computing device that estimates the baby's condition and the caregiver's condition based on audio and visual data, and generates the optimal course of action.

[0786] An "output device" is a device that provides notifications to caregivers visually or audibly.

[0787] This invention relates to a system that analyzes a baby's voice and the caregiver's emotions to provide optimal childcare support. The system comprises an acoustic detection means, a visual information acquisition means, a central processing unit, and an output device.

[0788] First, the acoustic detection means collects the baby's cries. Its hardware includes an audio input device such as a high-sensitivity microphone. The visual information acquisition means uses a camera to capture the caregiver's facial expressions; specific examples include a standard USB camera or a camera built into a smartphone.

[0789] The server analyzes the collected audio data using a processing unit and extracts its features. This process utilizes an audio analysis algorithm to analyze the patterns of the baby's cries. Next, the facial expression data of the caregiver, obtained using visual information acquisition means, is analyzed by emotion recognition software. Specific software used includes "EmotionEngine," among others.
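The feature extraction described above can be illustrated with a minimal sketch. The source does not name the audio analysis algorithm, so the features chosen here (duration, root-mean-square amplitude, and a zero-crossing frequency estimate) and the names extract_features and synthetic_tone are assumptions for illustration only. A synthetic tone stands in for a recorded cry.

```python
import math


def extract_features(samples, sample_rate):
    """Return simple features of a mono signal (assumed floats in [-1.0, 1.0])."""
    if not samples:
        raise ValueError("samples must not be empty")
    n = len(samples)
    # Root-mean-square amplitude: a rough loudness measure.
    rms = math.sqrt(sum(s * s for s in samples) / n)
    # Count sign changes between consecutive samples.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = n / sample_rate
    # A pure tone crosses zero twice per cycle, so this is a crude pitch estimate.
    est_freq = crossings / (2 * duration)
    return {"duration_s": duration, "rms": rms, "est_freq_hz": est_freq}


def synthetic_tone(freq_hz, seconds, sample_rate):
    """Generate a sine wave used here in place of real microphone input."""
    count = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * k / sample_rate) for k in range(count)]


features = extract_features(synthetic_tone(440.0, 1.0, 8000), 8000)
print(features)
```

A real system would likely use richer spectral features. This sketch only shows the shape of the data handed from the processing unit to the later estimation stage.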

[0790] Based on the analysis results, the central processing unit infers the baby's condition and the caregiver's emotions, and generates the optimal course of action based on this information. This involves matching the data with a database that includes machine learning models. The generated course of action is notified to the caregiver visually or audibly through an output device. Specific examples include push notifications and voice guidance via smartphones or smart glasses.

[0791] This invention makes it possible to provide accurate childcare support that takes into account the baby's crying and the caregiver's condition, thereby reducing the burden on the caregiver.

[0792] Example of a prompt:

[0793] "Enter a Japanese sentence. For example, 'The baby has started crying. The mother appears tired.' Please generate the best parenting advice for this sentence."
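As one possible illustration, a prompt of this kind could be assembled from the two estimates produced by the system. The function name build_prompt and the exact wording are assumptions; the source gives only the example sentence above.

```python
def build_prompt(baby_state, caregiver_emotion):
    """Compose a text prompt for a generative model from the two estimates."""
    if not baby_state or not caregiver_emotion:
        raise ValueError("both the baby state and the caregiver emotion are required")
    return (
        f"The baby appears to be {baby_state}. "
        f"The caregiver appears {caregiver_emotion}. "
        "Please generate the best parenting advice for this situation."
    )


prompt = build_prompt("hungry", "tired")
print(prompt)
```

Keeping the prompt construction in one function makes it straightforward to change the wording without touching the analysis stages.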

[0794] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0795] Step 1:

[0796] The user activates the acoustic detection device near the baby. The acoustic detection device collects the baby's cries through a microphone. The input is real-time audio data, which is converted into digital data and sent to the processing unit as an audio signal. The output is digital audio data to be analyzed.

[0797] Step 2:

[0798] The server analyzes the audio data acquired using a processing unit. It receives digital audio data as input and applies an audio analysis algorithm. By extracting features from the data and analyzing the patterns of the baby's cries, it infers the baby's state (e.g., hungry, sleepy, etc.). The output is information indicating the baby's state.

[0799] Step 3:

[0800] The user activates a visual information acquisition device and captures their facial expressions with a camera. The input is real-time video data, which is sent to the server as a digital image. The output is video data ready for emotion analysis.

[0801] Step 4:

[0802] The server analyzes the user's video data using emotion recognition software. It receives digital video data as input and applies an emotion analysis algorithm. It determines the user's emotional state (e.g., stress, fatigue) and generates information indicating the emotional state as output.

[0803] Step 5:

[0804] The server integrates baby state information obtained from voice analysis and user emotional state information obtained from emotion analysis in the central processing unit. The input consists of both the baby's state and the user's emotional state. This information is compared with a machine learning model in the database to generate the optimal response. The output is a message containing specific childcare support measures to notify the user.

[0805] Step 6:

[0806] The server notifies the user of generated childcare support messages visually or audibly through an output device. The input is the generated message, and the output is the information the user receives. Specific examples include push notifications and voice guidance via smartphones and smart glasses.
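Steps 2, 4, 5, and 6 above can be sketched as follows. This is a minimal illustration under stated assumptions: the table COUNTERMEASURES is a hypothetical stand-in for the machine learning model and database of Step 5, its wording is placeholder text rather than advice taken from the source, and the cry classifier is supplied by the caller because the source does not specify one.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class SupportMessage:
    baby_state: str
    caregiver_emotion: str
    text: str


# Hypothetical lookup standing in for the model/database comparison in Step 5.
COUNTERMEASURES = {
    ("hungry", "fatigue"): "Placeholder: feed the baby, then ask for help and rest.",
    ("hungry", "stress"): "Placeholder: feed the baby and take a short pause.",
    ("sleepy", "fatigue"): "Placeholder: settle the baby and rest at the same time.",
}
DEFAULT_TEXT = "Placeholder: check the baby's basic needs."


def dominant_emotion(scores: Mapping[str, float]) -> str:
    """Step 4: pick the emotion label with the highest score."""
    if not scores:
        raise ValueError("scores must not be empty")
    return max(scores, key=scores.get)


def run_pipeline(
    audio_features: Mapping[str, float],
    expression_scores: Mapping[str, float],
    classify_cry: Callable[[Mapping[str, float]], str],
) -> SupportMessage:
    baby_state = classify_cry(audio_features)        # Step 2: baby state
    emotion = dominant_emotion(expression_scores)    # Step 4: caregiver emotion
    text = COUNTERMEASURES.get((baby_state, emotion), DEFAULT_TEXT)  # Step 5
    return SupportMessage(baby_state, emotion, text)  # passed to the output device in Step 6


message = run_pipeline(
    {"rms": 0.4, "duration_s": 2.0},
    {"stress": 0.2, "fatigue": 0.7},
    classify_cry=lambda features: "hungry",  # stub classifier for this demonstration
)
print(f"[{message.baby_state} / {message.caregiver_emotion}] {message.text}")
```

Passing the classifier in as a parameter keeps the sketch neutral about how cries are actually recognized, which the source leaves open.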

[0807] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0808] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0809] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0810] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0811] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0812] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0813] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible (expressed in actions) it becomes.

[0814] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
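The layout of the emotion map described above can be sketched geometrically. The coordinate convention (origin at the map centre, +x to the right, +y upward, outer radius 1.0) and the half-radius boundary between inner and outer regions are assumptions made for this illustration; the source describes the regions only qualitatively.

```python
import math


def describe_position(x, y, outer_radius=1.0):
    """Describe a point on an assumed unit-disc layout of the emotion map."""
    r = math.hypot(x, y)
    if r > outer_radius:
        raise ValueError("point lies outside the map")
    # Left half: "response" region; right half: "situation" region.
    side = "situation" if x > 0 else "response" if x < 0 else "boundary"
    # Upper side: pleasure; lower side: displeasure.
    valence = "pleasure" if y > 0 else "displeasure" if y < 0 else "neutral"
    # Inner emotions are more primitive; outer ones are expressed in action.
    depth = "inner" if r < outer_radius / 2 else "outer"
    return {"side": side, "valence": valence, "depth": depth, "radius": r}


# A point at the 3 o'clock position, near the rim of the map.
print(describe_position(0.8, 0.0))
# A point in the upper-left, close to the centre.
print(describe_position(-0.2, 0.3))
```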

[0815] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0816] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
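One way to obtain training targets in which nearby emotions have similar values is sketched below. The source does not state how this property is enforced, so the Gaussian kernel over map distance is a swapped-in technique, and the positions in POSITIONS (including the placement of "anxious") are hypothetical values chosen for illustration.

```python
import math

# Hypothetical 2-D positions of a few emotions on the map.
POSITIONS = {
    "reassured": (0.60, 0.10),
    "calm": (0.65, 0.00),
    "confident": (0.70, 0.15),
    "anxious": (0.60, -0.60),
}


def soft_targets(label, positions, width=0.2):
    """Assign each emotion a value that decays with map distance from `label`."""
    cx, cy = positions[label]
    return {
        name: math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * width ** 2))
        for name, (x, y) in positions.items()
    }


def identify(values):
    """Determine the emotion as the one with the largest emotion value."""
    return max(values, key=values.get)


targets = soft_targets("calm", POSITIONS)
for name, value in sorted(targets.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.3f}")
print("identified:", identify(targets))
```

With these assumed positions, "reassured" and "confident" receive values close to that of "calm", while the distant "anxious" receives a value near zero.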

[0817] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0818] In the above embodiment, an example was given in which the specific processing is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and the specific processing may be distributed across multiple computers, including computer 22. For example, the data generation model 58 may be provided in a device external to the data processing device 12, and the external device may generate data according to the input data.

[0819] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0820] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0821] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0822] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0823] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0824] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0825] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0826] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0827] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0828] The following is further disclosed regarding the embodiments described above.

[0829] (Claim 1)

[0830] A computing device for collecting baby sounds,

[0831] A central device for analyzing the aforementioned audio and extracting features,

[0832] A central device for predicting the baby's condition based on the aforementioned characteristics and generating countermeasures,

[0833] A computing device for notifying the person responsible for the generated countermeasures,

[0834] A system that includes this.

[0835] (Claim 2)

[0836] The system according to claim 1, wherein the central device includes means for comparing the characteristics of the voice with a past learning database.

[0837] (Claim 3)

[0838] The system according to claim 1, wherein the notification includes a message containing specific instructions for taking care of the baby.

[0839] "Example 1"

[0840] (Claim 1)

[0841] A means of using a device that collects acoustic data and detects specific sound patterns,

[0842] Means for using a processing device to analyze the aforementioned acoustic data and extract features such as frequency, amplitude, and duration,

[0843] Based on the aforementioned characteristics, a means for inferring the cause of the sound by comparing it with past accumulated data,

[0844] A means of notifying caregivers of information based on the aforementioned suspected cause of the occurrence and instructing them on specific actions,

[0845] A system that includes this.

[0846] (Claim 2)

[0847] The system according to claim 1, wherein the processing device continuously learns and improves the accuracy of speech analysis by utilizing a generative AI model.

[0848] (Claim 3)

[0849] The system according to claim 1, wherein the device has a function to encrypt acoustic data using a secure communication protocol and transfer it to a central device.

[0850] "Application Example 1"

[0851] (Claim 1)

[0852] Information processing means for collecting a baby's voice along with environmental information,

[0853] An information processing device for analyzing the aforementioned audio and extracting features,

[0854] An information processing device for predicting the baby's condition based on the aforementioned characteristics and generating countermeasures,

[0855] Information processing means for notifying a machine operating as a childcare support device of the generated countermeasures,

[0856] A system that includes this.

[0857] (Claim 2)

[0858] The system according to claim 1, wherein the information processing device includes means for comparing the characteristics of the voice with those of a past learning data recording device.

[0859] (Claim 3)

[0860] The system according to claim 1, wherein the notification generates guidance including instructions for actions to take care of a baby.

[0861] "Example 2 of combining an emotion engine"

[0862] (Claim 1)

[0863] Electronic devices for collecting voices of children,

[0864] A data processing device for analyzing the aforementioned audio and extracting features,

[0865] A central control device for generating countermeasures by using the aforementioned voice characteristics to estimate the state of the person being cared for and also considering the emotional state of the caregiver,

[0866] A display device for notifying the person responsible for the generated countermeasures,

[0867] A system that includes this.

[0868] (Claim 2)

[0869] The system according to claim 1, wherein the central control device includes means for comparing the characteristics of the voice and the emotional state of the caregiver with past learning data.

[0870] (Claim 3)

[0871] The system according to claim 1, wherein the notification includes specific instructions for actions to take in providing care for the person being cared for and support messages that take into account the caregiver's condition.

[0872] "Application example 2 when combining with an emotional engine"

[0873] (Claim 1)

[0874] Acoustic detection means for collecting baby sounds,

[0875] A processing device for analyzing the aforementioned audio and extracting features,

[0876] A means of acquiring visual information to capture the facial expressions of caregivers and analyze their emotions,

[0877] A central processing unit for inferring the state of the baby and the state of the caregiver based on the aforementioned characteristics and the caregiver's emotions, and for generating countermeasures,

[0878] An output device for visually or audibly notifying the caregiver of the generated countermeasures,

[0879] A system that includes this.

[0880] (Claim 2)

[0881] The system according to claim 1, wherein the central processing unit includes means for comparing the characteristics of the voice with the emotional information of the caregiver against a past learning database.

[0882] (Claim 3)

[0883] The system according to claim 1, wherein the notification includes a message containing specific action instructions based on the baby's condition and the caregiver's condition.

[Explanation of symbols]

[0884]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A computing device for collecting baby sounds, A central device for analyzing the aforementioned audio and extracting features, A central device for predicting the baby's condition based on the aforementioned characteristics and generating countermeasures, A computing device for notifying the person responsible for the generated countermeasures, A system that includes this.

2. The system according to claim 1, wherein the central device includes means for comparing the characteristics of the voice with a past learning database.

3. The system according to claim 1, wherein the notification includes a message containing specific instructions for taking care of the baby.