system

A system that converts spoken content into text and analyzes images to generate diary data with emotional nuance. It addresses the difficulty of recording childcare events, makes the records easy to access and share, and reduces the burden on parents.

JP2026105384A | Pending | Publication Date: 2026-06-26 | SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-16
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Parents face challenges in timely recording daily childcare events and children's growth due to busy schedules, leading to the risk of important moments being forgotten.

Method used

A system that converts spoken content into text data, analyzes images to identify specific events, performs emotion analysis, and generates diary data, all while storing and synchronizing the information across devices for easy access and editing.

Benefits of technology

Enables parents to maintain detailed childcare records without manual effort, preserving memories with emotional nuances and allowing easy access and sharing across devices.


Smart Images

  • Figure 2026105384000001_ABST
Patent Text Reader

Abstract

To provide a system. [Solution] A system including: means for inputting voice; means for converting the voice into text data; means for performing natural language generation based on the text data to generate diary data; means for inputting an image, analyzing the image, and identifying a specific event; means for analyzing emotion based on the voice and adding emotion information to the diary data; means for a mobile body to patrol within a home and record voice and images in real time; means for transmitting the generated diary data to the cloud; and means for making the diary data accessible from a remote location.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, the method including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In modern society, many parents are busy with work and housework, which makes it difficult to record daily childcare events and children's growth in a timely manner. As a result, recording becomes a burden, and there is a risk that important moments will eventually fade from memory.

Means for Solving the Problems

[0005] To solve the above problems, the present invention provides a voice input means and a means for converting the content spoken by a user into text data. It further provides a means for creating diary data from the converted text data using natural language generation. The system is also equipped with an image input means and analyzes captured images to identify specific events. Moreover, it provides a means for performing emotion analysis based on the voice data and adding emotional information to the generated diary data, thereby enabling richer records. With these means, childcare records can be kept without any special effort, and the family can look back on them in the future.

[0006] "Means of inputting voice" refers to a device equipped with the function of recording a user's spoken voice as digital data, or software that performs that function.

[0007] "Methods for converting speech to text data" refers to technologies that analyze input speech and perform the process of converting it into text in a corresponding written format.

[0008] "A means of generating diary data through natural language generation" refers to a technology that uses the converted text data to construct semantically consistent sentences and outputs them as a diary-format document.

[0009] "Means for inputting an image and analyzing the image to identify a specific event" refers to a technology that captures visual data in digital format and analyzes that data to recognize and classify specific actions or important events.

[0010] "Methods for analyzing emotions and adding emotional information to diary data" refers to technologies that extract emotional nuances from audio and image data and reflect the results in text data to create records that include emotional information.

[0011] "Storing data in the cloud and synchronizing it with other devices" refers to a technology that stores generated data on a remote server (cloud) via the internet, ensuring that the data is accessible and up-to-date across multiple devices.

[0012] "Means of providing users with an editable interface for diary data" refers to technology that provides a user-friendly interface that allows users to view generated diary data and make corrections or additions as needed.

Brief Description of the Drawings

[0013]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when combined with an emotion engine.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when combined with an emotion engine.

Modes for Carrying Out the Invention

[0014] Hereinafter, an example of an embodiment of the system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0015] First, the terms used in the following description will be explained.

[0016] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as a "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0017] In the following embodiments, RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0018] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs and various parameters. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0019] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0020] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means A alone, B alone, or a combination of A and B. The same interpretation applies in this specification when three or more items are linked by "and/or."

[0021] [First Embodiment]

[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0030] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0034] This invention provides a system that automatically creates childcare records using voice and images, eliminating the need for parents to manually record important moments in childcare while allowing them to save and manage detailed data.

[0035] When a user speaks about childcare-related events, the device captures the user's voice using its voice input function. The recorded audio is first acquired as audio data in real time by the device and then sent to the server. This system uses a speech recognition algorithm to convert the audio data into text data.

[0036] After the server converts the audio to text, a natural language generation engine runs to create a diary-style document based on the converted text data. At this stage, grammatical consistency and readability are considered, and the text is refined to be easy for the user to understand when reviewing it later.

[0037] Furthermore, photos related to childcare taken by the user are transferred from the device to the server. The server analyzes these images, performing facial recognition and behavioral recognition to identify specific events. When important moments are recognized, corresponding explanations are added to the diary data.

[0038] The server then performs emotion analysis on the audio data. By analyzing the tone and speed of the voice, emotions such as joy and surprise are reflected in the diary data. This allows users to have richer records with emotional nuances.
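As a rough illustration of how tone and speed could be turned into an emotion label, the sketch below compares an utterance's mean pitch and speaking rate against a baseline. The `VoiceFeatures` fields, the baseline values, and the thresholds are all hypothetical and chosen only for illustration; the description does not specify how the analysis is performed.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    mean_pitch_hz: float      # average fundamental frequency of the utterance
    words_per_second: float   # speaking rate


def estimate_emotion(features: VoiceFeatures,
                     baseline_pitch_hz: float = 200.0,
                     baseline_rate: float = 2.5) -> str:
    """Return a coarse emotion label from pitch and speaking rate.

    The baselines and thresholds are illustrative assumptions only.
    """
    pitch_ratio = features.mean_pitch_hz / baseline_pitch_hz
    rate_ratio = features.words_per_second / baseline_rate
    if pitch_ratio >= 1.3 and rate_ratio >= 1.2:
        return "surprise"   # much higher and faster than usual
    if pitch_ratio >= 1.1:
        return "joy"        # noticeably higher than usual
    return "neutral"


def annotate(diary_text: str, emotion: str) -> str:
    """Append the emotion label to the diary text when it is informative."""
    if emotion == "neutral":
        return diary_text
    return f"{diary_text} (mood: {emotion})"
```

A real system would more likely use a trained model over many acoustic features; the rule-based form is used here only because it makes the mapping from tone and speed to a label easy to follow.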

[0039] Ultimately, all data is securely stored in the cloud and synchronized across the user's other devices. This allows users to access their diaries from any device and edit them as needed. Based on this, users can review their records on a daily basis and share them with their families.
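One simple way to keep a diary entry consistent when it is edited on several devices is to attach a version number and let the cloud reject stale writes. The sketch below assumes this scheme; `CloudStore`, `DiaryEntry`, and the version rule are hypothetical names and choices, not details taken from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DiaryEntry:
    entry_id: str
    text: str
    version: int = 1


@dataclass
class CloudStore:
    """In-memory stand-in for the cloud database (assumption for illustration)."""
    entries: dict[str, DiaryEntry] = field(default_factory=dict)

    def push(self, entry: DiaryEntry) -> bool:
        """Accept an entry only if it is newer than the stored copy."""
        current = self.entries.get(entry.entry_id)
        if current is not None and current.version >= entry.version:
            return False  # stale edit; the device should pull first
        # Store a copy so later local edits do not change the cloud state.
        self.entries[entry.entry_id] = DiaryEntry(entry.entry_id, entry.text, entry.version)
        return True

    def pull(self, entry_id: str) -> DiaryEntry | None:
        """Return a copy of the stored entry, or None if it does not exist."""
        current = self.entries.get(entry_id)
        if current is None:
            return None
        return DiaryEntry(current.entry_id, current.text, current.version)
```

In this sketch a device pulls an entry, edits it, increments `version`, and pushes it back; a second device that pushes an edit based on the older copy is refused and has to pull again before retrying.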

[0040] For example, if a user says, "Today my child rode a bicycle for the first time," and takes a photo, the natural language generation engine will generate a detailed diary entry such as, "Today, my child rode a bicycle for the first time, and here they are, pedaling with all their might. It was a day when family and friends celebrated with them." This entire process is automated, and the user does not need to do anything special.

[0041] The following describes the processing flow.

[0042] Step 1:

[0043] When a user speaks about events related to childcare, the device activates its voice input function and records the user's voice. The device suppresses ambient noise and saves the recording as clear audio data.

[0044] Step 2:

[0045] The device transfers the recorded audio data to the server. During the transfer, encryption is applied to protect the data, and a secure communication channel is used.

[0046] Step 3:

[0047] The server analyzes the received audio data using a speech recognition system and converts it into linguistic text. This process uses a language model to correct any errors in speech recognition.

[0048] Step 4:

[0049] The server sends text data based on voice input to a natural language generation engine. The engine generates a contextual diary entry based on the text and adjusts it to accurately reflect the user's intent.

[0050] Step 5:

[0051] The device receives images from the user and sends the image data to the server. The images are converted to the appropriate format beforehand and sent while maintaining high resolution.

[0052] Step 6:

[0053] The server uses an image recognition algorithm to analyze the received images and identify people's faces or specific scenes. When important events are extracted, information about them is added to the diary.

[0054] Step 7:

[0055] The server performs sentiment analysis on the voice data. This allows it to determine emotional nuances from the user's tone of voice and phrasing, and incorporate emotional elements into the generated diary.

[0056] Step 8:

[0057] The server saves the final diary data to a cloud database. The data is synchronized with other user devices in the cloud, establishing a state where users can access their diaries from any device.

[0058] Step 9:

[0059] Users can view the generated diary data on their device and edit or add comments as needed. Completed diaries are managed within the app and can be viewed at any time.
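The server-side part of the flow above (Steps 3 to 7) can be sketched as a small orchestration function. The recognition, generation, image analysis, and emotion analysis components are passed in as plain callables because the description does not name concrete implementations; `Diary`, `build_diary`, and the parameter names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Diary:
    text: str
    events: list = field(default_factory=list)
    emotion: Optional[str] = None


def build_diary(audio: bytes,
                image: Optional[bytes],
                transcribe: Callable[[bytes], str],
                generate: Callable[[str], str],
                detect_events: Callable[[bytes], list],
                analyze_emotion: Callable[[bytes], str]) -> Diary:
    """Run the server-side steps in order and return the resulting diary."""
    transcript = transcribe(audio)              # Step 3: speech to text
    diary = Diary(text=generate(transcript))    # Step 4: diary-style text
    if image is not None:                       # Steps 5 and 6: image events
        diary.events.extend(detect_events(image))
    diary.emotion = analyze_emotion(audio)      # Step 7: emotion from the voice
    return diary
```

Passing the components as arguments keeps the sketch independent of any particular speech recognition or generative model, and it allows each stage to be replaced with a stub when testing the flow.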

[0060] (Example 1)

[0061] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0062] Recording and managing important moments in childcare in detail is a significant burden for parents. In particular, there is a need for a method that allows for easy recording of important events using audio and images, and automatically generates diary-style data organized based on emotions and events. Conventional technologies perform individual voice recognition, image analysis, and emotion analysis independently, lacking a system that integrates these processes. A system is needed to address these challenges and manage childcare records more simply and comprehensively.

[0063] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0064] In this invention, the server includes a device that converts speech into text information, a device that generates text and records based on the text information, a device that analyzes images to identify specific activities, and a device that performs emotion analysis and adds emotional information to the records. This makes it possible to automatically generate detailed and emotionally nuanced childcare records based on speech and images, thereby reducing the burden on parents.

[0065] A "speech input device" refers to an input device that captures a user's speech and converts it into electrical signals.

[0066] A "device that converts to text information" refers to a processing device that analyzes audio signals and generates corresponding text data.

[0067] A "device that generates text and records data" refers to a device that uses natural language processing to create a consistent written record based on input text data.

[0068] A "device that analyzes images to identify specific activities" refers to a processing device that analyzes input image data to identify and recognize specific situations or actions.

[0069] A "device that performs emotion analysis and adds emotional information to recorded data" refers to a device that extracts emotional information from audio and images and reflects it in the recorded data.

[0070] A "remote storage medium" refers to a remote storage system accessible via the internet, used for storing and sharing recorded data.

[0071] "A function that provides an operation screen to the user" refers to a means of providing an interface for the user to view, edit, and manage recorded data.

[0072] This invention illustrates an embodiment of a system that automatically creates childcare records using voice and images. The system primarily involves a terminal, a server, and a user each playing their respective roles, collectively achieving comprehensive record generation.

[0073] The device captures the user's voice using its voice input function. Mobile devices such as smartphones and tablets are used for this purpose. The captured voice is acquired as digital data by the device and transmitted to a server via a communication network. The voice data is then processed on the server.

[0074] The server converts audio data into text data using a speech recognition algorithm. Specifically, it utilizes a common speech recognition service for this conversion. Next, the server uses a natural language generation engine to create a diary document based on this text data. The generative AI model generates text that takes grammatical consistency and readability into consideration. This process allows users to easily obtain a detailed text record from audio.
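Speech recognition output usually contains some misrecognized words, and paragraph [0047] mentions correcting such errors with a language model. As a much simpler stand-in for that language model, the sketch below replaces out-of-vocabulary words with their closest known spelling using Python's standard `difflib` module. The function name, the vocabulary, and the 0.8 similarity cutoff are assumptions made for illustration.

```python
import difflib
from typing import Iterable


def correct_transcript(transcript: str,
                       vocabulary: Iterable[str],
                       cutoff: float = 0.8) -> str:
    """Replace out-of-vocabulary words with their closest known spelling.

    This is a lexicon-based approximation, not a language model.
    """
    known = sorted(set(vocabulary))
    corrected = []
    for word in transcript.split():
        if word in known:
            corrected.append(word)
            continue
        # get_close_matches returns at most n candidates above the cutoff.
        matches = difflib.get_close_matches(word, known, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)
```

Words with no sufficiently similar vocabulary entry are left unchanged, so the function never discards content it cannot interpret.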

[0075] Furthermore, photos related to childcare taken by the user are sent to the server via the device. The server analyzes these images and identifies specific events through facial recognition and activity recognition. Modern image recognition tools are used for this image analysis. Descriptions corresponding to the identified events are automatically added to the diary entry.

[0076] In addition, the server performs sentiment analysis using the audio data. Through this analysis, emotions are extracted from the tone and speed of the voice, and these are reflected in the recorded document. Dedicated software for sentiment analysis is used for this purpose.

[0077] Ultimately, all recorded data is securely stored on cloud-based storage. This allows users to access the data from any device and review and edit the records as needed.

[0078] As a concrete example, consider a scenario where a user says, "Today my child learned to ride a bicycle for the first time," and takes a photo. In this case, an example prompt would be "The day my child learned to ride a bicycle for the first time." The generative AI model would then generate a record stating, "Today, my child learned to ride a bicycle for the first time, pedaling with all their might. It was a day celebrated by family and friends." This entire process is fully automated, requiring no special action from the user.

[0079] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0080] Step 1:

[0081] The device captures the user's voice. It takes the user's speech as input via a microphone and generates digital audio data. Since this audio data cannot be used directly as text, it is first sent to the server in digital format.

[0082] Step 2:

[0083] The server processes the received audio data through a speech recognition algorithm. The input is audio data sent from the terminal, which is converted into text data by the speech recognition software. This process analyzes the audio waveform and produces output as a sequence of words.

[0084] Step 3:

[0085] The server uses a generative AI model to convert text data into diary-style entries. The input is text data obtained from speech recognition, and natural language generation technology is used for data processing to output a grammatically consistent and emotionally appropriate record document.

[0086] Step 4:

[0087] The user sends images they have taken from their device to the server. The input is image data acquired by the device. The device compresses the image data into an appropriate format and transfers it to the server via the communication network.

[0088] Step 5:

[0089] The server performs image analysis to identify specific events. The input is image data sent from the terminal, and through data processing using image recognition technology, it extracts features of events and situations to identify specific activities and outputs the results.

[0090] Step 6:

[0091] The server performs emotion analysis based on the audio data. The input is the initial audio data, and emotion analysis software analyzes the tone and speed of the voice, outputting emotional information such as joy, anger, sadness, and happiness, and reflecting the results in the diary document.

[0092] Step 7:

[0093] The server saves the final record data to the cloud and synchronizes it with other devices. The input is the final record document generated on the server, and the output is securely stored in cloud storage. Furthermore, the synchronization process is performed automatically, and users can access the records from various devices.

[0094] (Application Example 1)

[0095] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0096] For many parents, recording important moments in childcare and writing detailed childcare diaries is a cumbersome task. In particular, it is difficult to record every moment of childcare without omission and to accurately capture and preserve emotional nuances. Furthermore, sharing the generated records with other family members requires synchronization and access across different devices. This invention aims to solve these problems for families raising children at home and to make it easier and more effective to generate and share childcare records.

[0097] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0098] In this invention, the server includes means for inputting voice, means for converting the voice into text data, means for generating diary data by performing natural language generation based on the text data, means for inputting images and analyzing the images to identify specific events, means for analyzing emotions based on the voice and adding emotional information to the diary data, means for a mobile device to patrol the home and record voice and images in real time, means for transmitting the generated diary data to the cloud, and means for making the diary data accessible from a remote location. This makes it possible to automatically record important moments during childcare and generate emotionally rich diaries. Furthermore, the recorded data is securely stored via the cloud and can be easily accessed and shared from multiple devices.

[0099] "Means for inputting audio" refers to a device or technology that acquires an audio signal from an external source and prepares it for subsequent processing.

[0100] "Means of converting to text data" refers to the technology of converting acquired audio signals into text information, utilizing speech recognition technology.

[0101] "A means of generating diary data using natural language generation" refers to a technology that automatically generates diary-style data in a human-readable language format based on text data.

[0102] "Means for inputting an image and analyzing the image to identify a specific event" refers to a technology that takes in and analyzes image data to identify specific events or situations occurring within the image.

[0103] "Means for analyzing emotions and adding emotional information to the diary data" refers to a technique that analyzes the characteristics of audio data, estimates the speaker's emotional state, and then adds information related to those emotions to the diary data.

[0104] "A means for a mobile device to patrol a home and record audio and images in real time" refers to a technology in which a device that can move around a home continuously acquires and records audio and image data in real time.

[0105] "Methods for sending generated diary data to the cloud" refers to technologies that save locally generated diary data to a cloud storage system via the internet.

[0106] "Methods for making diary data accessible remotely" refers to technologies that allow diary data stored in the cloud to be safely and easily viewed from other locations, overcoming physical limitations.

[0107] "Means of synchronizing with other computing devices" refers to technologies that maintain data consistency between different devices while updating and sharing the same data in real time.

[0108] "Means of providing a display device to a user" refers to technology that includes a screen or device for the user to view and edit generated data.

[0109] The system to realize this application consists of a childcare support robot that operates continuously in the home, recording and analyzing audio and images during childcare. The server uses a mobile unit equipped with an audio input device and an image capture device to record parent-child interactions in real time.

[0110] The voice input device captures parental speech and child vocalizations and sends them to a server. The server uses speech recognition software to convert the voice data into text data. The "speech_recognition" library is used in this process, and the converted text data is further processed into a diary-style document by a natural language generation engine.

[0111] Next, the server receives image data captured by the image capture device and uses image analysis algorithms to recognize specific events. In this process, facial recognition and behavioral recognition technologies are utilized to identify important moments in childcare. The identified events are then added to the appropriate locations in the diary data.
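If "the appropriate locations" is read as chronological position, adding an identified event to the diary amounts to a sorted insert. The sketch below assumes the diary is kept as a list of `(timestamp, description)` pairs ordered by time; that representation and the `insert_event` helper are hypothetical.

```python
import bisect
from datetime import datetime


def insert_event(entries: list, timestamp: datetime, description: str) -> None:
    """Insert (timestamp, description) so that entries stay sorted by time."""
    times = [t for t, _ in entries]
    # bisect_right places the event after existing entries with the same time.
    index = bisect.bisect_right(times, timestamp)
    entries.insert(index, (timestamp, description))
```

For example, an event detected in a photo taken at 11:30 would be placed between a 09:00 entry and a 15:00 entry without re-sorting the whole diary.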

[0112] The server also performs emotion analysis based on the voice data to estimate the emotional state of the parent or child. Using the "emotion_analysis" module, it analyzes emotions from the tone and speed of the voice and adds that information to the diary data.

[0113] The generated diary data is securely stored in cloud storage. Through the "cloud_storage_module," the data can be accessed from multiple devices, allowing users to refer to their records from anywhere and share them with family members as needed.

[0114] For example, if a user says, "Today my child drew a picture for the first time," and takes a photo of that moment, the system will automatically generate a diary entry such as, "Today my child drew a picture with all their heart for the first time. It was a very touching moment, and the whole family celebrated." This allows users to record precious moments in parenting in detail without having to take any special action.

[0115] Examples of prompts to input into a generative AI model:

[0116] "Identify touching moments related to childcare from audio data and generate emotionally rich entries in diary format."
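A prompt like the one above would normally be combined with the analysis results before being sent to the generative AI model. The sketch below shows one possible way to assemble such a prompt; the `build_prompt` function and the field labels ("Transcript", "Events seen in images", "Estimated emotion") are assumptions, and only the instruction sentence is taken from the text.

```python
INSTRUCTION = ("Identify touching moments related to childcare from audio data "
               "and generate emotionally rich entries in diary format.")


def build_prompt(transcript: str, events: list, emotion: str) -> str:
    """Combine the instruction with the analysis results into one prompt string."""
    lines = [INSTRUCTION, "", f"Transcript: {transcript}"]
    if events:
        # Events come from image analysis; omit the line when there are none.
        lines.append("Events seen in images: " + "; ".join(events))
    if emotion:
        lines.append(f"Estimated emotion: {emotion}")
    return "\n".join(lines)
```

Keeping the instruction separate from the per-entry data makes it straightforward to change the wording of the instruction without touching the code that gathers the transcript, events, and emotion.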

[0117] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0118] Step 1:

[0119] The server acquires parental speech and child vocalizations in real time from the voice input device installed in the terminal. When an audio signal is input, the server passes the audio signal to speech recognition software, which converts it into text data. At this time, the audio data is sampled, features are extracted, and then it is converted into a string.

[0120] Step 2:

[0121] Based on the converted text data, the server uses a natural language generation engine to generate a diary-style document. The engine takes the text data as input and, based on a language model, outputs human-readable text. This creates a diary documenting events related to childcare.

[0122] Step 3:

[0123] The server receives image data acquired from the terminal's image capture device. The input images are processed using image analysis algorithms. The server performs face recognition and behavior recognition to identify specific events. The identified event information is incorporated into the diary data, and appropriate explanations are added.

[0124] Step 4:

[0125] Based on the audio data, the server performs emotion analysis. The audio data is input, and by analyzing its acoustic characteristics, the emotional state of the parent or child is estimated. As a result of the analysis, emotional information such as joy or surprise is generated and added to the diary data.

[0126] Step 5:

[0127] The server uploads the generated diary data to cloud storage, transmitting it securely via a cloud API. The diary data stored in the cloud is synchronized and accessible from other devices.

[0128] Step 6:

[0129] Users can access their cloud-based diary data from another device as needed. By starting their device and accessing cloud storage, the saved diary data is downloaded and displayed. This allows users to review and share their childcare records with their family from anywhere.
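Steps 1 to 6 above can be sketched as a single pipeline. The following is a minimal illustration under stated assumptions, not the disclosed implementation: the `DiaryEntry` type, the function names, and the recognizers passed in as callables are all hypothetical stand-ins for the speech recognition, natural language generation, image analysis, and emotion analysis components, and a plain dictionary stands in for cloud storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DiaryEntry:
    """Hypothetical container for the diary data built up in Steps 1 to 4."""
    text: str
    events: List[str] = field(default_factory=list)
    emotions: List[str] = field(default_factory=list)


def build_diary_entry(
    audio: bytes,
    image: bytes,
    transcribe: Callable[[bytes], str],             # Step 1: speech -> text
    generate: Callable[[str], str],                 # Step 2: text -> diary prose
    detect_events: Callable[[bytes], List[str]],    # Step 3: image -> events
    analyze_emotion: Callable[[bytes], List[str]],  # Step 4: audio -> emotions
) -> DiaryEntry:
    """Run Steps 1 to 4 in order and collect the results in one entry."""
    transcript = transcribe(audio)
    entry = DiaryEntry(text=generate(transcript))
    entry.events.extend(detect_events(image))
    entry.emotions.extend(analyze_emotion(audio))
    return entry


def upload(store: Dict[str, DiaryEntry], key: str, entry: DiaryEntry) -> None:
    """Step 5: save the entry (the dict is an assumed stand-in for the cloud)."""
    store[key] = entry


def download(store: Dict[str, DiaryEntry], key: str) -> DiaryEntry:
    """Step 6: fetch the entry again, as another device would."""
    return store[key]
```

Passing the components in as callables keeps the sketch independent of any particular recognition or generation service; each one could be replaced without changing the order of the steps.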

[0130] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0131] This invention incorporates an emotion engine that analyzes the user's emotions, in addition to a system that automatically creates childcare records using voice and image data. This system allows users to record important moments in childcare and save them, including the emotions they felt at the time, without requiring any special operation.

[0132] When a user speaks about childcare, the device activates its voice input function, records the audio data, and transfers the recorded audio data to a server. The server converts the audio data into text using speech recognition technology. A natural language generation engine then processes this text data and produces text in a diary format.

[0133] In addition, the system is equipped with an emotion engine to recognize the user's emotions. Based on voice data, the emotion engine analyzes the user's voice quality, tone, and speed to determine their emotional state. This emotional information is then added to the diary data, allowing the user to reflect on their emotions at that time. Furthermore, the emotion engine extracts intentions and emotions from text data and adds appropriate nuances to the writing.

[0134] For example, if a user says, "I was so happy today because my child rode a bicycle for the first time," the emotion engine extracts the emotions of "joy" and "sense of accomplishment" from that voice and reflects them in the diary data. This generates a diary entry that vividly conveys emotions such as, "Today, my child was able to ride a bicycle by themselves for the first time, and as a parent, I was extremely happy."
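One way to picture an emotion engine that combines voice characteristics with the spoken text is a small rule-based estimator. This is only a sketch under assumptions: the `VoiceFeatures` fields, the numeric thresholds, and the keyword table are hypothetical values chosen for illustration, and a real engine would more likely rely on a trained model such as the emotion identification model 59.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VoiceFeatures:
    """Hypothetical acoustic summary of one utterance."""
    mean_pitch_hz: float     # average fundamental frequency (tone)
    words_per_second: float  # speaking speed
    mean_energy: float       # loudness, normalized to 0..1


# Assumed keyword-to-emotion table for the text side of the analysis.
KEYWORDS: Dict[str, str] = {
    "happy": "joy",
    "glad": "joy",
    "first time": "sense of accomplishment",
    "surprised": "surprise",
}


def emotions_from_voice(features: VoiceFeatures) -> List[str]:
    """Map tone, speed, and loudness to labels using assumed thresholds."""
    labels = []
    if features.mean_pitch_hz > 220 and features.words_per_second > 2.5:
        labels.append("joy")
    if features.mean_energy > 0.8:
        labels.append("surprise")
    return labels


def emotions_from_text(text: str) -> List[str]:
    """Pick up emotion cues that appear literally in the transcript."""
    lowered = text.lower()
    return [label for keyword, label in KEYWORDS.items() if keyword in lowered]


def estimate_emotions(features: VoiceFeatures, text: str) -> List[str]:
    """Merge both sources, keeping first-seen order and dropping duplicates."""
    merged: List[str] = []
    for label in emotions_from_voice(features) + emotions_from_text(text):
        if label not in merged:
            merged.append(label)
    return merged
```

With a raised pitch, a fast speaking rate, and the bicycle utterance quoted above as the transcript, this sketch yields the labels "joy" and "sense of accomplishment", which could then be attached to the diary data.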

[0135] When a user takes a photo with their device, the image data is sent to a server, where an image recognition algorithm identifies important events and people. The image recognition results are incorporated into the diary data as visual elements and saved together with the text.

[0136] All generated data is stored in the cloud and synchronized with other devices. This synchronization allows users to access their diaries from various devices and easily view and edit complete childcare records, including diagrams, photos, and text.

[0137] Thus, the present invention provides a system that allows for the preservation of deeper memories by analyzing and recording the emotional elements in childcare records.

[0138] The following describes the processing flow.

[0139] Step 1:

[0140] When a user speaks about their experiences with childcare, the device activates its voice input function and records the user's voice. The recorded audio data is saved to the device immediately.

[0141] Step 2:

[0142] Once recording is complete, the device transfers the audio data to the server. During data transfer, the audio data is encrypted to protect privacy.

[0143] Step 3:

[0144] The server receives the audio data, which is then analyzed by a speech recognition module to convert it into appropriate text data. Acoustic and linguistic models are used during this process to ensure accurate text transcription.

[0145] Step 4:

[0146] The server passes text data to a natural language generation engine, which then generates meaningful sentences in a diary format. This process constructs sentences that include contextually relevant content, ensuring the text is easily readable.

[0147] Step 5:

[0148] The server analyzes the received audio data using an emotion engine to identify emotions from the user's voice tone and speaking style. The identified emotion information is then added as emotion labels to the diary data produced by natural language generation.

[0149] Step 6:

[0150] When a user takes a photo while taking care of their child, the device sends the image data to the server. Before uploading, the photos undergo moderate compression to optimize their size while ensuring high quality.

[0151] Step 7:

[0152] The server uses image recognition technology to analyze the transmitted images. Here, it recognizes specific events and situations through facial recognition and object detection, and organizes the content deemed important to reflect in the diary.

[0153] Step 8:

[0154] The server generates the final diary data, saving the included text, images, and sentiment information as a single document to the cloud. During this process, data integrity is verified, and the data is synchronized in real time across the user's other devices.

[0155] Step 9:

[0156] Users can view the generated diary data from any device. The application provides a user-friendly interface that allows users to add comments and edit records as needed.
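Step 8 above saves text, images, and emotion information as a single document and verifies data integrity. A minimal sketch of one way to do this is shown below, assuming a JSON document with a SHA-256 checksum; the field names and both function names are hypothetical, and the source does not specify which integrity mechanism is used.

```python
import hashlib
import hmac
import json
from typing import Any, Dict, List


def _canonical_bytes(body: Dict[str, Any]) -> bytes:
    """Serialize with sorted keys so the same content always hashes the same."""
    return json.dumps(body, sort_keys=True, ensure_ascii=False).encode("utf-8")


def make_diary_document(
    text: str, image_ids: List[str], emotions: List[str]
) -> Dict[str, Any]:
    """Bundle the diary parts into one document and attach a checksum."""
    body = {"text": text, "images": image_ids, "emotions": emotions}
    digest = hashlib.sha256(_canonical_bytes(body)).hexdigest()
    return {"body": body, "sha256": digest}


def verify_diary_document(document: Dict[str, Any]) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    expected = hashlib.sha256(_canonical_bytes(document["body"])).hexdigest()
    return hmac.compare_digest(expected, document["sha256"])
```

A receiving device could call `verify_diary_document` after download and request the document again if the check fails.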

[0157] (Example 2)

[0158] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0159] Keeping a record of child-rearing typically requires parents to manually write diaries or compile photos into albums, which is time-consuming and laborious. Furthermore, it's difficult to reflect emotions and feelings in these records in real time. Additionally, easily accessing and editing the generated records across multiple devices is challenging.

[0160] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0161] In this invention, the server includes a device for inputting voice, a device for converting the voice into text data, and a device for generating recorded data by performing natural language generation based on the text data. This allows users to record important moments in childcare through voice input without having to manually enter detailed information themselves, and to automatically generate records that accurately reflect emotions and content. Furthermore, this data is managed on the cloud, enabling easy synchronization and editing across various devices.

[0162] A "speech input device" is a device that has the function of capturing a user's speech as a digital signal.

[0163] A "device that converts to text data" is a device that analyzes audio signals and generates corresponding strings of characters.

[0164] A "device that generates natural language and records data" is a device that automatically creates human-readable text based on character data and saves it as a record.

[0165] A "device that analyzes images to identify specific events" is a device that analyzes input image data and has the function of identifying important events or elements from it.

[0166] A "device that analyzes emotions and adds emotional information to the recorded data" is a device that extracts emotions from voice or text data and adds them to the recorded data, thereby giving the content emotional nuances.

[0167] A "device that enhances recorded data as a visual element" is a device that has the function of making information recorded using image data more visually appealing and richer in content.

[0168] An "information management infrastructure" is a network infrastructure for safely and efficiently storing and managing digital data.

[0169] An "information processing terminal" is an electronic device capable of creating, editing, saving, and transmitting recorded data.

[0170] A "user interface" is software or hardware that provides screens and operating methods used by users to access and manipulate digital data.

[0171] This invention is a system for automatically generating childcare records, utilizing voice input, emotion analysis, and image recognition technologies. This system allows users to record daily childcare moments and save them, including the emotions associated with them, without requiring any special operation.

[0172] The system works as follows: When a user speaks about childcare, the device uses an audio input device to record the audio as a digital signal. Information processing terminals such as smartphones and tablets are used for this purpose. The recorded audio data is transferred from the device to the server via a secure communication protocol.

[0173] The server converts the received audio data into text data using speech recognition software. Specifically, a general-purpose speech recognition API is used. This text data is processed by a natural language generation engine to produce the recorded data. A general-purpose generative AI model serves as the natural language generation engine.

[0174] Furthermore, the server uses an emotion engine to analyze emotions from the audio data. This process determines the user's emotions based on factors such as voice quality, tone, and speed, and adds this information to the recorded data. Emotions are also extracted from text data, and appropriate nuances are added to the text.

[0175] When a user takes childcare-related photos with their device, the image data is sent to a server. The server uses an image recognition algorithm to identify important events and people. The identified information is incorporated into the recorded data as visual information and stored together with text.

[0176] All data is stored in the cloud and synchronized across different devices. This allows users to easily access, view, and edit recorded data from their home PC or devices while on the go.
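The synchronization described here can be pictured with a revision counter: the cloud store numbers every write, and each device pulls only the changes made since the revision it last saw. The sketch below is an assumption made for illustration; the `DiaryStore` and `Device` classes are hypothetical, and the source does not state how synchronization or conflicting edits are actually handled.

```python
from typing import Dict, List, Tuple


class DiaryStore:
    """In-memory stand-in for the cloud; every write gets a new revision."""

    def __init__(self) -> None:
        self._revision = 0
        self._entries: Dict[str, Tuple[int, str]] = {}

    def put(self, entry_id: str, text: str) -> int:
        """Create or overwrite an entry and return its revision number."""
        self._revision += 1
        self._entries[entry_id] = (self._revision, text)
        return self._revision

    def changes_since(self, revision: int) -> List[Tuple[int, str, str]]:
        """Return (revision, entry_id, text) for entries newer than `revision`."""
        changed = [
            (rev, entry_id, text)
            for entry_id, (rev, text) in self._entries.items()
            if rev > revision
        ]
        return sorted(changed)


class Device:
    """A terminal that keeps a local copy and remembers what it has seen."""

    def __init__(self) -> None:
        self.last_seen = 0
        self.local: Dict[str, str] = {}

    def sync(self, store: DiaryStore) -> None:
        """Pull every change made since the last sync into the local copy."""
        for rev, entry_id, text in store.changes_since(self.last_seen):
            self.local[entry_id] = text
            self.last_seen = max(self.last_seen, rev)
```

In this sketch the most recent write simply wins; an edit made on one device replaces the older text on the others at their next sync.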

[0177] For example, if a user says, "I was so happy today because my child rode a bicycle for the first time," the emotion engine extracts the emotions of "joy" and "accomplishment" from this audio and reflects them in the recorded data. This generates emotionally rich sentences such as, "Today, my child rode a bicycle by themselves for the first time. I was so happy."

[0178] An example of a prompt message is: "Audio data: Today my child was happy to ride a bicycle for the first time. Emotion: Joy, sense of accomplishment. Please summarize this in a diary entry."
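A prompt of this shape could be assembled from the speech recognition result and the emotion labels as sketched below. The function name `build_diary_prompt` and its parameters are hypothetical; only the wording of the prompt follows the example above.

```python
from typing import Sequence


def build_diary_prompt(transcript: str, emotions: Sequence[str]) -> str:
    """Combine the transcript and emotion labels into one prompt string."""
    emotion_list = ", ".join(emotions)
    return (
        f"Audio data: {transcript} "
        f"Emotion: {emotion_list}. "
        "Please summarize this in a diary entry."
    )


prompt = build_diary_prompt(
    "Today my child was happy to ride a bicycle for the first time.",
    ["Joy", "sense of accomplishment"],
)
```

The resulting string would then be passed to the generative AI model together with any other inference data.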

[0179] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0180] Step 1:

[0181] When a user begins speaking about childcare, the device uses its voice input device to record the audio as a digital signal in real time. The input data is the user's voice, which is then recorded and output as digital audio data. Smartphones and tablets are useful for this process.

[0182] Step 2:

[0183] The terminal transfers recorded audio data to the server using a secure communication protocol. The input here is digitized audio data, and the output is the secure transmission of data to the server. Appropriate encryption technology is used to ensure that the audio data reliably reaches the server.

[0184] Step 3:

[0185] The server converts the received audio data into text data using speech recognition software. The input is the audio data stored on the server, and the output is the text data generated by speech recognition. In this step, the audio signal is analyzed using a speech recognition API and the corresponding text is generated.

[0186] Step 4:

[0187] The server processes the generated text data using a natural language generation engine and generates conversational text as recorded data. The input is text data generated by speech recognition, and the output is text in natural language format. Text generation based on prompt sentences is performed using a generative AI model.

[0188] Step 5:

[0189] The server uses an emotion engine to analyze emotions from audio data and adds emotional information to the recorded data. The input here is audio and text data, and the output is recorded data with added emotional information. By analyzing the voice tone and speed, the server identifies the user's current emotions and reflects them in the recording.

[0190] Step 6:

[0191] The user sends image data captured with their device to the server. The input is the captured image data, and the output is the transmission of data to the server. This is done easily and quickly because it utilizes cloud-based data sharing.

[0192] Step 7:

[0193] The server uses image recognition algorithms to analyze transmitted image data and identify specific events or important individuals. The input is image data, and the output is recognized event information. Image content is filtered, tagged, and integrated into the recorded data.

[0194] Step 8:

[0195] The server stores all data in the cloud and synchronizes it with different data processing terminals. Input is the final recorded data, and output is storage in the cloud and data synchronization to different terminals. This allows users to access and edit data from anywhere.
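Step 2 above calls for a secure communication protocol with appropriate encryption but does not name one. Assuming TLS is used, the terminal side could prepare its connection settings as in the following sketch, which uses only the Python standard library; the function name and the choice of TLS 1.2 as the minimum version are assumptions.

```python
import ssl


def make_client_tls_context() -> ssl.SSLContext:
    """Create a TLS client context that verifies the server it talks to."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Assumed policy: refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The returned context can be supplied to a standard HTTPS client when the terminal uploads audio or image data, so that the data is encrypted in transit and the server's identity is checked.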

[0196] (Application Example 2)

[0197] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0198] In childcare, there is a need for parents to record important moments without missing them and to reflect their emotions and thoughts in real time. Furthermore, there is a need for systems that reduce the emotional and cognitive burden on parents and support the childcare process. Conventional technologies have limited capabilities for automatic generation of childcare records and emotion analysis, failing to fully utilize the potential of multi-functional digital devices.

[0199] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0200] In this invention, the server includes means for inputting voice, means for converting voice into text data, means for generating diary data by performing natural language generation based on the text data, means for inputting images and analyzing images to identify specific events, means for analyzing emotions based on voice and adding emotional information to the diary data, means for analyzing the overall nature of the diary data and providing emotional and cognitive support using a new information processing device, and means for automatically generating records using multiple digital devices and saving childcare records from multiple perspectives. As a result, parents can richly record moments of childcare and confidently look back on emotions and events without requiring any special operation.

[0201] "Means of inputting voice" refers to a function used by an information processing device to accurately acquire voice data.

[0202] "Means of converting to text data" refers to the process of converting acquired audio data into text format.

[0203] "A means of generating diary data by performing natural language generation" refers to a technology that automatically creates a structured diary using natural language processing based on text data.

[0204] "A means of inputting an image, analyzing the image, and identifying a specific event" refers to an algorithm that accurately recognizes image data and identifies the events or situations captured within it.

[0205] "A means of analyzing emotions based on voice and adding emotional information to diary data" refers to the process of extracting emotions from voice data and integrating them into diary data.

[0206] A "new information processing device" is a device designed using the latest technology and possesses high data processing capabilities.

[0207] "A means of automatically generating records using digital devices and saving childcare records from multiple perspectives" refers to a method of automatically collecting and saving childcare-related information using multiple electronic devices.

[0208] This invention is a system that automatically generates childcare records and reduces the emotional and cognitive burden on users. This system integrates a voice input device, a voice processing server, an image recognition device, a cloud synchronization system, and an emotion analysis engine to support parents.

[0209] The user inputs conversations about childcare using a device equipped with speech recognition capabilities. The device sends the voice data to a cloud server, which converts the voice into text data using the Google® Cloud Speech-to-Text API. The resulting text data is then processed using OpenAI®'s GPT model, which performs natural language generation to create an emotionally rich childcare diary.

[0210] During this process, the Microsoft® Azure® Text Analytics API analyzes the voice tone and speed to extract the user's emotions. This emotional information is integrated into the diary data, allowing the user to later reflect on their feelings at that time. Additionally, photos taken by the user are analyzed by Amazon Rekognition to identify specific events and individuals. This ensures that the record includes visual elements as well.

[0211] All generated data is stored in the cloud using Firebase and can be synchronized with other digital devices. This allows users to view and edit their parenting records from various devices. The system enriches parents' parenting experience by recording daily moments of parenting in real time and providing emotional support.

[0212] For example, if a user voice-inputs "Today my child rode a bicycle for the first time," the system can analyze the content, extract "joy" and "sense of accomplishment" using its emotion engine, and reflect them in the diary. An example of a prompt would be, "My 3-year-old child sang a song by themselves for the first time. Please generate a record that reflects the joy and surprise the parent felt at that moment."

[0213] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0214] Step 1:

[0215] The user inputs voice using a device. The input voice data is temporarily stored on the device and then sent to a cloud server for speech recognition. Here, the voice is captured as data without any user intervention.

[0216] Step 2:

[0217] The server uses the Google Cloud Speech-to-Text API to convert the transmitted audio data into text data. It parses the audio data and converts its content into structured text format. The input here is audio data, and the output is text data.

[0218] Step 3:

[0219] The server inputs the acquired text data into an OpenAI GPT model and performs natural language generation. The generative AI model generates detailed and emotionally rich diary-style entries based on the input text. Here, the input is text data, and the output is diary entries.

[0220] Step 4:

[0221] The server uses the Microsoft Azure Text Analytics API to perform sentiment analysis on the audio data, extracting emotions from the user's tone and speed of voice. This analyzes the emotional state, and the resulting emotional information is directly added to the diary data. The input is audio data, and the output is data containing emotional information.

[0222] Step 5:

[0223] The user takes an image with their device, and the image data is sent to a cloud server. The server uses Amazon Rekognition to analyze the image and identify specific events or people. The analysis results are converted into text format and integrated as visual information into the diary data. The input is image data, and the output is the analyzed information used in the diary.

[0224] Step 6:

[0225] The server stores the generated diary data using Firebase cloud storage and synchronizes it with other digital devices as needed. This allows users to view and edit their childcare records from any device. The input is all the data that makes up the diary, and the output is the diary data stored in cloud storage.
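Step 5 above converts image analysis results into text before adding them to the diary data. A minimal sketch of that conversion follows. It assumes the analysis result is a dictionary holding a list of labels with confidence scores; the field names `Labels`, `Name`, and `Confidence`, the 90.0 threshold, and the output wording are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def labels_to_sentence(response: Dict[str, Any], min_confidence: float = 90.0) -> str:
    """Turn confident image labels into one sentence for the diary."""
    names: List[str] = [
        label["Name"].lower()
        for label in response.get("Labels", [])
        if label.get("Confidence", 0.0) >= min_confidence
    ]
    if not names:
        return ""  # nothing confident enough to mention
    if len(names) == 1:
        listed = names[0]
    else:
        listed = ", ".join(names[:-1]) + " and " + names[-1]
    return f"The photo shows: {listed}."


# Hypothetical analysis result for a photo of a child on a bicycle.
sample = {
    "Labels": [
        {"Name": "Bicycle", "Confidence": 98.1},
        {"Name": "Child", "Confidence": 95.0},
        {"Name": "Helmet", "Confidence": 60.2},
    ]
}
sentence = labels_to_sentence(sample)
```

Filtering by confidence keeps uncertain detections, such as the helmet in this sample, out of the diary text.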

[0226] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0227] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0228] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0229] [Second Embodiment]

[0230] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0231] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0232] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0233] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0234] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0235] Camera 42 is a small digital camera equipped with an optical system (a lens, aperture, and shutter) and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor. It captures images of the area around the user 20 (for example, an imaging range defined by an angle of view equivalent to the field of vision of a typical healthy person).

[0236] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0237] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0238] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0239] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0240] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0241] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0242] This invention provides a system that automatically creates childcare records using voice and images, eliminating the need for parents to manually record important moments in childcare while allowing them to save and manage detailed data.

[0243] When a user speaks about childcare-related events, the device captures the user's voice using its voice input function. The recorded audio is first acquired as audio data in real time by the device and then sent to the server. This system uses a speech recognition algorithm to convert the audio data into text data.

[0244] After the server converts the audio to text, a natural language generation engine runs to create a diary-style document based on the converted text data. At this stage, grammatical consistency and readability are considered, and the text is refined to be easy for the user to understand when reviewing it later.

[0245] Furthermore, photos related to childcare taken by the user are transferred from the device to the server. The server analyzes these images, performing facial recognition and behavioral recognition to identify specific events. When important moments are recognized, corresponding explanations are added to the diary data.

[0246] The server then performs emotion analysis on the audio data. By analyzing the tone and speed of the voice, emotions such as joy and surprise are reflected in the diary data. This allows users to have richer records with emotional nuances.

[0247] Ultimately, all data is securely stored in the cloud and synchronized across the user's other devices. This allows users to access their diaries from any device and edit them as needed. Based on this, users can review their records on a daily basis and share them with their families.

[0248] For example, if a user says, "Today my child rode a bicycle for the first time," and takes a photo, the natural language generation engine will generate a detailed diary entry such as, "Today, my child rode a bicycle for the first time, and here they are, pedaling with all their might. It was a day when family and friends celebrated with them." This entire process is automated, and the user does not need to do anything special.

[0249] The following describes the processing flow.

[0250] Step 1:

[0251] When a user speaks about events related to childcare, the device activates its voice input function and records the user's voice. The device suppresses ambient noise and saves the recording as clear audio data.

[0252] Step 2:

[0253] The device transfers the recorded audio data to the server. During the transfer, encryption is applied to protect the data, and a secure communication channel is used.

[0254] Step 3:

[0255] The server analyzes the received audio data using a speech recognition system and converts it into linguistic text. This process uses a language model to correct any errors in speech recognition.

[0256] Step 4:

[0257] The server sends text data based on voice input to a natural language generation engine. The engine generates a contextual diary entry based on the text and adjusts it to accurately reflect the user's intent.

[0258] Step 5:

[0259] The device receives images from the user and sends the image data to the server. The images are converted to the appropriate format beforehand and sent while maintaining high resolution.

[0260] Step 6:

[0261] The server uses an image recognition algorithm to analyze the received images and identify people's faces or specific scenes. When important events are extracted, information about them is added to the diary.

[0262] Step 7:

[0263] The server performs sentiment analysis on the voice data. This allows it to determine emotional nuances from the user's tone of voice and phrasing, and incorporate emotional elements into the generated diary.

[0264] Step 8:

[0265] The server saves the final diary data to a cloud database. The data is synchronized with the user's other devices via the cloud, so that users can access their diaries from any device.

[0266] Step 9:

[0267] Users can view the generated diary data on their device and edit or add comments as needed. Completed diaries are managed within the app and can be viewed at any time.
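Step 3 above uses a language model to correct speech recognition errors. One common way to do this, shown here only as a sketch under assumptions, is to rescore several candidate transcripts with a language model and keep the most plausible one. The tiny add-one-smoothed bigram model, the training sentences, and the acoustic scores below are all hypothetical; the source does not say which kind of language model is used.

```python
import math
from collections import Counter
from typing import List, Tuple

Model = Tuple[Counter, Counter, int]


def train_bigram_model(corpus: List[str]) -> Model:
    """Count word pairs in the corpus, with sentence start and end markers."""
    contexts: Counter = Counter()
    bigrams: Counter = Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])
    return contexts, bigrams, len(vocab)


def lm_log_prob(sentence: str, model: Model) -> float:
    """Log-probability of a sentence under add-one-smoothed bigrams."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total


def rescore(hypotheses: List[Tuple[str, float]], model: Model,
            lm_weight: float = 1.0) -> str:
    """Pick the hypothesis with the best acoustic plus weighted LM score."""
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * lm_log_prob(h[0], model))
    return best[0]


# Hypothetical training sentences and recognizer output (text, acoustic log score).
model = train_bigram_model([
    "my child rode a bicycle",
    "my child drew a picture",
    "my child sang a song",
])
candidates = [
    ("my child road a bicycle", -4.0),
    ("my child rode a bicycle", -4.2),
]
corrected = rescore(candidates, model)
```

Here the recognizer slightly prefers "road" on acoustic grounds, but the language model has seen "child rode" and never "child road", so the combined score selects "my child rode a bicycle".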

[0268] (Example 1)

[0269] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0270] Recording and managing important moments in childcare in detail is a significant burden for parents. In particular, there is a need for a method that allows for easy recording of important events using audio and images, and automatically generates diary-style data organized based on emotions and events. Conventional technologies perform individual voice recognition, image analysis, and emotion analysis independently, lacking a system that integrates these processes. A system is needed to address these challenges and manage childcare records more simply and comprehensively.

[0271] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0272] In this invention, the server includes a device that converts speech into text information, a device that generates text and records based on the text information, a device that analyzes images to identify specific activities, and a device that performs emotion analysis and adds emotional information to the records. This makes it possible to automatically generate detailed and emotionally nuanced childcare records based on speech and images, thereby reducing the burden on parents.

[0273] A "speech input device" refers to an input device that captures a user's speech and converts it into electrical signals.

[0274] A "device that converts to text information" refers to a processing device that analyzes audio signals and generates corresponding text data.

[0275] A "device that generates text and records data" refers to a device that uses natural language processing to create a consistent written record based on input text data.

[0276] A "device that analyzes images to identify specific activities" refers to a processing device that analyzes input image data to identify and recognize specific situations or actions.

[0277] A "device that performs emotion analysis and adds emotional information to recorded data" refers to a device that extracts emotional information from audio and images and reflects it in the recorded data.

[0278] A "remote storage medium" refers to a remote storage system accessible via the internet, used for storing and sharing recorded data.

[0279] "A function that provides an operation screen to the user" refers to a means of providing an interface for the user to view, edit, and manage recorded data.

[0280] This invention illustrates an embodiment of a system that automatically creates childcare records using voice and images. The system primarily involves a terminal, a server, and a user each playing their respective roles, collectively achieving comprehensive record generation.

[0281] The device captures the user's voice using its voice input function. Mobile devices such as smartphones and tablets are used for this purpose. The captured voice is acquired as digital data by the device and transmitted to a server via a communication network. The voice data is then processed on the server.

[0282] The server uses a speech recognition algorithm to convert speech data into text data. Specifically, this conversion is performed by leveraging a general speech recognition service. Next, the server uses a natural language generation engine to create a diary document based on this text data. A generative AI model generates the text with attention to grammatical consistency and readability. Through this process, the user can easily obtain a detailed text record from speech.

[0283] Furthermore, photos related to child-rearing taken by the user are sent to the server through the terminal. The server analyzes these images and identifies specific events through face recognition and activity recognition. Modern image recognition tools are used for this image analysis. An explanation corresponding to the identified event is automatically added to the diary document.

[0284] In addition, the server performs sentiment analysis using the speech data. Through the analysis, sentiment is extracted from the tone and speed of the speech and reflected in the record document. For this purpose, dedicated software for sentiment analysis is used.
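
As a non-limiting illustration of how sentiment might be derived from the tone and speed of speech, the following minimal sketch computes a speaking rate and a loudness value and maps them to a coarse label. The names `VoiceFeatures`, `extract_features`, and `coarse_emotion`, the use of RMS loudness as a stand-in for tone, and both thresholds are assumptions made for illustration; the specification does not name the dedicated sentiment analysis software or its features.

```python
import math
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    speaking_rate_wps: float  # words per second, a proxy for "speed"
    rms_energy: float         # loudness of normalised samples, a proxy for "tone"


def extract_features(samples: list[float], sample_rate: int, transcript: str) -> VoiceFeatures:
    """Derive two simple features from raw samples and their transcript."""
    duration_s = len(samples) / sample_rate
    word_count = len(transcript.split())
    rate = word_count / duration_s if duration_s > 0 else 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0
    return VoiceFeatures(rate, rms)


def coarse_emotion(features: VoiceFeatures, fast_wps: float = 3.0, loud_rms: float = 0.3) -> str:
    """Map the features to a coarse label using hypothetical thresholds."""
    fast = features.speaking_rate_wps >= fast_wps
    loud = features.rms_energy >= loud_rms
    if fast and loud:
        return "excited"
    if loud:
        return "emphatic"
    if fast:
        return "hurried"
    return "calm"
```

A practical system would likely use pitch contours and a trained classifier rather than fixed thresholds; the sketch only shows where tone and speed enter the decision.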

[0285] Finally, all the recorded data is securely stored in a storage medium on the cloud. Thereby, the user can access the data from any terminal and view and edit the records as needed.

[0286] As a specific example, consider the case where the user says "Today, my child rode a bicycle for the first time" and takes a photo. In this case, a prompt such as "The day when my child learned to ride a bicycle for the first time" is used. The generative AI model then generates a record such as "Today, my child rode a bicycle for the first time and was pedaling hard. It was a day when family and friends also blessed us." This series of processes is fully automated, and the user does not need to perform any special operation.

[0287] The flow of the specific process in Example 1 will be described using Figure 11.

[0288] Step 1:

[0289] The device captures the user's voice. It takes the user's speech as input via a microphone and generates digital audio data. Since this audio data cannot be used directly as text, it is first sent to the server in digital format.

[0290] Step 2:

[0291] The server processes the received audio data through a speech recognition algorithm. The input is audio data sent from the terminal, which is converted into text data by the speech recognition software. This process analyzes the audio waveform and produces output as a sequence of words.

[0292] Step 3:

[0293] The server uses a generative AI model to convert text data into diary-style entries. The input is text data obtained from speech recognition, and natural language generation technology is used for data processing to output a grammatically consistent and emotionally appropriate record document.

[0294] Step 4:

[0295] The user sends images they have taken from their device to the server. The input is image data acquired by the device. The device compresses the image data into an appropriate format and transfers it to the server via the communication network.

[0296] Step 5:

[0297] The server performs image analysis to identify specific events. The input is image data sent from the terminal, and through data processing using image recognition technology, it extracts features of events and situations to identify specific activities and outputs the results.

[0298] Step 6:

[0299] The server performs emotion analysis based on the audio data. The input is the initial audio data, and emotion analysis software analyzes the tone and speed of the voice, outputting emotional information such as joy, anger, sadness, and happiness, and reflecting the results in the diary document.

[0300] Step 7:

[0301] The server saves the final record data to the cloud and synchronizes it with other devices. The input is the final record document generated on the server, and the output is securely stored in cloud storage. Furthermore, the synchronization process is performed automatically, and users can access the records from various devices.
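
The server-side portion of the flow above (Steps 2 to 6) can be sketched as a single function that chains the stages. This is an illustrative outline only: `DiaryRecord`, `build_record`, and the four callables are hypothetical names, and each callable stands in for whatever speech recognition, generative AI, image recognition, or emotion analysis component is actually deployed. Saving and synchronization (Step 7) are left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DiaryRecord:
    text: str
    events: list[str] = field(default_factory=list)
    emotions: list[str] = field(default_factory=list)


def build_record(
    audio: bytes,
    images: list[bytes],
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
    detect_events: Callable[[bytes], list[str]],
    analyze_emotion: Callable[[bytes], list[str]],
) -> DiaryRecord:
    """Run the stages in the order of the described flow."""
    transcript = transcribe(audio)                      # Step 2: speech to text
    record = DiaryRecord(text=generate(transcript))     # Step 3: diary-style text
    for image in images:                                # Steps 4-5: image events
        record.events.extend(detect_events(image))
    record.emotions = analyze_emotion(audio)            # Step 6: emotion from audio
    return record
```

Passing the stages in as callables keeps the sketch independent of any particular vendor and makes each stage replaceable by a stub during testing.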

[0302] (Application Example 1)

[0303] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0304] For many parents, recording important moments in childcare and creating detailed childcare diaries is a cumbersome task. In particular, it is difficult to record every moment of childcare without omission and to accurately capture and preserve emotional nuances. Furthermore, sharing the generated records with other family members requires synchronization and access across different devices. This invention aims to solve these problems when raising children at home and to generate and share childcare records more easily and effectively.

[0305] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0306] In this invention, the server includes means for inputting voice, means for converting the voice into text data, means for performing natural language generation based on the text data to generate diary data, means for inputting an image, analyzing the image to identify a specific event, means for analyzing the emotion based on the voice and adding emotion information to the diary data, means for a mobile body to patrol within a home and record voice and images in real time, means for transmitting the generated diary data to the cloud, and means for enabling access to the diary data from a remote location. Thereby, it becomes possible to automatically record important moments during child-rearing and generate a rich-emotion diary. Also, the recorded data is safely stored via the cloud, facilitating access and sharing from multiple terminals.

[0307] The "means for inputting voice" is a device or technology that acquires an external voice signal and prepares it for subsequent processing.

[0308] The "means for converting into text data" is a technology that converts the acquired voice signal into character information using voice recognition technology.

[0309] The "means for performing natural language generation to generate diary data" is a technology that automatically generates diary-form data in a language form that is easy for humans to understand based on text data.

[0310] The "means for inputting an image, analyzing the image to identify a specific event" is a technology that captures and analyzes image data to identify a specific event or situation occurring within the image.

[0311] The "means for analyzing emotion and adding emotion information to the diary data" is a technology that analyzes the characteristics of voice data, estimates the emotional state of the speaker, and then adds information regarding that emotion to the diary data.

[0312] The "means for a mobile body to patrol within a home and record voice and images in real time" is a technology that enables a device capable of moving within a home to constantly acquire voice and image data and record it in real time.

[0313] The "means for transmitting the generated diary data to the cloud" is a technology that saves locally generated diary data to a cloud storage system via the internet.

[0314] The "means for enabling access to the diary data from a remote location" is a technology that allows diary data stored in the cloud to be viewed safely and easily from other locations, regardless of physical distance.

[0315] "Means of synchronizing with other computing devices" refers to technologies that maintain data consistency between different devices while updating and sharing the same data in real time.

[0316] "Means of providing a display device to a user" refers to technology that includes a screen or device for the user to view and edit generated data.

[0317] The system that realizes this application example consists of a childcare support robot that operates continuously in the home, recording and analyzing audio and images during childcare. The server uses a mobile body equipped with an audio input device and an image capture device to record parent-child interactions in real time.

[0318] The voice input device captures parental speech and child vocalizations and sends them to a server. The server uses speech recognition software to convert the voice data into text data. The "speech_recognition" library is used in this process, and the converted text data is further processed into a diary-style document by a natural language generation engine.

[0319] Next, the server receives image data captured by the image capture device and uses image analysis algorithms to recognize specific events. In this process, facial recognition and behavioral recognition technologies are utilized to identify important moments in childcare. The identified events are then added to the appropriate locations in the diary data.
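
One way to add identified events to the appropriate locations in the diary data is to match timestamps. The sketch below assumes, purely for illustration, that each diary segment carries a start time and that each recognized event arrives as a `(time, label)` pair; `Segment` and `attach_events` are hypothetical names, and the specification does not state how locations are chosen.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Segment:
    start_s: float                      # when this part of the diary began
    text: str
    events: list[str] = field(default_factory=list)


def attach_events(segments: list[Segment], events: list[tuple[float, str]]) -> None:
    """Attach each (time, label) event to the segment in progress at that time."""
    if not segments:
        return
    segments.sort(key=lambda s: s.start_s)
    starts = [s.start_s for s in segments]
    for time_s, label in sorted(events):
        # Index of the last segment that started at or before the event;
        # events earlier than every segment fall back to the first one.
        idx = max(bisect_right(starts, time_s) - 1, 0)
        segments[idx].events.append(label)
```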

[0320] The server also performs emotion analysis based on the voice data to estimate the emotional state of the parent or child. Using the "emotion_analysis" module, it analyzes emotions from the tone and speed of the voice and adds that information to the diary data.

[0321] The generated diary data is securely stored in cloud storage. Through the "cloud_storage_module," the data can be accessed from multiple devices, allowing users to refer to their records from anywhere and share them with family members as needed.
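
The multi-device access described here can be illustrated with a small in-memory model in which the store stamps every write with a version number and each device pulls only what changed since its last synchronization. `CloudStore` and `Device` are hypothetical classes written for this sketch; they do not represent the interface of the "cloud_storage_module" named above, which the specification does not detail.

```python
from dataclasses import dataclass, field


@dataclass
class CloudStore:
    version: int = 0
    # entry id -> (version at last write, diary text)
    entries: dict[str, tuple[int, str]] = field(default_factory=dict)

    def put(self, entry_id: str, text: str) -> int:
        """Store or overwrite an entry and return the new store version."""
        self.version += 1
        self.entries[entry_id] = (self.version, text)
        return self.version

    def changes_since(self, known_version: int) -> dict[str, str]:
        """Return entries written after the given version."""
        return {k: t for k, (v, t) in self.entries.items() if v > known_version}


@dataclass
class Device:
    known_version: int = 0
    local: dict[str, str] = field(default_factory=dict)

    def sync(self, store: CloudStore) -> None:
        """Pull changes made since the last sync and remember the store version."""
        self.local.update(store.changes_since(self.known_version))
        self.known_version = store.version
```

The sketch covers only downloads; conflict handling for simultaneous edits on two devices is outside its scope.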

[0322] For example, if a user says, "Today my child drew a picture for the first time," and takes a photo of that moment, the system will automatically generate a diary entry such as, "Today my child drew a picture with all their heart for the first time. It was a very touching moment, and the whole family celebrated." This allows users to record precious moments in parenting in detail without having to take any special action.

[0323] Examples of prompts to input into a generative AI model:

[0324] "Identify touching moments related to childcare from audio data and generate emotionally rich entries in diary format."

[0325] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0326] Step 1:

[0327] The server acquires parental speech and child vocalizations in real time from the voice input device installed in the terminal. When an audio signal is input, the server passes the audio signal to speech recognition software, which converts it into text data. At this time, the audio data is sampled, features are extracted, and then it is converted into a string.

[0328] Step 2:

[0329] Based on the converted text data, the server uses a natural language generation engine to generate a diary-style document. Taking the text data as input, and based on a language model, it outputs human-readable text. This creates a diary documenting events related to childcare.

[0330] Step 3:

[0331] The server receives image data acquired from the terminal's image capture device. The input images are processed using image analysis algorithms. The server performs face recognition and behavior recognition to identify specific events. The identified event information is incorporated into the diary data, and appropriate explanations are added.

[0332] Step 4:

[0333] Based on the audio data, the server performs emotion analysis. The audio data is input, and by analyzing its acoustic characteristics, the emotional state of the parent or child is estimated. As a result of the analysis, emotional information such as joy or surprise is generated and added to the diary data.

[0334] Step 5:

[0335] The server uploads the generated diary data to cloud storage. The diary data is then securely transmitted via a cloud API. The diary data stored in the cloud is synchronized and accessible from other devices.

[0336] Step 6:

[0337] Users can access their cloud-based diary data from another device as needed. By starting their device and accessing cloud storage, the saved diary data is downloaded and displayed. This allows users to review and share their childcare records with their family from anywhere.
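
Step 1 of the flow above states that the audio is sampled and that features are extracted before conversion into a string. A minimal sketch of such a front end is shown below: it splits the signal into frames and computes a mean-square energy and a zero-crossing rate per frame. The function names and the default frame length and hop (400 and 160 samples, which would correspond to 25 ms and 10 ms at an assumed 16 kHz sampling rate) are illustrative assumptions; the specification does not say which features the speech recognition software uses.

```python
def frames(samples: list[float], frame_len: int, hop: int) -> list[list[float]]:
    """Split a signal into fixed-length frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def zero_crossing_rate(frame: list[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def frame_features(samples: list[float], frame_len: int = 400, hop: int = 160) -> list[tuple[float, float]]:
    """Return (mean-square energy, zero-crossing rate) for every frame."""
    return [(sum(x * x for x in f) / frame_len, zero_crossing_rate(f))
            for f in frames(samples, frame_len, hop)]
```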

[0338] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0339] This invention incorporates an emotion engine that analyzes the user's emotions, in addition to a system that automatically creates childcare records using voice and image data. This system allows users to record important moments in childcare and save them, including the emotions they felt at the time, without requiring any special operation.

[0340] When a user speaks about childcare, the device activates its voice input function and records the audio data. The device then transfers the recorded audio data to a server. The server uses speech recognition technology to convert the audio data into text. A natural language generation engine then processes this text data and generates text in a diary format.

[0341] In addition, the system is equipped with an emotion engine to recognize the user's emotions. Based on voice data, the emotion engine analyzes the user's voice quality, tone, and speed to determine their emotional state. This emotional information is then added to the diary data, allowing the user to reflect on their emotions at that time. Furthermore, the emotion engine extracts intentions and emotions from text data and adds appropriate nuances to the writing.

[0342] For example, if a user says, "I was so happy today because my child rode a bicycle for the first time," the emotion engine extracts the emotions of "joy" and "sense of accomplishment" from that voice and reflects them in the diary data. This generates a diary entry that vividly conveys emotions such as, "Today, my child was able to ride a bicycle by themselves for the first time, and as a parent, I was extremely happy."

[0343] When a user takes a photo with their device, the image data is sent to a server, where an image recognition algorithm identifies important events and people. The image recognition results are incorporated into the diary data as visual elements and saved together with the text.

[0344] All generated data is stored in the cloud and synchronized with other devices. This synchronization allows users to access their diaries from various devices and easily view and edit complete childcare records, including diagrams, photos, and text.

[0345] Thus, the present invention provides a system that allows for the preservation of deeper memories by analyzing and recording the emotional elements in childcare records.

[0346] The following describes the processing flow.

[0347] Step 1:

[0348] When a user speaks about their experiences with childcare, the device activates its voice input function and records the user's voice. The recorded audio data is saved to the device immediately.

[0349] Step 2:

[0350] Once recording is complete, the device transfers the audio data to the server. During data transfer, the audio data is encrypted to protect privacy.

[0351] Step 3:

[0352] The server receives the audio data, which is then analyzed by a speech recognition module to convert it into appropriate text data. Acoustic and linguistic models are used during this process to ensure accurate text transcription.

[0353] Step 4:

[0354] The server passes text data to a natural language generation engine, which then generates meaningful sentences in a diary format. This process constructs sentences that include contextually relevant content, ensuring the text is easily readable.

[0355] Step 5:

[0356] The server analyzes the received audio data using an emotion engine to identify emotions from the user's voice tone and speaking style. The identified emotion information is then added as emotion labels to the diary data produced by natural language generation.

[0357] Step 6:

[0358] When a user takes a photo while taking care of their child, the device sends the image data to the server. Before uploading, the photos undergo moderate compression to optimize their size while ensuring high quality.

[0359] Step 7:

[0360] The server uses image recognition technology to analyze the transmitted images. Here, it recognizes specific events and situations through facial recognition and object detection, and organizes the content deemed important to reflect in the diary.

[0361] Step 8:

[0362] The server generates the final diary data, saving the included text, images, and sentiment information as a single document to the cloud. During this process, data integrity is verified, and the data is synchronized in real time across the user's other devices.

[0363] Step 9:

[0364] Users can view the generated diary data from any device. The application provides a user-friendly interface that allows users to add comments and edit records as needed.
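
Step 8 above mentions that data integrity is verified when the combined document is saved. One common way to do this, shown here as a sketch under the assumption that the document is serialized as JSON, is to compute a SHA-256 digest before upload and compare it after transfer. The function names and the JSON layout are illustrative and are not taken from the specification.

```python
import hashlib
import json


def serialize(document: dict) -> bytes:
    """Produce a canonical byte form: sorted keys and fixed separators."""
    return json.dumps(document, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def checksum(document: dict) -> str:
    """SHA-256 digest of the canonical form, as a hex string."""
    return hashlib.sha256(serialize(document)).hexdigest()


def verify(payload: bytes, expected: str) -> bool:
    """Check that received bytes match the digest computed before upload."""
    return hashlib.sha256(payload).hexdigest() == expected
```

Canonical serialization matters here because two devices may build the same document with keys in a different order, and the digest should not change in that case.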

[0365] (Example 2)

[0366] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0367] Keeping a record of child-rearing typically requires parents to manually write diaries or compile photos into albums, which is time-consuming and laborious. Furthermore, it's difficult to reflect emotions and feelings in these records in real time. Additionally, easily accessing and editing the generated records across multiple devices is challenging.

[0368] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0369] In this invention, the server includes a device for inputting voice, a device for converting the voice into text data, and a device for generating recorded data by performing natural language generation based on the text data. This allows users to record important moments in childcare through voice input without having to manually enter detailed information themselves, and to automatically generate records that accurately reflect emotions and content. Furthermore, this data is managed on the cloud, enabling easy synchronization and editing across various devices.

[0370] A "speech input device" is a device that has the function of capturing a user's speech as a digital signal.

[0371] A "device that converts to text data" is a device that analyzes audio signals and generates corresponding strings of characters.

[0372] A "device that generates natural language and records data" is a device that automatically creates human-readable text based on character data and saves it as a record.

[0373] A "device that analyzes images to identify specific events" is a device that analyzes input image data and has the function of identifying important events or elements from it.

[0374] A "device that analyzes emotions and adds emotional information to the recorded data" is a device that extracts emotions from voice or text data and adds them to the recorded data, thereby giving the content emotional nuances.

[0375] A "device that enhances recorded data as a visual element" is a device that has the function of making information recorded using image data more visually appealing and richer in content.

[0376] An "information management infrastructure" is a network infrastructure for safely and efficiently storing and managing digital data.

[0377] An "information processing terminal" is an electronic device capable of creating, editing, saving, and transmitting recorded data.

[0378] A "user interface" is software or hardware that provides screens and operating methods used by users to access and manipulate digital data.

[0379] This invention is a system for automatically generating childcare records, utilizing voice input, emotion analysis, and image recognition technologies. This system allows users to record daily childcare moments and save them, including the emotions associated with them, without requiring any special operation.

[0380] The system works as follows: When a user speaks about childcare, the device uses an audio input device to record the audio as a digital signal. Information processing terminals such as smartphones and tablets are used for this purpose. The recorded audio data is transferred from the device to the server via a secure communication protocol.

[0381] The server converts the received audio data into text data using speech recognition software. Specifically, a general speech recognition API is used. This text data is processed by a natural language generation engine and generated as recorded data. A general generative AI model is applied to the natural language generation engine used here.

[0382] Furthermore, the server uses an emotion engine to analyze emotions from the audio data. This process determines the user's emotions based on factors such as voice quality, tone, and speed, and adds this information to the recorded data. Emotions are also extracted from text data, and appropriate nuances are added to the text.
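
The text-based part of this analysis could, in its simplest form, be a keyword lookup. The sketch below is a deliberately small example: `EMOTION_LEXICON` is a hypothetical word list invented for illustration, and a real emotion engine would more plausibly rely on a trained model than on fixed keywords.

```python
import re
from collections import Counter

# Hypothetical lexicon mapping cue words to emotion labels (illustration only).
EMOTION_LEXICON = {
    "happy": "joy", "glad": "joy", "delighted": "joy",
    "proud": "accomplishment", "finally": "accomplishment",
    "worried": "anxiety", "scared": "anxiety",
    "surprised": "surprise",
}


def text_emotions(text: str) -> list[str]:
    """Return emotion labels found in the text, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(EMOTION_LEXICON[w] for w in words if w in EMOTION_LEXICON)
    return [label for label, _ in counts.most_common()]
```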

[0383] When a user takes childcare-related photos with their device, the image data is sent to a server. The server uses an image recognition algorithm to identify important events and people. The identified information is incorporated into the recorded data as visual information and stored together with text.

[0384] All data is stored in the cloud and synchronized across different devices. This allows users to easily access, view, and edit recorded data from their home PC or devices while on the go.

[0385] For example, if a user says, "I was so happy today because my child rode a bicycle for the first time," the emotion engine extracts the emotions of "joy" and "accomplishment" from this audio and reflects them in the recorded data. This generates emotionally rich sentences such as, "Today, my child rode a bicycle by themselves for the first time. I was so happy."

[0386] An example of a prompt message is: "Audio data: Today my child was happy to ride a bicycle for the first time. Emotion: Joy, sense of accomplishment. Please summarize this in a diary entry."
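
A prompt of this shape can be assembled mechanically from the transcript and the extracted emotion labels. The following sketch reproduces the layout of the example above; `build_prompt` and its fallback wording for an empty emotion list are assumptions introduced for illustration.

```python
def build_prompt(transcript: str, emotions: list[str]) -> str:
    """Format the transcript and emotion labels in the layout of the example prompt."""
    emotion_text = ", ".join(emotions) if emotions else "none detected"
    return (
        f"Audio data: {transcript.strip()} "
        f"Emotion: {emotion_text}. "
        "Please summarize this in a diary entry."
    )
```

For instance, calling `build_prompt("Today my child was happy to ride a bicycle for the first time.", ["Joy", "sense of accomplishment"])` yields the example prompt given in this paragraph.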

[0387] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0388] Step 1:

[0389] When a user begins speaking about childcare, the device uses its voice input device to record the audio as a digital signal in real time. The input data is the user's voice, which is then recorded and output as digital audio data. Smartphones and tablets are useful for this process.

[0390] Step 2:

[0391] The terminal transfers recorded audio data to the server using a secure communication protocol. The input here is digitized audio data, and the output is the secure transmission of data to the server. Appropriate encryption technology is used to ensure that the audio data reaches the server securely.

[0392] Step 3:

[0393] The server converts the received audio data into text data using speech recognition software. The input is the audio data stored on the server, and the output is the text data generated by speech recognition. In this step, the audio signal is analyzed using a speech recognition API and the corresponding text is generated.

[0394] Step 4:

[0395] The server processes the generated text data using a natural language generation engine and generates conversational text as recorded data. The input is text data generated by speech recognition, and the output is text in natural language format. Text generation based on prompt sentences is performed using a generative AI model.

[0396] Step 5:

[0397] The server uses an emotion engine to analyze emotions from audio data and adds emotional information to the recorded data. The input here is audio and text data, and the output is recorded data with added emotional information. By analyzing the voice tone and speed, the server identifies the user's current emotions and reflects them in the recording.

[0398] Step 6:

[0399] The user sends image data captured with their device to the server. The input is the captured image data, and the output is the transmission of data to the server. This is done easily and quickly because it utilizes cloud-based data sharing.

[0400] Step 7:

[0401] The server uses image recognition algorithms to analyze transmitted image data and identify specific events or important individuals. The input is image data, and the output is recognized event information. Image content is filtered, tagged, and integrated into the recorded data.

[0402] Step 8:

[0403] The server stores all data in the cloud and synchronizes it with different data processing terminals. Input is the final recorded data, and output is storage in the cloud and data synchronization to different terminals. This allows users to access and edit data from anywhere.

[0404] (Application Example 2)

[0405] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0406] In childcare, there is a need for parents to record important moments without missing them and to reflect their emotions and thoughts in real time. Furthermore, there is a need for systems that reduce the emotional and cognitive burden on parents and support the childcare process. Conventional technologies have limited capabilities for automatic generation of childcare records and emotion analysis, failing to fully utilize the potential of multi-functional digital devices.

[0407] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0408] In this invention, the server includes means for inputting voice, means for converting voice into text data, means for generating diary data by performing natural language generation based on the text data, means for inputting images and analyzing images to identify specific events, means for analyzing emotions based on voice and adding emotional information to the diary data, means for analyzing the overall nature of the diary data and providing emotional and cognitive support using a new information processing device, and means for automatically generating records using multiple digital devices and saving childcare records from multiple perspectives. As a result, parents can richly record moments of childcare and confidently look back on emotions and events without requiring any special operation.

[0409] "Means of inputting voice" refers to a function used by an information processing device to accurately acquire voice data.

[0410] "Means of converting to text data" refers to the process of converting acquired audio data into text format.

[0411] "A means of generating diary data by performing natural language generation" refers to a technology that automatically creates a structured diary using natural language processing based on text data.

[0412] "A means of inputting an image, analyzing the image, and identifying a specific event" refers to an algorithm that accurately recognizes image data and identifies the events or situations captured within it.

[0413] "A means of analyzing emotions based on voice and adding emotional information to diary data" refers to the process of extracting emotions from voice data and integrating them into diary data.

[0414] A "new information processing device" is a device designed using the latest technology and possesses high data processing capabilities.

[0415] "A means of automatically generating records using digital devices and saving childcare records from multiple perspectives" refers to a method of automatically collecting and saving childcare-related information using multiple electronic devices.

[0416] This invention is a system that automatically generates childcare records and reduces the emotional and cognitive burden on users. This system integrates a voice input device, a voice processing server, an image recognition device, a cloud synchronization system, and an emotion analysis engine to support parents.

[0417] The user inputs conversations about childcare using a device equipped with speech recognition capabilities. The device sends the audio data to a cloud server, which uses the Google Cloud Speech-to-Text API to convert the audio into text data. The resulting text data is then processed using OpenAI's GPT model to perform natural language generation, creating an emotionally rich childcare diary.

[0418] During this process, the Microsoft Azure Text Analytics API analyzes the tone and speed of the voice to extract the user's emotions. This emotional information is integrated into the diary data, allowing the user to later reflect on their feelings at that time. Additionally, photos taken by the user are analyzed by Amazon Rekognition to identify specific events and people. This ensures that the record includes visual elements as well.
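
For the image-analysis part of this paragraph, the sketch below shows how label results might be turned into a sentence for the diary. No service is called: the input dictionary is assumed to follow the general shape of an Amazon Rekognition label-detection response (a `Labels` list whose items carry `Name` and `Confidence`), and `labels_to_sentence`, the 80.0 threshold, and the output wording are hypothetical choices made for illustration.

```python
def labels_to_sentence(response: dict, min_confidence: float = 80.0) -> str:
    """Keep labels at or above the confidence threshold and join them into text."""
    names = [
        label["Name"]
        for label in response.get("Labels", [])
        if label.get("Confidence", 0.0) >= min_confidence
    ]
    if not names:
        return ""  # nothing reliable enough to mention in the diary
    return "The photo shows: " + ", ".join(names) + "."
```

Filtering by confidence before writing to the diary keeps low-certainty detections out of a record that the user may later treat as a factual memory.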

[0419] All generated data is stored in the cloud using Firebase and can be synchronized with other digital devices. This allows users to view and edit their parenting records from various devices. The system enriches parents' parenting experience by recording daily moments of parenting in real time and providing emotional support.

[0420] For example, if a user voice-inputs "Today my child rode a bicycle for the first time," the system can analyze the content, extract "joy" and "sense of accomplishment" using its emotion engine, and reflect them in the diary. An example of a prompt would be, "My 3-year-old child sang a song by themselves for the first time. Please generate a record that reflects the joy and surprise the parent felt at that moment."
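As a non-limiting illustration, the prompt described above could be assembled from the recognized text and the extracted emotion labels as in the following minimal Python sketch. The function name `build_diary_prompt` and its parameters are assumptions introduced here for illustration and do not appear in the disclosure.

```python
def build_diary_prompt(transcript: str, emotions: list[str]) -> str:
    """Combine the recognized text and emotion labels into one prompt string."""
    if emotions:
        feeling = " and ".join(emotions)
        instruction = (
            f"Please generate a record that reflects the {feeling} "
            "the parent felt at that moment."
        )
    else:
        # No emotion labels available: fall back to a neutral instruction.
        instruction = "Please generate a diary-style record of this moment."
    return f"{transcript.strip()} {instruction}"


prompt = build_diary_prompt(
    "My 3-year-old child sang a song by themselves for the first time.",
    ["joy", "surprise"],
)
print(prompt)
```

With the inputs shown, the resulting string matches the example prompt quoted in the paragraph above.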

[0421] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0422] Step 1:

[0423] The user inputs voice using a device. The input voice data is temporarily stored on the device and then sent to a cloud server for speech recognition. Here, the voice is captured as data without any user intervention.

[0424] Step 2:

[0425] The server uses the Google Cloud Speech-to-Text API to convert the transmitted audio data into text data. It parses the audio data and converts its content into structured text format. The input here is audio data, and the output is text data.

[0426] Step 3:

[0427] The server inputs the acquired text data into an OpenAI GPT model and performs natural language generation. The generative AI model generates detailed and emotionally rich diary-style entries based on the input text. Here, the input is text data, and the output is diary entries.

[0428] Step 4:

[0429] The server uses the Microsoft Azure Text Analytics API to perform sentiment analysis on the audio data, extracting emotions from the user's tone and speed of voice. This analyzes the emotional state, and the resulting emotional information is directly added to the diary data. The input is audio data, and the output is data containing emotional information.

[0430] Step 5:

[0431] The user takes an image with their device, and the image data is sent to a cloud server. A server using Amazon Rekognition analyzes the image to identify specific events or people. The analysis results are converted into text format and integrated as visual information into the diary data. The input is image data, and the output is the analyzed information used in the diary.

[0432] Step 6:

[0433] The server stores the generated diary data using Firebase cloud storage and synchronizes it with other digital devices as needed. This allows users to view and edit their childcare records from any device. The input is all the data that makes up the diary, and the output is the diary data stored in cloud storage.
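The six steps above can be sketched as a single orchestration function. This is a minimal sketch under stated assumptions: the external services (speech recognition, generative model, emotion analysis, image analysis, cloud storage) are passed in as plain callables so that no vendor API is invoked, and the names `DiaryData` and `run_pipeline` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DiaryData:
    text: str
    emotions: list[str] = field(default_factory=list)
    image_notes: list[str] = field(default_factory=list)


def run_pipeline(
    audio: bytes,
    images: list[bytes],
    transcribe: Callable[[bytes], str],             # Step 2: speech to text
    generate: Callable[[str], str],                 # Step 3: diary generation
    analyze_emotion: Callable[[bytes], list[str]],  # Step 4: emotion analysis
    describe_image: Callable[[bytes], str],         # Step 5: image analysis
    store: Callable[[DiaryData], None],             # Step 6: cloud storage
) -> DiaryData:
    transcript = transcribe(audio)
    diary = DiaryData(text=generate(transcript))
    diary.emotions = analyze_emotion(audio)
    diary.image_notes = [describe_image(img) for img in images]
    store(diary)
    return diary


# Stand-in services so the sketch runs offline.
saved: list[DiaryData] = []
result = run_pipeline(
    audio=b"fake-audio",
    images=[b"fake-image"],
    transcribe=lambda a: "Today my child rode a bicycle for the first time",
    generate=lambda t: f"Diary: {t}.",
    analyze_emotion=lambda a: ["joy", "sense of accomplishment"],
    describe_image=lambda i: "a child on a bicycle",
    store=saved.append,
)
print(result)
```

Passing the services as arguments keeps the flow of inputs and outputs between steps visible and allows each service to be replaced independently.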

[0434] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0435] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0436] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0437] [Third Embodiment]

[0438] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0439] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0440] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0441] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0442] The microphone 238 receives instructions from the user 20 by capturing the voice of the user 20. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0443] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0444] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
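The disclosure states only that the exchange is performed "in a secure manner" and does not name a protocol. As one possible illustration, assuming TLS is used between the terminal and the server, a client-side context could be configured as in the following Python sketch. The function name `make_client_tls_context` is hypothetical.

```python
import ssl


def make_client_tls_context() -> ssl.SSLContext:
    """Return a client-side TLS context with certificate and hostname checks."""
    ctx = ssl.create_default_context()            # verifies server certificates
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocol versions
    return ctx


ctx = make_client_tls_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

A context of this kind would then be used to wrap the socket that connects communication interface 44 to communication interface 26.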

[0445] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0446] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0447] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0448] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0449] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0450] This invention provides a system that automatically creates childcare records using voice and images, eliminating the need for parents to manually record important moments in childcare while allowing them to save and manage detailed data.

[0451] When a user speaks about childcare-related events, the device captures the user's voice using its voice input function. The recorded audio is first acquired as audio data in real time by the device and then sent to the server. This system uses a speech recognition algorithm to convert the audio data into text data.

[0452] After the server converts the audio to text, a natural language generation engine runs to create a diary-style document based on the converted text data. At this stage, grammatical consistency and readability are considered, and the text is refined to be easy for the user to understand when reviewing it later.

[0453] Furthermore, photos related to childcare taken by the user are transferred from the device to the server. The server analyzes these images, performing facial recognition and behavioral recognition to identify specific events. When important moments are recognized, corresponding explanations are added to the diary data.

[0454] The server then performs emotion analysis on the audio data. By analyzing the tone and speed of the voice, emotions such as joy and surprise are reflected in the diary data. This allows users to have richer records with emotional nuances.
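The disclosure does not specify how tone and speed are measured. The following Python sketch is a toy illustration only: it uses words per second as a proxy for speed and root-mean-square amplitude as a stand-in for vocal intensity, and the thresholds and the labels "excited" and "calm" are hypothetical values chosen for the example, not parameters from the disclosure.

```python
import math


def speaking_rate(transcript: str, duration_s: float) -> float:
    """Words per second, a crude proxy for the speed of the voice."""
    return len(transcript.split()) / duration_s


def rms_level(samples: list[float]) -> float:
    """Root-mean-square amplitude, a crude proxy for vocal intensity."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def label_arousal(rate: float, level: float,
                  rate_threshold: float = 3.0,
                  level_threshold: float = 0.3) -> str:
    """Hypothetical rule: fast and loud speech is tagged 'excited'."""
    if rate >= rate_threshold and level >= level_threshold:
        return "excited"
    return "calm"


rate = speaking_rate("Today my child rode a bicycle for the first time", 2.5)
level = rms_level([0.5, -0.5, 0.5, -0.5])
print(rate, level, label_arousal(rate, level))
```

A practical emotion analysis would typically rely on a trained model such as the emotion identification model 59 rather than fixed thresholds.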

[0455] Ultimately, all data is securely stored in the cloud and synchronized across the user's other devices. This allows users to access their diaries from any device and edit them as needed. Based on this, users can review their records on a daily basis and share them with their families.

[0456] For example, if a user says, "Today my child rode a bicycle for the first time," and takes a photo, the natural language generation engine will generate a detailed diary entry such as, "Today, my child rode a bicycle for the first time, and here they are, pedaling with all their might. It was a day when family and friends celebrated with them." This entire process is automated, and the user does not need to do anything special.

[0457] The following describes the processing flow.

[0458] Step 1:

[0459] When a user speaks about events related to childcare, the device activates its voice input function and records the user's voice. The device suppresses ambient noise and saves the recording as clear audio data.

[0460] Step 2:

[0461] The device transfers the recorded audio data to the server. During the transfer, encryption is applied to protect the data, and a secure communication channel is used.

[0462] Step 3:

[0463] The server analyzes the received audio data using a speech recognition system and converts it into linguistic text. This process uses a language model to correct any errors in speech recognition.

[0464] Step 4:

[0465] The server sends text data based on voice input to a natural language generation engine. The engine generates a contextual diary entry based on the text and adjusts it to accurately reflect the user's intent.

[0466] Step 5:

[0467] The device receives images from the user and sends the image data to the server. The images are converted to the appropriate format beforehand and sent while maintaining high resolution.

[0468] Step 6:

[0469] The server uses an image recognition algorithm to analyze the received images and identify people's faces or specific scenes. When important events are extracted, information about them is added to the diary.

[0470] Step 7:

[0471] The server performs sentiment analysis on the voice data. This allows it to determine emotional nuances from the user's tone of voice and phrasing, and incorporate emotional elements into the generated diary.

[0472] Step 8:

[0473] The server saves the final diary data to a cloud database. The data is synchronized with other user devices in the cloud, establishing a state where users can access their diaries from any device.

[0474] Step 9:

[0475] Users can view the generated diary data on their device and edit or add comments as needed. Completed diaries are managed within the app and can be viewed at any time.
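The viewing, editing, and commenting described in Step 9 can be illustrated with a small record type. This is a minimal sketch; the class name `DiaryEntry`, its fields, and the revision counter are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DiaryEntry:
    text: str
    comments: list[str] = field(default_factory=list)
    revision: int = 1

    def edit(self, new_text: str) -> None:
        """Replace the generated text and bump the revision counter."""
        self.text = new_text
        self.revision += 1

    def add_comment(self, comment: str) -> None:
        """Attach a user comment without changing the generated text."""
        self.comments.append(comment)


entry = DiaryEntry("Today, my child rode a bicycle for the first time.")
entry.add_comment("Grandma was watching too.")
entry.edit("Today, my child rode a bicycle for the first time in the park.")
print(entry.revision, entry.comments)
```

Keeping comments separate from the generated text lets the user annotate an entry without altering what the generation engine produced.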

[0476] (Example 1)

[0477] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0478] Recording and managing important moments in childcare in detail is a significant burden for parents. In particular, there is a need for a method that allows important events to be recorded easily using audio and images, and that automatically generates diary-style data organized by emotion and event. Conventional technologies perform voice recognition, image analysis, and emotion analysis independently of one another, and no system integrates these processes. A system is needed to address these challenges and manage childcare records more simply and comprehensively.

[0479] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0480] In this invention, the server includes a device that converts speech into text information, a device that generates text and records based on the text information, a device that analyzes images to identify specific activities, and a device that performs emotion analysis and adds emotional information to the records. This makes it possible to automatically generate detailed and emotionally nuanced childcare records based on speech and images, thereby reducing the burden on parents.

[0481] A "speech input device" refers to an input device that captures a user's speech and converts it into electrical signals.

[0482] A "device that converts to text information" refers to a processing device that analyzes audio signals and generates corresponding text data.

[0483] A "device that generates text and records data" refers to a device that uses natural language processing to create a consistent written record based on input text data.

[0484] A "device that analyzes images to identify specific activities" refers to a processing device that analyzes input image data to identify and recognize specific situations or actions.

[0485] A "device that performs emotion analysis and adds emotional information to recorded data" refers to a device that extracts emotional information from audio and images and reflects it in the recorded data.

[0486] A "remote storage medium" refers to a remote storage system accessible via the internet, used for storing and sharing recorded data.

[0487] "A function that provides an operation screen to the user" refers to a means of providing an interface for the user to view, edit, and manage recorded data.

[0488] This example illustrates an embodiment of a system that automatically creates childcare records using voice and images. The terminal, the server, and the user each play a role in the system, and together they achieve comprehensive record generation.

[0489] The device captures the user's voice using its voice input function. Mobile devices such as smartphones and tablets are used for this purpose. The captured voice is acquired as digital data by the device and transmitted to a server via a communication network. The voice data is then processed on the server.

[0490] The server converts audio data into text data using a speech recognition algorithm. Specifically, it utilizes a common speech recognition service for this conversion. Next, the server uses a natural language generation engine to create a diary document based on this text data. The generative AI model generates text that takes grammatical consistency and readability into consideration. This process allows users to easily obtain a detailed text record from audio.

[0491] Furthermore, photos related to childcare taken by the user are sent to the server via the device. The server analyzes these images and identifies specific events through facial recognition and activity recognition. Modern image recognition tools are used for this image analysis. Descriptions corresponding to the identified events are automatically added to the diary entry.
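As an illustration of how identified events might be turned into a description for the diary entry, the following Python sketch filters image analysis results by confidence and formats them as a sentence. The input format (label and confidence pairs), the function name `describe_detections`, and the threshold of 0.8 are assumptions; the disclosure does not specify the output format of the image recognition tool.

```python
def describe_detections(labels: list[tuple[str, float]],
                        min_confidence: float = 0.8) -> str:
    """Turn (label, confidence) pairs into one sentence for the diary."""
    kept = [name for name, score in labels if score >= min_confidence]
    if not kept:
        return ""  # nothing reliable enough to mention
    return "The photo shows: " + ", ".join(kept) + "."


note = describe_detections([("child", 0.98), ("bicycle", 0.91), ("dog", 0.40)])
print(note)
```

Low-confidence labels are dropped so that uncertain detections are not written into the record.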

[0492] In addition, the server performs sentiment analysis using the audio data. Through this analysis, emotions are extracted from the tone and speed of the voice, and these are reflected in the recorded document. Dedicated software for sentiment analysis is used for this purpose.

[0493] Ultimately, all recorded data is securely stored on cloud-based storage. This allows users to access the data from any device and review and edit the records as needed.

[0494] As a concrete example, consider a scenario where a user says, "Today my child learned to ride a bicycle for the first time," and takes a photo. In this case, an example prompt would be "The day my child learned to ride a bicycle for the first time." The generative AI model would then generate a record stating, "Today, my child learned to ride a bicycle for the first time, pedaling with all their might. It was a day celebrated by family and friends." This entire process is fully automated, requiring no special action from the user.

[0495] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0496] Step 1:

[0497] The device captures the user's voice. It takes the user's speech as input via a microphone and generates digital audio data. Since this audio data cannot be used directly as text, it is first sent to the server in digital format.

[0498] Step 2:

[0499] The server processes the received audio data through a speech recognition algorithm. The input is audio data sent from the terminal, which is converted into text data by the speech recognition software. This process analyzes the audio waveform and produces output as a sequence of words.

[0500] Step 3:

[0501] The server uses a generative AI model to convert text data into diary-style entries. The input is text data obtained from speech recognition, and natural language generation technology is used for data processing to output a grammatically consistent and emotionally appropriate record document.

[0502] Step 4:

[0503] The user sends images they have taken from their device to the server. The input is image data acquired by the device. The device compresses the image data into an appropriate format and transfers it to the server via the communication network.

[0504] Step 5:

[0505] The server performs image analysis to identify specific events. The input is image data sent from the terminal, and through data processing using image recognition technology, it extracts features of events and situations to identify specific activities and outputs the results.

[0506] Step 6:

[0507] The server performs emotion analysis based on the audio data. The input is the initial audio data, and emotion analysis software analyzes the tone and speed of the voice, outputting emotional information such as joy, anger, sadness, and pleasure, and reflecting the results in the diary document.

[0508] Step 7:

[0509] The server saves the final record data to the cloud and synchronizes it with other devices. The input is the final record document generated on the server, and the output is securely stored in cloud storage. Furthermore, the synchronization process is performed automatically, and users can access the records from various devices.
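The synchronization in Step 7 can be sketched as a merge between the cloud copy and a device copy. The disclosure does not state how conflicting edits are resolved; the sketch below assumes a simple "most recent update wins" policy, and the names `Record` and `merge` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    entry_id: str
    text: str
    updated_at: float  # seconds since epoch, set by the editing device


def merge(cloud: dict[str, Record], device: dict[str, Record]) -> dict[str, Record]:
    """Keep, for each entry, whichever copy was updated most recently."""
    merged = dict(cloud)
    for entry_id, rec in device.items():
        current = merged.get(entry_id)
        if current is None or rec.updated_at > current.updated_at:
            merged[entry_id] = rec
    return merged


cloud = {"d1": Record("d1", "First bicycle ride.", 100.0)}
phone = {
    "d1": Record("d1", "First bicycle ride in the park.", 150.0),
    "d2": Record("d2", "Drew a picture.", 120.0),
}
synced = merge(cloud, phone)
print(sorted(synced))
```

Under this policy the result is the same whichever side initiates the merge, provided no two copies of an entry share the same timestamp.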

[0510] (Application Example 1)

[0511] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0512] For many parents, recording important moments in childcare and writing detailed childcare diaries is a cumbersome task. In particular, it is difficult to record every moment of childcare without missing any and to accurately capture and preserve the nuances of emotions. Furthermore, sharing the generated records with other family members requires synchronization and access across different devices. This invention aims to solve these problems when raising children at home and to generate and share childcare records more easily and effectively.

[0513] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0514] In this invention, the server includes means for inputting voice, means for converting the voice into text data, means for generating diary data by performing natural language generation based on the text data, means for inputting images and analyzing the images to identify specific events, means for analyzing emotions based on the voice and adding emotional information to the diary data, means for a mobile device to patrol the home and record voice and images in real time, means for transmitting the generated diary data to the cloud, and means for making the diary data accessible from a remote location. This makes it possible to automatically record important moments during childcare and generate emotionally rich diaries. Furthermore, the recorded data is securely stored via the cloud and can be easily accessed and shared from multiple devices.

[0515] "Means for inputting audio" refers to a device or technology that acquires an audio signal from an external source and prepares it for subsequent processing.

[0516] "Means of converting to text data" refers to the technology of converting acquired audio signals into text information, utilizing speech recognition technology.

[0517] "A means of generating diary data using natural language generation" refers to a technology that automatically generates diary-style data in a human-readable language format based on text data.

[0518] "Means for inputting an image and analyzing the image to identify a specific event" refers to a technology that takes in and analyzes image data to identify specific events or situations occurring within the image.

[0519] "Means for analyzing emotions and adding emotional information to the diary data" refers to a technique that analyzes the characteristics of audio data, estimates the speaker's emotional state, and then adds information related to those emotions to the diary data.

[0520] "A means for a mobile device to patrol a home and record audio and images in real time" refers to a technology in which a device that can move around a home continuously acquires and records audio and image data in real time.

[0521] "Means for transmitting the generated diary data to the cloud" refers to a technology that saves locally generated diary data to a cloud storage system via the internet.

[0522] "Means for making the diary data accessible from a remote location" refers to a technology that allows diary data stored in the cloud to be safely and easily viewed from other locations, overcoming physical limitations.

[0523] "Means of synchronizing with other computing devices" refers to technologies that maintain data consistency between different devices while updating and sharing the same data in real time.

[0524] "Means of providing a display device to a user" refers to technology that includes a screen or device for the user to view and edit generated data.

[0525] The system to realize this application consists of a childcare support robot that operates continuously in the home, recording and analyzing audio and images during childcare. The server uses a mobile unit equipped with an audio input device and an image capture device to record parent-child interactions in real time.
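The patrol-and-record behavior of the mobile unit can be illustrated as a loop over rooms. This is a minimal sketch under stated assumptions: the room list, the `Capture` type, the `patrol` function, and the sensor callables are all hypothetical, and real navigation and sensing are replaced by stand-ins.

```python
from dataclasses import dataclass
from itertools import cycle, islice
from typing import Callable


@dataclass
class Capture:
    room: str
    audio: bytes
    image: bytes


def patrol(rooms: list[str],
           record_audio: Callable[[str], bytes],
           take_photo: Callable[[str], bytes],
           stops: int) -> list[Capture]:
    """Visit rooms in a repeating order and capture audio and an image at each stop."""
    captures = []
    for room in islice(cycle(rooms), stops):
        captures.append(Capture(room, record_audio(room), take_photo(room)))
    return captures


# Stand-in sensors so the sketch runs without hardware.
log = patrol(
    ["living room", "nursery"],
    record_audio=lambda room: f"audio@{room}".encode(),
    take_photo=lambda room: f"image@{room}".encode(),
    stops=3,
)
print([c.room for c in log])
```

Each capture pairs the audio and image taken at the same stop, so that later steps can associate speech with the scene in which it occurred.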

[0526] The voice input device captures parental speech and child vocalizations and sends them to a server. The server uses speech recognition software to convert the voice data into text data. The "speech_recognition" library is used in this process, and the converted text data is further processed into a diary-style document by a natural language generation engine.

[0527] Next, the server receives image data captured by the image capture device and uses image analysis algorithms to recognize specific events. In this process, facial recognition and behavioral recognition technologies are utilized to identify important moments in childcare. The identified events are then added to the appropriate locations in the diary data.

[0528] The server also performs emotion analysis based on the voice data to estimate the emotional state of the parent or child. Using the "emotion_analysis" module, it analyzes emotions from the tone and speed of the voice and adds that information to the diary data.

[0529] The generated diary data is securely stored in cloud storage. Through the "cloud_storage_module," the data can be accessed from multiple devices, allowing users to refer to their records from anywhere and share them with family members as needed.

[0530] For example, if a user says, "Today my child drew a picture for the first time," and takes a photo of that moment, the system will automatically generate a diary entry such as, "Today my child drew a picture with all their heart for the first time. It was a very touching moment, and the whole family celebrated." This allows users to record precious moments in parenting in detail without having to take any special action.

[0531] Examples of prompts to input into a generative AI model:

[0532] "Identify touching moments related to childcare from audio data and generate emotionally rich entries in diary format."

[0533] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0534] Step 1:

[0535] The server acquires parental speech and child vocalizations in real time from the voice input device installed in the terminal. When an audio signal is input, the server passes the audio signal to speech recognition software, which converts it into text data. At this time, the audio data is sampled, features are extracted, and then it is converted into a string.

[0536] Step 2:

[0537] Based on the converted text data, the server uses a natural language generation engine to generate a diary-style document. Taking the text data as input, and based on a language model, it outputs human-readable text. This creates a diary documenting events related to childcare.

[0538] Step 3:

[0539] The server receives image data acquired from the terminal's image capture device. The input images are processed using image analysis algorithms. The server performs face recognition and behavior recognition to identify specific events. The identified event information is incorporated into the diary data, and appropriate explanations are added.

[0540] Step 4:

[0541] Based on the audio data, the server performs emotion analysis. The audio data is input, and by analyzing its acoustic characteristics, the emotional state of the parent or child is estimated. As a result of the analysis, emotional information such as joy or surprise is generated and added to the diary data.

[0542] Step 5:

[0543] The server uploads the generated diary data to cloud storage. The diary data is then securely transmitted via a cloud API. The diary data stored in the cloud is synchronized and accessible from other devices.

[0544] Step 6:

[0545] Users can access their cloud-based diary data from another device as needed. By starting their device and accessing cloud storage, the saved diary data is downloaded and displayed. This allows users to review and share their childcare records with their family from anywhere.
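Step 1 above mentions that the audio data is sampled and that features are extracted before conversion into a string. As a simplified illustration of that feature extraction stage, the following Python sketch splits a sampled signal into overlapping frames and computes the energy of each frame. The function names and the frame parameters are assumptions; practical recognizers generally use richer features than frame energy.

```python
def frame_signal(samples: list[float], frame_len: int, hop: int) -> list[list[float]]:
    """Split a sampled signal into overlapping fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def frame_energy(frame: list[float]) -> float:
    """Mean squared amplitude of one frame, a basic acoustic feature."""
    return sum(s * s for s in frame) / len(frame)


signal = [0.0, 0.0, 1.0, -1.0, 1.0, -1.0, 0.0, 0.0]
frames = frame_signal(signal, frame_len=4, hop=2)
energies = [frame_energy(f) for f in frames]
print(energies)
```

The sequence of per-frame features, rather than the raw waveform, is what a recognition model would consume to produce text.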

[0546] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0547] This invention incorporates an emotion engine that analyzes the user's emotions, in addition to a system that automatically creates childcare records using voice and image data. This system allows users to record important moments in childcare and save them, including the emotions they felt at the time, without requiring any special operation.

[0548] When a user speaks about childcare, the device activates its voice input function and records the audio data. The device then transfers the recorded audio data to a server. The server uses speech recognition technology to convert the audio data into text. A natural language generation engine then processes this text data and generates text in a diary format.

[0549] In addition, the system is equipped with an emotion engine to recognize the user's emotions. Based on voice data, the emotion engine analyzes the user's voice quality, tone, and speed to determine their emotional state. This emotional information is then added to the diary data, allowing the user to reflect on their emotions at that time. Furthermore, the emotion engine extracts intentions and emotions from text data and adds appropriate nuances to the writing.

[0550] For example, if a user says, "I was so happy today because my child rode a bicycle for the first time," the emotion engine extracts the emotions of "joy" and "sense of accomplishment" from that voice and reflects them in the diary data. This generates a diary entry that vividly conveys emotions such as, "Today, my child was able to ride a bicycle by themselves for the first time, and as a parent, I was extremely happy."
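The extraction of emotions from text mentioned above can be illustrated with a keyword lookup. This is a toy sketch only: the cue phrases in `EMOTION_CUES` and the function name `extract_emotions` are hypothetical, the disclosure does not describe the emotion engine's vocabulary or method, and the sketch works on the recognized text rather than on the voice itself.

```python
# Hypothetical cue phrases; the disclosure does not specify the engine's vocabulary.
EMOTION_CUES = {
    "joy": ("happy", "glad", "delighted"),
    "sense of accomplishment": ("for the first time", "finally", "managed to"),
    "surprise": ("surprised", "unexpectedly"),
}


def extract_emotions(text: str) -> list[str]:
    """Return every emotion whose cue phrase appears in the text."""
    lowered = text.lower()
    return [emotion for emotion, cues in EMOTION_CUES.items()
            if any(cue in lowered for cue in cues)]


labels = extract_emotions(
    "I was so happy today because my child rode a bicycle for the first time"
)
print(labels)
```

The resulting labels could then be passed to the natural language generation engine so that the generated entry reflects them.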

[0551] When a user takes a photo with their device, the image data is sent to a server, where an image recognition algorithm identifies important events and people. The image recognition results are incorporated into the diary data as visual elements and saved together with the text.

[0552] All generated data is stored in the cloud and synchronized with other devices. This synchronization allows users to access their diaries from various devices and easily view and edit complete childcare records, including diagrams, photos, and text.

[0553] Thus, the present invention provides a system that allows for the preservation of deeper memories by analyzing and recording the emotional elements in childcare records.

[0554] The following describes the processing flow.

[0555] Step 1:

[0556] When a user speaks about their experiences with childcare, the device activates its voice input function and records the user's voice. The recorded audio data is saved to the device immediately.

[0557] Step 2:

[0558] Once recording is complete, the device transfers the audio data to the server. During data transfer, the audio data is encrypted to protect privacy.

[0559] Step 3:

[0560] The server receives the audio data, which is then analyzed by a speech recognition module to convert it into appropriate text data. Acoustic and linguistic models are used during this process to ensure accurate text transcription.

[0561] Step 4:

[0562] The server passes text data to a natural language generation engine, which then generates meaningful sentences in a diary format. This process constructs sentences that include contextually relevant content, ensuring the text is easily readable.

[0563] Step 5:

[0564] The server analyzes the received audio data using an emotion engine to identify emotions from the user's voice tone and speaking style. The identified emotion information is then added as emotion labels to the diary data produced by natural language generation.

[0565] Step 6:

[0566] When a user takes a photo while taking care of their child, the device sends the image data to the server. Before uploading, the photos undergo moderate compression to optimize their size while ensuring high quality.

[0567] Step 7:

[0568] The server uses image recognition technology to analyze the transmitted images. Here, it recognizes specific events and situations through facial recognition and object detection, and organizes the content deemed important to reflect in the diary.

[0569] Step 8:

[0570] The server generates the final diary data, saving the included text, images, and sentiment information as a single document to the cloud. During this process, data integrity is verified, and the data is synchronized in real time across the user's other devices.
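
One possible way to realize the single-document storage and integrity verification of Step 8 is sketched below. The text does not specify how integrity is verified; the SHA-256 checksum over a canonical JSON encoding, and the function names, are assumptions made for illustration.

```python
import hashlib
import json


def _canonical_bytes(body: dict) -> bytes:
    # Sorted keys give a stable byte representation for hashing.
    return json.dumps(body, sort_keys=True, ensure_ascii=False).encode("utf-8")


def build_diary_document(text, image_refs, emotions) -> dict:
    """Bundle text, image references, and emotion information with a checksum."""
    body = {"text": text, "images": list(image_refs), "emotions": list(emotions)}
    return {"body": body, "sha256": hashlib.sha256(_canonical_bytes(body)).hexdigest()}


def verify_diary_document(document: dict) -> bool:
    """Recompute the checksum and compare it with the stored value."""
    expected = hashlib.sha256(_canonical_bytes(document["body"])).hexdigest()
    return expected == document["sha256"]


document = build_diary_document(
    "Today, my child rode a bicycle for the first time.",
    ["photo-001.jpg"],  # hypothetical image reference
    ["joy"],
)
print(verify_diary_document(document))  # True
```

A receiving device could run `verify_diary_document` after synchronization and request the document again if the check fails.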

[0571] Step 9:

[0572] Users can view the generated diary data from any device. The application provides a user-friendly interface that allows them to add comments and edit records as needed.

[0573] (Example 2)

[0574] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0575] Keeping a record of child-rearing typically requires parents to manually write diaries or compile photos into albums, which is time-consuming and laborious. Furthermore, it's difficult to reflect emotions and feelings in these records in real time. Additionally, easily accessing and editing the generated records across multiple devices is challenging.

[0576] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0577] In this invention, the server includes a device for inputting voice, a device for converting the voice into text data, and a device for generating recorded data by performing natural language generation based on the text data. This allows users to record important moments in childcare through voice input without having to manually enter detailed information themselves, and to automatically generate records that accurately reflect emotions and content. Furthermore, this data is managed on the cloud, enabling easy synchronization and editing across various devices.

[0578] A "speech input device" is a device that has the function of capturing a user's speech as a digital signal.

[0579] A "device that converts to text data" is a device that analyzes audio signals and generates corresponding strings of characters.

[0580] A "device that generates natural language and records data" is a device that automatically creates human-readable text based on character data and saves it as a record.

[0581] A "device that analyzes images to identify specific events" is a device that analyzes input image data and has the function of identifying important events or elements from it.

[0582] A "device that analyzes emotions and adds emotional information to the recorded data" is a device that extracts emotions from voice or text data and adds them to the recorded data, thereby giving the content emotional nuances.

[0583] A "device that enhances recorded data as a visual element" is a device that has the function of making information recorded using image data more visually appealing and richer in content.

[0584] An "information management infrastructure" is a network infrastructure for safely and efficiently storing and managing digital data.

[0585] An "information processing terminal" is an electronic device capable of creating, editing, saving, and transmitting recorded data.

[0586] A "user interface" is software or hardware that provides screens and operating methods used by users to access and manipulate digital data.

[0587] This invention is a system for automatically generating childcare records, utilizing voice input, emotion analysis, and image recognition technologies. This system allows users to record daily childcare moments and save them, including the emotions associated with them, without requiring any special operation.

[0588] The system works as follows: When a user speaks about childcare, the device uses an audio input device to record the audio as a digital signal. Information processing terminals such as smartphones and tablets are used for this purpose. The recorded audio data is transferred from the device to the server via a secure communication protocol.

[0589] The server converts the received audio data into text data using speech recognition software. Specifically, a general speech recognition API is used. This text data is processed by a natural language generation engine and generated as recorded data. A general generative AI model is applied to the natural language generation engine used here.

[0590] Furthermore, the server uses an emotion engine to analyze emotions from the audio data. This process determines the user's emotions based on factors such as voice quality, tone, and speed, and adds this information to the recorded data. Emotions are also extracted from text data, and appropriate nuances are added to the text.

[0591] When a user takes childcare-related photos with their device, the image data is sent to a server. The server uses an image recognition algorithm to identify important events and people. The identified information is incorporated into the recorded data as visual information and stored together with text.

[0592] All data is stored in the cloud and synchronized across different devices. This allows users to easily access, view, and edit recorded data from their home PC or devices while on the go.

[0593] For example, if a user says, "I was so happy today because my child rode a bicycle for the first time," the emotion engine extracts the emotions of "joy" and "accomplishment" from this audio and reflects them in the recorded data. This generates emotionally rich sentences such as, "Today, my child rode a bicycle by themselves for the first time. I was so happy."

[0594] An example of a prompt message is: "Audio data: Today my child was happy to ride a bicycle for the first time. Emotion: Joy, sense of accomplishment. Please summarize this in a diary entry."
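
The prompt message above can be assembled from the speech recognition result and the emotion labels, as in the following minimal sketch. The function name `build_diary_prompt` is hypothetical; only the wording of the prompt follows the example in the text.

```python
from typing import Sequence


def build_diary_prompt(transcript: str, emotions: Sequence[str]) -> str:
    """Combine the transcript and emotion labels into a single prompt string."""
    emotion_text = ", ".join(emotions)
    return (
        f"Audio data: {transcript} "
        f"Emotion: {emotion_text}. "
        "Please summarize this in a diary entry."
    )


prompt = build_diary_prompt(
    "Today my child was happy to ride a bicycle for the first time.",
    ["Joy", "sense of accomplishment"],
)
print(prompt)
```

The resulting string would then be passed to the generative AI model as the instruction for diary generation.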

[0595] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0596] Step 1:

[0597] When a user begins speaking about childcare, the device uses its voice input device to record the audio as a digital signal in real time. The input data is the user's voice, which is then recorded and output as digital audio data. Smartphones and tablets are useful for this process.

[0598] Step 2:

[0599] The terminal transfers recorded audio data to the server using a secure communication protocol. The input here is digitized audio data, and the output is the secure transmission of data to the server. Appropriate encryption technology is used to ensure that the audio data reliably reaches the server.

[0600] Step 3:

[0601] The server converts the received audio data into text data using speech recognition software. The input is the audio data stored on the server, and the output is the text data generated by speech recognition. In this step, the audio signal is analyzed using a speech recognition API and the corresponding text is generated.

[0602] Step 4:

[0603] The server processes the generated text data using a natural language generation engine and generates conversational text as recorded data. The input is text data generated by speech recognition, and the output is text in natural language format. Text generation based on prompt sentences is performed using a generative AI model.

[0604] Step 5:

[0605] The server uses an emotion engine to analyze emotions from audio data and adds emotional information to the recorded data. The input here is audio and text data, and the output is recorded data with added emotional information. By analyzing the voice tone and speed, the server identifies the user's current emotions and reflects them in the recording.

[0606] Step 6:

[0607] The user sends image data captured with their device to the server. The input is the captured image data, and the output is the transmission of data to the server. This is done easily and quickly because it utilizes cloud-based data sharing.

[0608] Step 7:

[0609] The server uses image recognition algorithms to analyze transmitted image data and identify specific events or important individuals. The input is image data, and the output is recognized event information. Image content is filtered, tagged, and integrated into the recorded data.
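
The filtering and tagging of Step 7 might look like the following minimal sketch. It assumes that the image recognition algorithm returns labels with confidence scores; the `Detection` class, the 0.8 threshold, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Detection:
    """Hypothetical recognition result: a label and a confidence in [0, 1]."""
    label: str
    confidence: float


def select_tags(
    detections: List[Detection],
    min_confidence: float = 0.8,
    important_labels: Optional[Set[str]] = None,
) -> List[str]:
    """Keep confident (and optionally whitelisted) labels, most confident first."""
    tags: List[str] = []
    for det in sorted(detections, key=lambda d: d.confidence, reverse=True):
        if det.confidence < min_confidence:
            continue  # filter out uncertain results
        if important_labels is not None and det.label not in important_labels:
            continue  # filter out labels not considered important
        if det.label not in tags:
            tags.append(det.label)
    return tags


detections = [
    Detection("bicycle", 0.95),
    Detection("child", 0.90),
    Detection("tree", 0.40),
    Detection("bicycle", 0.85),
]
print(select_tags(detections))  # ['bicycle', 'child']
```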

[0610] Step 8:

[0611] The server stores all data in the cloud and synchronizes it with different data processing terminals. Input is the final recorded data, and output is storage in the cloud and data synchronization to different terminals. This allows users to access and edit data from anywhere.
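
The storage and synchronization of Step 8 can be illustrated with the in-memory sketch below. The `CloudStore` and `Terminal` classes are hypothetical stand-ins for the cloud and the information processing terminals; an actual system would use network storage rather than Python dictionaries.

```python
class Terminal:
    """Hypothetical information processing terminal holding local copies."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.local = {}

    def receive(self, record_id: str, record: dict) -> None:
        self.local[record_id] = record


class CloudStore:
    """Hypothetical cloud store that pushes each saved record to all terminals."""

    def __init__(self) -> None:
        self._records = {}
        self._terminals = []

    def register(self, terminal: Terminal) -> None:
        self._terminals.append(terminal)

    def save(self, record_id: str, record: dict) -> None:
        self._records[record_id] = dict(record)
        for terminal in self._terminals:
            # Each terminal gets its own copy so local edits stay local until saved.
            terminal.receive(record_id, dict(record))


store = CloudStore()
phone, home_pc = Terminal("phone"), Terminal("home-pc")
store.register(phone)
store.register(home_pc)
store.save("entry-1", {"text": "First bicycle ride"})
print(phone.local == home_pc.local)  # True
```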

[0612] (Application Example 2)

[0613] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0614] In childcare, there is a need for parents to record important moments without missing them and to reflect their emotions and thoughts in real time. Furthermore, there is a need for systems that reduce the emotional and cognitive burden on parents and support the childcare process. Conventional technologies have limited capabilities for automatic generation of childcare records and emotion analysis, failing to fully utilize the potential of multi-functional digital devices.

[0615] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0616] In this invention, the server includes means for inputting voice, means for converting voice into text data, means for generating diary data by performing natural language generation based on the text data, means for inputting images and analyzing images to identify specific events, means for analyzing emotions based on voice and adding emotional information to the diary data, means for analyzing the overall nature of the diary data and providing emotional and cognitive support using a new information processing device, and means for automatically generating records using multiple digital devices and saving childcare records from multiple perspectives. As a result, parents can richly record moments of childcare and confidently look back on emotions and events without requiring any special operation.

[0617] "Means of inputting voice" refers to a function used by an information processing device to accurately acquire voice data.

[0618] "Means of converting to text data" refers to the process of converting acquired audio data into text format.

[0619] "A means of generating diary data by performing natural language generation" refers to a technology that automatically creates a structured diary using natural language processing based on text data.

[0620] "A means of inputting an image, analyzing the image, and identifying a specific event" refers to an algorithm that accurately recognizes image data and identifies the events or situations captured within it.

[0621] "A means of analyzing emotions based on voice and adding emotional information to diary data" refers to the process of extracting emotions from voice data and integrating them into diary data.

[0622] A "new information processing device" is a device designed using the latest technology and possesses high data processing capabilities.

[0623] "A means of automatically generating records using digital devices and saving childcare records from multiple perspectives" refers to a method of automatically collecting and saving childcare-related information using multiple electronic devices.

[0624] This invention is a system that automatically generates childcare records and reduces the emotional and cognitive burden on users. This system integrates a voice input device, a voice processing server, an image recognition device, a cloud synchronization system, and an emotion analysis engine to support parents.

[0625] The user inputs conversations about childcare using a device equipped with speech recognition capabilities. The device sends the audio data to a cloud server, which uses the Google Cloud Speech-to-Text API to convert the audio into text data. The resulting text data is then processed by OpenAI's GPT model, which performs natural language generation to create an emotionally rich childcare diary.

[0626] During this process, the Microsoft Azure Text Analytics API analyzes the tone and speed of the voice to extract the user's emotions. This emotional information is integrated into the diary data, allowing the user to later reflect on their feelings at that time. Additionally, photos taken by the user are analyzed by Amazon Rekognition to identify specific events and people. This ensures that the record includes visual elements as well.

[0627] All generated data is stored in the cloud using Firebase and can be synchronized with other digital devices. This allows users to view and edit their parenting records from various devices. The system enriches parents' parenting experience by recording daily moments of parenting in real time and providing emotional support.

[0628] For example, if a user voice-inputs "Today my child rode a bicycle for the first time," the system can analyze the content, extract "joy" and "sense of accomplishment" using its emotion engine, and reflect them in the diary. An example of a prompt would be, "My 3-year-old child sang a song by themselves for the first time. Please generate a record that reflects the joy and surprise the parent felt at that moment."

[0629] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0630] Step 1:

[0631] The user inputs voice using a device. The input voice data is temporarily stored on the device and then sent to a cloud server for speech recognition. Here, the voice is captured as data without any user intervention.

[0632] Step 2:

[0633] The server uses the Google Cloud Speech-to-Text API to convert the transmitted audio data into text data. It parses the audio data and converts its content into structured text format. The input here is audio data, and the output is text data.

[0634] Step 3:

[0635] The server inputs the acquired text data into an OpenAI GPT model and performs natural language generation. The generative AI model generates detailed and emotionally rich diary-style entries based on the input text. Here, the input is text data, and the output is diary entries.

[0636] Step 4:

[0637] The server uses the Microsoft Azure Text Analytics API to perform sentiment analysis on the audio data, extracting emotions from the user's tone and speed of voice. This analyzes the emotional state, and the resulting emotional information is directly added to the diary data. The input is audio data, and the output is data containing emotional information.

[0638] Step 5:

[0639] The user takes an image with their device, and the image data is sent to a cloud server. A server using Amazon Rekognition analyzes the image to identify specific events or people. The analysis results are converted into text format and integrated as visual information into the diary data. The input is image data, and the output is the analyzed information used in the diary.

[0640] Step 6:

[0641] The server stores the generated diary data using Firebase cloud storage and synchronizes it with other digital devices as needed. This allows users to view and edit their childcare records from any device. The input is all the data that makes up the diary, and the output is the diary data stored in cloud storage.
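
The flow of Steps 1 to 6 can be sketched as a pipeline whose stages are supplied from outside. In this minimal sketch the `DiaryPipeline` class and all stage functions are hypothetical; the stubs return placeholder values and do not call the services named above, whose client libraries are not shown in the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DiaryPipeline:
    """Hypothetical pipeline; each stage is an injected callable."""
    transcribe: Callable[[bytes], str]             # Step 2: audio -> text
    generate_diary: Callable[[str], str]           # Step 3: text -> diary entry
    analyze_emotion: Callable[[bytes], List[str]]  # Step 4: audio -> emotions
    describe_image: Callable[[bytes], List[str]]   # Step 5: image -> tags
    store: Callable[[dict], None]                  # Step 6: save diary data

    def run(self, audio: bytes, image: bytes) -> dict:
        text = self.transcribe(audio)
        diary = {
            "entry": self.generate_diary(text),
            "emotions": self.analyze_emotion(audio),
            "image_tags": self.describe_image(image),
        }
        self.store(diary)
        return diary


saved: List[dict] = []
pipeline = DiaryPipeline(
    transcribe=lambda audio: "Today my child rode a bicycle for the first time.",
    generate_diary=lambda text: f"Diary: {text}",
    analyze_emotion=lambda audio: ["joy"],
    describe_image=lambda image: ["bicycle"],
    store=saved.append,
)
result = pipeline.run(audio=b"placeholder-audio", image=b"placeholder-image")
print(result["entry"])  # Diary: Today my child rode a bicycle for the first time.
```

Keeping each stage behind a plain callable lets a speech recognition, generation, or image analysis service be swapped without changing the flow itself.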

[0642] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0643] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0644] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0645] [Fourth Embodiment]

[0646] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0647] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0648] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0649] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0650] The microphone 238 receives instructions from the user 20 by picking up the voice uttered by the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0651] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0652] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0653] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
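
As a purely illustrative sketch of how the controlled object 443 might be driven, the following maps an emotion label to an eye LED state and motor commands. The `Expression` class, the command names, and the mapping itself are hypothetical; the text states only that emotions and facial expressions can be expressed through the motors and the eye LEDs.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Expression:
    """Hypothetical control target: eye LED state plus motor commands."""
    eye_led_on: bool
    motor_commands: Tuple[str, ...]


# Hypothetical mapping; actual gestures would depend on the robot's design.
EXPRESSIONS: Dict[str, Expression] = {
    "joy": Expression(eye_led_on=True, motor_commands=("raise_arms",)),
    "surprise": Expression(eye_led_on=True, motor_commands=("open_hands",)),
}
NEUTRAL = Expression(eye_led_on=False, motor_commands=())


def express(emotion: str) -> Expression:
    """Look up the expression for an emotion, falling back to a neutral pose."""
    return EXPRESSIONS.get(emotion, NEUTRAL)


print(express("joy").motor_commands)  # ('raise_arms',)
```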

[0654] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0655] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0656] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0657] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0658] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0659] This invention provides a system that automatically creates childcare records using voice and images, eliminating the need for parents to manually record important moments in childcare while allowing them to save and manage detailed data.

[0660] When a user speaks about childcare-related events, the device captures the user's voice using its voice input function. The recorded audio is first acquired as audio data in real time by the device and then sent to the server. This system uses a speech recognition algorithm to convert the audio data into text data.

[0661] After the server converts the audio to text, a natural language generation engine runs to create a diary-style document based on the converted text data. At this stage, grammatical consistency and readability are considered, and the text is refined to be easy for the user to understand when reviewing it later.

[0662] Furthermore, photos related to childcare taken by the user are transferred from the device to the server. The server analyzes these images, performing facial recognition and behavioral recognition to identify specific events. When important moments are recognized, corresponding explanations are added to the diary data.

[0663] The server then performs emotion analysis on the audio data. By analyzing the tone and speed of the voice, emotions such as joy and surprise are reflected in the diary data. This allows users to have richer records with emotional nuances.

[0664] Ultimately, all data is securely stored in the cloud and synchronized across the user's other devices. This allows users to access their diaries from any device and edit them as needed. Based on this, users can review their records on a daily basis and share them with their families.

[0665] For example, if a user says, "Today my child rode a bicycle for the first time," and takes a photo, the natural language generation engine will generate a detailed diary entry such as, "Today, my child rode a bicycle for the first time, and here they are, pedaling with all their might. It was a day when family and friends celebrated with them." This entire process is automated, and the user does not need to do anything special.

[0666] The following describes the processing flow.

[0667] Step 1:

[0668] When a user speaks about events related to childcare, the device activates its voice input function and records the user's voice. The device suppresses ambient noise and saves the recording as clear audio data.

[0669] Step 2:

[0670] The device transfers the recorded audio data to the server. During the transfer, encryption is applied to protect the data, and a secure communication channel is used.

[0671] Step 3:

[0672] The server analyzes the received audio data using a speech recognition system and converts it into linguistic text. This process uses a language model to correct any errors in speech recognition.
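
One common way to use a language model for correcting recognition errors is to rescore candidate transcriptions. The following minimal sketch assumes that the speech recognition system returns several hypotheses with acoustic log scores, and uses a toy unigram model with add-one smoothing. The corpus, scores, and function names are hypothetical, and the text does not state that this particular technique is used.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def train_unigram(corpus: List[str]) -> Callable[[str], float]:
    """Return a log-probability function from a tiny corpus (add-one smoothing)."""
    counts = Counter(word for sentence in corpus for word in sentence.lower().split())
    total_words = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def log_prob(word: str) -> float:
        return math.log((counts[word] + 1) / (total_words + vocab_size))

    return log_prob


def rescore(
    hypotheses: List[Tuple[str, float]],
    log_prob: Callable[[str], float],
    lm_weight: float = 1.0,
) -> str:
    """Pick the hypothesis with the best combined acoustic and language score."""
    def combined_score(hypothesis: Tuple[str, float]) -> float:
        text, acoustic_log_score = hypothesis
        lm_score = sum(log_prob(word) for word in text.lower().split())
        return acoustic_log_score + lm_weight * lm_score

    return max(hypotheses, key=combined_score)[0]


log_prob = train_unigram(["my child rode a bicycle", "my child rode a bicycle today"])
hypotheses = [("my child road a bicycle", -1.0), ("my child rode a bicycle", -1.2)]
print(rescore(hypotheses, log_prob))  # my child rode a bicycle
```

Here the acoustically preferred "road" is replaced by "rode" because the language model has seen "rode" in the sample corpus.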

[0673] Step 4:

[0674] The server sends text data based on voice input to a natural language generation engine. The engine generates a contextual diary entry based on the text and adjusts it to accurately reflect the user's intent.

[0675] Step 5:

[0676] The device receives images from the user and sends the image data to the server. The images are converted to the appropriate format beforehand and sent while maintaining high resolution.

[0677] Step 6:

[0678] The server uses an image recognition algorithm to analyze the received images and identify people's faces or specific scenes. When important events are extracted, information about them is added to the diary.

[0679] Step 7:

[0680] The server performs sentiment analysis on the voice data. This allows it to determine emotional nuances from the user's tone of voice and phrasing, and incorporate emotional elements into the generated diary.

[0681] Step 8:

[0682] The server saves the final diary data to a cloud database. The data is synchronized with other user devices in the cloud, establishing a state where users can access their diaries from any device.

[0683] Step 9:

[0684] Users can view the generated diary data on their device and edit or add comments as needed. Completed diaries are managed within the app and can be viewed at any time.
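
The editing and commenting of Step 9 can be pictured with the following minimal sketch. The `StoredDiary` class and the revision counter are assumptions introduced for illustration; the text does not describe how edits are tracked.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StoredDiary:
    """Hypothetical diary record managed within the app."""
    text: str
    comments: List[str] = field(default_factory=list)
    revision: int = 1


def edit_text(diary: StoredDiary, new_text: str) -> StoredDiary:
    """Replace the diary text and bump the revision only if it changed."""
    if new_text != diary.text:
        diary.text = new_text
        diary.revision += 1
    return diary


def add_comment(diary: StoredDiary, comment: str) -> StoredDiary:
    """Append a user comment and bump the revision."""
    diary.comments.append(comment)
    diary.revision += 1
    return diary


diary = StoredDiary(text="Today, my child rode a bicycle for the first time.")
edit_text(diary, "Today, my child rode a bicycle for the first time in the park.")
add_comment(diary, "Grandma was watching too.")
print(diary.revision)  # 3
```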

[0685] (Example 1)

[0686] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0687] Recording and managing important moments in childcare in detail is a significant burden for parents. In particular, there is a need for a method that allows for easy recording of important events using audio and images, and automatically generates diary-style data organized based on emotions and events. Conventional technologies perform individual voice recognition, image analysis, and emotion analysis independently, lacking a system that integrates these processes. A system is needed to address these challenges and manage childcare records more simply and comprehensively.

[0688] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0689] In this invention, the server includes a device that converts speech into text information, a device that generates text and records based on the text information, a device that analyzes images to identify specific activities, and a device that performs emotion analysis and adds emotional information to the records. This makes it possible to automatically generate detailed and emotionally nuanced childcare records based on speech and images, thereby reducing the burden on parents.

[0690] A "speech input device" refers to an input device that captures a user's speech and converts it into electrical signals.

[0691] A "device that converts to text information" refers to a processing device that analyzes audio signals and generates corresponding text data.

[0692] A "device that generates text and records data" refers to a device that uses natural language processing to create a consistent written record based on input text data.

[0693] A "device that analyzes images to identify specific activities" refers to a processing device that analyzes input image data to identify and recognize specific situations or actions.

[0694] A "device that performs emotion analysis and adds emotional information to recorded data" refers to a device that extracts emotional information from audio and images and reflects it in the recorded data.

[0695] A "remote storage medium" refers to a remote storage system accessible via the internet, used for storing and sharing recorded data.

[0696] "A function that provides an operation screen to the user" refers to a means of providing an interface for the user to view, edit, and manage recorded data.

[0697] This invention illustrates an embodiment of a system that automatically creates childcare records using voice and images. The system primarily involves a terminal, a server, and a user each playing their respective roles, collectively achieving comprehensive record generation.

[0698] The device captures the user's voice using its voice input function. Mobile devices such as smartphones and tablets are used for this purpose. The captured voice is acquired as digital data by the device and transmitted to a server via a communication network. The voice data is then processed on the server.

[0699] The server converts audio data into text data using a speech recognition algorithm. Specifically, it utilizes a common speech recognition service for this conversion. Next, the server uses a natural language generation engine to create a diary document based on this text data. The generative AI model generates text that takes grammatical consistency and readability into consideration. This process allows users to easily obtain a detailed text record from audio.

[0700] Furthermore, photos related to childcare taken by the user are sent to the server via the device. The server analyzes these images and identifies specific events through facial recognition and activity recognition. Modern image recognition tools are used for this image analysis. Descriptions corresponding to the identified events are automatically added to the diary entry.

[0701] In addition, the server performs sentiment analysis using the audio data. Through this analysis, emotions are extracted from the tone and speed of the voice, and these are reflected in the recorded document. Dedicated software for sentiment analysis is used for this purpose.

[0702] Ultimately, all recorded data is securely stored on cloud-based storage. This allows users to access the data from any device and review and edit the records as needed.

[0703] As a concrete example, consider a scenario where a user says, "Today my child learned to ride a bicycle for the first time," and takes a photo. In this case, an example prompt would be "The day my child learned to ride a bicycle for the first time." The generative AI model would then generate a record stating, "Today, my child learned to ride a bicycle for the first time, pedaling with all their might. It was a day celebrated by family and friends." This entire process is fully automated, requiring no special action from the user.

[0704] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0705] Step 1:

[0706] The device captures the user's voice. It takes the user's speech as input via a microphone and generates digital audio data. Since this audio data cannot be used directly as text, it is first sent to the server in digital format.

[0707] Step 2:

[0708] The server processes the received audio data through a speech recognition algorithm. The input is audio data sent from the terminal, which is converted into text data by the speech recognition software. This process analyzes the audio waveform and produces output as a sequence of words.

[0709] Step 3:

[0710] The server uses a generative AI model to convert text data into diary-style entries. The input is text data obtained from speech recognition, and natural language generation technology is used for data processing to output a grammatically consistent and emotionally appropriate record document.

[0711] Step 4:

[0712] The user sends images they have taken from their device to the server. The input is image data acquired by the device. The device compresses the image data into an appropriate format and transfers it to the server via the communication network.

[0713] Step 5:

[0714] The server performs image analysis to identify specific events. The input is image data sent from the terminal, and through data processing using image recognition technology, it extracts features of events and situations to identify specific activities and outputs the results.

[0715] Step 6:

[0716] The server performs emotion analysis based on the audio data. The input is the initial audio data, and emotion analysis software analyzes the tone and speed of the voice, outputting emotional information such as joy, anger, sadness, and happiness, and reflecting the results in the diary document.

[0717] Step 7:

[0718] The server saves the final record data to the cloud and synchronizes it with other devices. The input is the final record document generated on the server, and the output is securely stored in cloud storage. Furthermore, the synchronization process is performed automatically, and users can access the records from various devices.
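The seven steps above can be sketched as a single orchestration function. This is a minimal illustration under stated assumptions, not the implementation described in this disclosure: the `DiaryEntry` structure, the function name `build_diary_entry`, and the five injected callables are all hypothetical, and each callable stands in for whatever speech recognition, generative AI, image analysis, emotion analysis, and cloud storage component is actually used.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DiaryEntry:
    """One diary record assembled from the steps (hypothetical structure)."""
    text: str
    events: List[str] = field(default_factory=list)
    emotions: List[str] = field(default_factory=list)


def build_diary_entry(
    audio: bytes,                                    # Step 1: captured audio
    image: bytes,                                    # Step 4: uploaded image
    transcribe: Callable[[bytes], str],              # Step 2: speech recognition
    generate: Callable[[str], str],                  # Step 3: generative AI model
    detect_events: Callable[[bytes], List[str]],     # Step 5: image analysis
    detect_emotions: Callable[[bytes], List[str]],   # Step 6: emotion analysis
    save: Callable[[DiaryEntry], None],              # Step 7: cloud save and sync
) -> DiaryEntry:
    """Run the stages in order and return the assembled entry."""
    transcript = transcribe(audio)
    entry = DiaryEntry(text=generate(transcript))
    entry.events = detect_events(image)
    entry.emotions = detect_emotions(audio)
    save(entry)
    return entry
```

Passing the stages in as callables keeps the sketch independent of any particular service, so each stage can be replaced or tested in isolation.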

[0719] (Application Example 1)

[0720] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0721] For many parents, recording the important moments of childcare and writing a detailed childcare diary is a cumbersome task. In particular, it is difficult to record every moment without omission and to accurately capture and preserve emotional nuances. Furthermore, sharing the records with other family members requires synchronization and access across different devices. This invention aims to solve these problems for families raising children at home and to make childcare records easier and more effective to generate and share.

[0722] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0723] In this invention, the server includes means for inputting voice, means for converting the voice into text data, means for generating diary data by performing natural language generation based on the text data, means for inputting images and analyzing the images to identify specific events, means for analyzing emotions based on the voice and adding emotional information to the diary data, means for a mobile device to patrol the home and record voice and images in real time, means for transmitting the generated diary data to the cloud, and means for making the diary data accessible from a remote location. This makes it possible to automatically record important moments during childcare and generate emotionally rich diaries. Furthermore, the recorded data is securely stored via the cloud and can be easily accessed and shared from multiple devices.

[0724] "Means for inputting audio" refers to a device or technology that acquires an audio signal from an external source and prepares it for subsequent processing.

[0725] "Means for converting the voice into text data" refers to a technology that converts the acquired audio signal into text information by using speech recognition.

[0726] "Means for generating diary data by performing natural language generation" refers to a technology that automatically generates diary-style data in a human-readable language format based on the text data.

[0727] "Means for inputting an image and analyzing the image to identify a specific event" refers to a technology that takes in and analyzes image data to identify specific events or situations occurring within the image.

[0728] "Means for analyzing emotions and adding emotional information to the diary data" refers to a technique that analyzes the characteristics of audio data, estimates the speaker's emotional state, and then adds information related to those emotions to the diary data.

[0729] "Means for a mobile device to patrol the home and record voice and images in real time" refers to a technology in which a device that can move around the home continuously acquires and records audio and image data in real time.

[0730] "Means for transmitting the generated diary data to the cloud" refers to a technology that saves locally generated diary data to a cloud storage system via the internet.

[0731] "Means for making the diary data accessible from a remote location" refers to a technology that allows diary data stored in the cloud to be viewed safely and easily from other locations, regardless of physical distance.

[0732] "Means for synchronizing with other computing devices" refers to a technology that keeps data consistent across different devices while the same data is updated and shared in real time.

[0733] "Means for providing a display device to a user" refers to a technology that provides a screen or device on which the user can view and edit the generated data.

[0734] The system to realize this application consists of a childcare support robot that operates continuously in the home, recording and analyzing audio and images during childcare. The server uses a mobile unit equipped with an audio input device and an image capture device to record parent-child interactions in real time.

[0735] The voice input device captures parental speech and child vocalizations and sends them to a server. The server uses speech recognition software to convert the voice data into text data. The "speech_recognition" library is used in this process, and the converted text data is further processed into a diary-style document by a natural language generation engine.

[0736] Next, the server receives image data captured by the image capture device and uses image analysis algorithms to recognize specific events. In this process, facial recognition and behavioral recognition technologies are utilized to identify important moments in childcare. The identified events are then added to the appropriate locations in the diary data.

[0737] The server also performs emotion analysis based on the voice data to estimate the emotional state of the parent or child. Using the "emotion_analysis" module, it analyzes emotions from the tone and speed of the voice and adds that information to the diary data.
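As an illustration of how tone and speed might be turned into an emotion label, the following sketch computes a speaking rate and a loudness level and applies a simple rule. It is a toy heuristic under stated assumptions and does not represent the "emotion_analysis" module: the function names, the two thresholds, and the labels "excited", "calm", and "neutral" are hypothetical, and counting words by whitespace would not work for Japanese text without a tokenizer.

```python
import math
from typing import Sequence


def speaking_rate(transcript: str, duration_s: float) -> float:
    """Words per second; returns 0.0 if the duration is not positive."""
    if duration_s <= 0:
        return 0.0
    return len(transcript.split()) / duration_s


def rms_level(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of samples normalised to [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_emotion(rate: float, level: float,
                     fast: float = 3.0, loud: float = 0.3) -> str:
    """Toy rule; the thresholds are illustrative assumptions, not calibrated."""
    if rate >= fast and level >= loud:
        return "excited"
    if rate < fast and level < loud:
        return "calm"
    return "neutral"
```

A practical emotion engine would more likely use a trained model over richer acoustic features such as pitch contour; the rule above only shows where speed and loudness enter the decision.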

[0738] The generated diary data is securely stored in cloud storage. Through the "cloud_storage_module," the data can be accessed from multiple devices, allowing users to refer to their records from anywhere and share them with family members as needed.

[0739] For example, if a user says, "Today my child drew a picture for the first time," and takes a photo of that moment, the system will automatically generate a diary entry such as, "Today my child drew a picture with all their heart for the first time. It was a very touching moment, and the whole family celebrated." This allows users to record precious moments in parenting in detail without having to take any special action.

[0740] Examples of prompts to input into a generative AI model:

[0741] "Identify touching moments related to childcare from audio data and generate emotionally rich entries in diary format."

[0742] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0743] Step 1:

[0744] The server acquires parental speech and child vocalizations in real time from the voice input device installed in the terminal. When an audio signal is input, the server passes the audio signal to speech recognition software, which converts it into text data. At this time, the audio data is sampled, features are extracted, and then it is converted into a string.

[0745] Step 2:

[0746] Based on the converted text data, the server uses a natural language generation engine to generate a diary-style document. Taking the text data as input, the engine outputs human-readable text based on a language model. This creates a diary documenting events related to childcare.

[0747] Step 3:

[0748] The server receives image data acquired from the terminal's image capture device. The input images are processed using image analysis algorithms. The server performs face recognition and behavior recognition to identify specific events. The identified event information is incorporated into the diary data, and appropriate explanations are added.

[0749] Step 4:

[0750] Based on the audio data, the server performs emotion analysis. The audio data is input, and by analyzing its acoustic characteristics, the emotional state of the parent or child is estimated. As a result of the analysis, emotional information such as joy or surprise is generated and added to the diary data.

[0751] Step 5:

[0752] The server uploads the generated diary data to cloud storage. The diary data is then securely transmitted via a cloud API. The diary data stored in the cloud is synchronized and accessible from other devices.

[0753] Step 6:

[0754] Users can access their cloud-based diary data from another device as needed. By starting their device and accessing cloud storage, the saved diary data is downloaded and displayed. This allows users to review and share their childcare records with their family from anywhere.
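Steps 5 and 6 state that the diary data is synchronized and accessible from other devices, but do not specify how conflicting edits are resolved. The following sketch assumes a last-write-wins rule keyed on an edit timestamp. The `Revision` structure, the `merge` function, and the rule itself are assumptions introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Revision:
    """One saved version of a diary entry (hypothetical structure)."""
    entry_id: str
    text: str
    updated_at: float  # seconds since the epoch, set by the editing device


def merge(cloud: Dict[str, Revision],
          local: Dict[str, Revision]) -> Dict[str, Revision]:
    """Return the union of both stores, keeping the newer revision per entry."""
    merged = dict(cloud)
    for entry_id, revision in local.items():
        current = merged.get(entry_id)
        if current is None or revision.updated_at > current.updated_at:
            merged[entry_id] = revision
    return merged
```

Last-write-wins is simple but silently discards the older of two concurrent edits and depends on device clocks being reasonably aligned, so a real system might prefer server-assigned version numbers.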

[0755] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0756] This invention incorporates an emotion engine that analyzes the user's emotions, in addition to a system that automatically creates childcare records using voice and image data. This system allows users to record important moments in childcare and save them, including the emotions they felt at the time, without requiring any special operation.

[0757] When a user speaks about childcare, the device activates its voice input function and records the audio data. The recorded audio data is then transferred by the device to a server. This system uses speech recognition technology to convert the audio data into text, which is then further processed on the server. This text data is processed by a natural language generation engine, and the text is generated in a diary format.

[0758] In addition, the system is equipped with an emotion engine to recognize the user's emotions. Based on voice data, the emotion engine analyzes the user's voice quality, tone, and speed to determine their emotional state. This emotional information is then added to the diary data, allowing the user to reflect on their emotions at that time. Furthermore, the emotion engine extracts intentions and emotions from text data and adds appropriate nuances to the writing.

[0759] For example, if a user says, "I was so happy today because my child rode a bicycle for the first time," the emotion engine extracts the emotions of "joy" and "sense of accomplishment" from that voice and reflects them in the diary data. This generates a diary entry that vividly conveys emotions such as, "Today, my child was able to ride a bicycle by themselves for the first time, and as a parent, I was extremely happy."

[0760] When a user takes a photo with their device, the image data is sent to a server, where an image recognition algorithm identifies important events and people. The image recognition results are incorporated into the diary data as visual elements and saved together with the text.

[0761] All generated data is stored in the cloud and synchronized with other devices. This synchronization allows users to access their diaries from various devices and easily view and edit complete childcare records, including diagrams, photos, and text.

[0762] Thus, the present invention provides a system that allows for the preservation of deeper memories by analyzing and recording the emotional elements in childcare records.

[0763] The following describes the processing flow.

[0764] Step 1:

[0765] When a user speaks about their experiences with childcare, the device activates its voice input function and records the user's voice. The recorded audio data is saved to the device immediately.

[0766] Step 2:

[0767] Once recording is complete, the device transfers the audio data to the server. During data transfer, the audio data is encrypted to protect privacy.

[0768] Step 3:

[0769] The server receives the audio data, which is then analyzed by a speech recognition module to convert it into appropriate text data. Acoustic and linguistic models are used during this process to ensure accurate text transcription.

[0770] Step 4:

[0771] The server passes text data to a natural language generation engine, which then generates meaningful sentences in a diary format. This process constructs sentences that include contextually relevant content, ensuring the text is easily readable.

[0772] Step 5:

[0773] The server analyzes the received audio data using an emotion engine to identify emotions from the user's voice tone and speaking style. The identified emotion information is then added as emotion labels to the naturally generated diary data.

[0774] Step 6:

[0775] When a user takes a photo while taking care of their child, the device sends the image data to the server. Before uploading, the photos undergo moderate compression to optimize their size while ensuring high quality.

[0776] Step 7:

[0777] The server uses image recognition technology to analyze the transmitted images. Here, it recognizes specific events and situations through facial recognition and object detection, and organizes the content deemed important to reflect in the diary.

[0778] Step 8:

[0779] The server generates the final diary data, saving the included text, images, and sentiment information as a single document to the cloud. During this process, data integrity is verified, and the data is synchronized in real time across the user's other devices.

[0780] Step 9:

[0781] Users can view diary data generated from any device. The application provides users with a user-friendly interface, allowing them to add comments and edit records as needed.
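Step 8 mentions that data integrity is verified when the diary document is saved. One common way to do this, shown here as a sketch under stated assumptions, is to store a SHA-256 digest next to the document and recompute it on read. The function names `digest`, `package`, and `verify` and the document layout are hypothetical, and the disclosure does not state which verification method is used.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def digest(document: Dict[str, Any]) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(document, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def package(document: Dict[str, Any]) -> Dict[str, Any]:
    """Bundle the document with its digest before upload."""
    return {"document": document, "sha256": digest(document)}


def verify(packaged: Dict[str, Any]) -> bool:
    """Recompute the digest and compare it with the stored one."""
    return hmac.compare_digest(digest(packaged["document"]), packaged["sha256"])
```

A plain digest detects accidental corruption in transit or storage. Detecting deliberate modification would additionally require a keyed signature, which is outside this sketch.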

[0782] (Example 2)

[0783] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0784] Keeping a record of child-rearing typically requires parents to manually write diaries or compile photos into albums, which is time-consuming and laborious. Furthermore, it's difficult to reflect emotions and feelings in these records in real time. Additionally, easily accessing and editing the generated records across multiple devices is challenging.

[0785] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0786] In this invention, the server includes a device for inputting voice, a device for converting the voice into text data, and a device for generating recorded data by performing natural language generation based on the text data. This allows users to record important moments in childcare through voice input without having to manually enter detailed information themselves, and to automatically generate records that accurately reflect emotions and content. Furthermore, this data is managed on the cloud, enabling easy synchronization and editing across various devices.

[0787] A "speech input device" is a device that has the function of capturing a user's speech as a digital signal.

[0788] A "device that converts to text data" is a device that analyzes audio signals and generates corresponding strings of characters.

[0789] A "device that generates natural language and records data" is a device that automatically creates human-readable text based on character data and saves it as a record.

[0790] A "device that analyzes images to identify specific events" is a device that analyzes input image data and has the function of identifying important events or elements from it.

[0791] A "device that analyzes emotions and adds emotional information to the recorded data" is a device that extracts emotions from voice or text data and adds them to the recorded data, thereby giving the content emotional nuances.

[0792] A "device that enhances recorded data as a visual element" is a device that has the function of making information recorded using image data more visually appealing and richer in content.

[0793] An "information management infrastructure" is a network infrastructure for safely and efficiently storing and managing digital data.

[0794] An "information processing terminal" is an electronic device capable of creating, editing, saving, and transmitting recorded data.

[0795] A "user interface" is software or hardware that provides screens and operating methods used by users to access and manipulate digital data.

[0796] This invention is a system for automatically generating childcare records, utilizing voice input, emotion analysis, and image recognition technologies. This system allows users to record daily childcare moments and save them, including the emotions associated with them, without requiring any special operation.

[0797] The system works as follows: When a user speaks about childcare, the device uses an audio input device to record the audio as a digital signal. Information processing terminals such as smartphones and tablets are used for this purpose. The recorded audio data is transferred from the device to the server via a secure communication protocol.

[0798] The server converts the received audio data into text data using speech recognition software. Specifically, a general speech recognition API is used. This text data is processed by a natural language generation engine and generated as recorded data. A general generative AI model is applied to the natural language generation engine used here.

[0799] Furthermore, the server uses an emotion engine to analyze emotions from the audio data. This process determines the user's emotions based on factors such as voice quality, tone, and speed, and adds this information to the recorded data. Emotions are also extracted from text data, and appropriate nuances are added to the text.

[0800] When a user takes childcare-related photos with their device, the image data is sent to a server. The server uses an image recognition algorithm to identify important events and people. The identified information is incorporated into the recorded data as visual information and stored together with text.

[0801] All data is stored in the cloud and synchronized across different devices. This allows users to easily access, view, and edit recorded data from their home PC or devices while on the go.

[0802] For example, if a user says, "I was so happy today because my child rode a bicycle for the first time," the emotion engine extracts the emotions of "joy" and "accomplishment" from this audio and reflects them in the recorded data. This generates emotionally rich sentences such as, "Today, my child rode a bicycle by themselves for the first time. I was so happy."

[0803] An example of a prompt message is: "Audio data: Today my child was happy to ride a bicycle for the first time. Emotion: Joy, sense of accomplishment. Please summarize this in a diary entry."
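A prompt of this shape can be assembled from the transcript and the extracted emotion labels. The sketch below follows the wording of the example prompt; the function name `build_prompt` and the fallback text used when no emotion is detected are assumptions.

```python
from typing import Sequence


def build_prompt(transcript: str, emotions: Sequence[str]) -> str:
    """Combine the transcript and emotion labels into a single prompt string."""
    emotion_part = ", ".join(emotions) if emotions else "not detected"
    return (
        f"Audio data: {transcript} "
        f"Emotion: {emotion_part}. "
        "Please summarize this in a diary entry."
    )
```

Called with the transcript "Today my child was happy to ride a bicycle for the first time." and the labels "Joy" and "sense of accomplishment", the function reproduces the example prompt given above.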

[0804] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0805] Step 1:

[0806] When a user begins speaking about childcare, the device uses its voice input device to record the audio as a digital signal in real time. The input data is the user's voice, which is then recorded and output as digital audio data. Smartphones and tablets are useful for this process.

[0807] Step 2:

[0808] The terminal transfers recorded audio data to the server using a secure communication protocol. The input here is digitized audio data, and the output is the secure transmission of data to the server. Appropriate encryption technology is used to ensure that the audio data reliably reaches the server.

[0809] Step 3:

[0810] The server converts the received audio data into text data using speech recognition software. The input is the audio data stored on the server, and the output is the text data generated by speech recognition. In this step, the audio signal is analyzed using a speech recognition API and the corresponding text is generated.

[0811] Step 4:

[0812] The server processes the generated text data using a natural language generation engine and generates conversational text as recorded data. The input is text data generated by speech recognition, and the output is text in natural language format. Text generation based on prompt sentences is performed using a generative AI model.

[0813] Step 5:

[0814] The server uses an emotion engine to analyze emotions from audio data and adds emotional information to the recorded data. The input here is audio and text data, and the output is recorded data with added emotional information. By analyzing the voice tone and speed, the server identifies the user's current emotions and reflects them in the recording.

[0815] Step 6:

[0816] The user sends image data captured with their device to the server. The input is the captured image data, and the output is the transmission of data to the server. This is done easily and quickly because it utilizes cloud-based data sharing.

[0817] Step 7:

[0818] The server uses image recognition algorithms to analyze transmitted image data and identify specific events or important individuals. The input is image data, and the output is recognized event information. Image content is filtered, tagged, and integrated into the recorded data.

[0819] Step 8:

[0820] The server stores all data in the cloud and synchronizes it with different data processing terminals. Input is the final recorded data, and output is storage in the cloud and data synchronization to different terminals. This allows users to access and edit data from anywhere.
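Step 7 states that image content is filtered, tagged, and integrated into the recorded data. The sketch below shows one plausible reading: recognition results arrive as (label, confidence) pairs, low-confidence labels are dropped, and the remaining labels are attached to the record as tags. The input format, the 0.8 threshold, and the function names are assumptions and are not taken from the disclosure.

```python
from typing import Dict, List, Tuple


def select_tags(labels: List[Tuple[str, float]],
                min_confidence: float = 0.8) -> List[str]:
    """Keep labels at or above the threshold, best first, without duplicates."""
    tags: List[str] = []
    for name, confidence in sorted(labels, key=lambda item: item[1], reverse=True):
        if confidence >= min_confidence and name not in tags:
            tags.append(name)
    return tags


def attach_tags(record: Dict[str, object], tags: List[str]) -> Dict[str, object]:
    """Return a copy of the record with a 'tags' field; the input is unchanged."""
    updated = dict(record)
    updated["tags"] = list(tags)
    return updated
```

Returning a copy rather than mutating the record keeps the text produced in Step 4 intact if the image step has to be rerun.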

[0821] (Application Example 2)

[0822] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0823] In childcare, there is a need for parents to record important moments without missing them and to reflect their emotions and thoughts in real time. Furthermore, there is a need for systems that reduce the emotional and cognitive burden on parents and support the childcare process. Conventional technologies have limited capabilities for automatic generation of childcare records and emotion analysis, failing to fully utilize the potential of multi-functional digital devices.

[0824] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0825] In this invention, the server includes means for inputting voice, means for converting voice into text data, means for generating diary data by performing natural language generation based on the text data, means for inputting images and analyzing images to identify specific events, means for analyzing emotions based on voice and adding emotional information to the diary data, means for analyzing the overall nature of the diary data and providing emotional and cognitive support using a new information processing device, and means for automatically generating records using multiple digital devices and saving childcare records from multiple perspectives. As a result, parents can richly record moments of childcare and confidently look back on emotions and events without requiring any special operation.

[0826] "Means of inputting voice" refers to a function used by an information processing device to accurately acquire voice data.

[0827] "Means of converting to text data" refers to the process of converting acquired audio data into text format.

[0828] "A means of generating diary data by performing natural language generation" refers to a technology that automatically creates a structured diary using natural language processing based on text data.

[0829] "A means of inputting an image, analyzing the image, and identifying a specific event" refers to an algorithm that accurately recognizes image data and identifies the events or situations captured within it.

[0830] "A means of analyzing emotions based on voice and adding emotional information to diary data" refers to the process of extracting emotions from voice data and integrating them into diary data.

[0831] A "new information processing device" is a device designed using the latest technology and possesses high data processing capabilities.

[0832] "A means of automatically generating records using digital devices and saving childcare records from multiple perspectives" refers to a method of automatically collecting and saving childcare-related information using multiple electronic devices.

[0833] This invention is a system that automatically generates childcare records and reduces the emotional and cognitive burden on users. This system integrates a voice input device, a voice processing server, an image recognition device, a cloud synchronization system, and an emotion analysis engine to support parents.

[0834] The user inputs conversations about childcare using a device equipped with speech recognition capabilities. The device sends the audio data to a cloud server, which uses the Google Cloud Speech-to-Text API to convert the audio into text data. The resulting text data is then processed using OpenAI's GPT model to perform natural language generation, creating an emotionally rich childcare diary.

[0835] During this process, the Microsoft Azure Text Analytics API analyzes the tone and speed of the voice to extract the user's emotions. This emotional information is integrated into the diary data, allowing the user to later reflect on their feelings at that time. Additionally, photos taken by the user are analyzed by Amazon Rekognition to identify specific events and people. This ensures that the record includes visual elements as well.

[0836] All generated data is stored in the cloud using Firebase and can be synchronized with other digital devices. This allows users to view and edit their parenting records from various devices. The system enriches parents' parenting experience by recording daily moments of parenting in real time and providing emotional support.

[0837] For example, if a user voice-inputs "Today my child rode a bicycle for the first time," the system can analyze the content, extract "joy" and "sense of accomplishment" using its emotion engine, and reflect them in the diary. An example of a prompt would be, "My 3-year-old child sang a song by themselves for the first time. Please generate a record that reflects the joy and surprise the parent felt at that moment."

[0838] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0839] Step 1:

[0840] The user inputs voice using a device. The input voice data is temporarily stored on the device and then sent to a cloud server for speech recognition. Here, the voice is captured as data without any user intervention.

[0841] Step 2:

[0842] The server uses the Google Cloud Speech-to-Text API to convert the transmitted audio data into text data. It parses the audio data and converts its content into structured text format. The input here is audio data, and the output is text data.

[0843] Step 3:

[0844] The server inputs the acquired text data into an OpenAI GPT model and performs natural language generation. The generative AI model generates detailed and emotionally rich diary-style entries based on the input text. Here, the input is text data, and the output is diary entries.

[0845] Step 4:

[0846] The server uses the Microsoft Azure Text Analytics API to perform sentiment analysis on the audio data, extracting emotions from the user's tone and speed of voice. This analyzes the emotional state, and the resulting emotional information is directly added to the diary data. The input is audio data, and the output is data containing emotional information.

[0847] Step 5:

[0848] The user takes an image with their device, and the image data is sent to a cloud server. The server uses Amazon Rekognition to analyze the image and identify specific events or people. The analysis results are converted into text format and integrated as visual information into the diary data. The input is image data, and the output is the analyzed information used in the diary.

[0849] Step 6:

[0850] The server stores the generated diary data using Firebase cloud storage and synchronizes it with other digital devices as needed. This allows users to view and edit their childcare records from any device. The input is all the data that makes up the diary, and the output is the diary data stored in cloud storage.

[0851] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0852] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing an instruction, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instruction indicated by the prompt, and outputs the inference result in a data format such as audio data or text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0853] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0854] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0855] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0856] These emotions are distributed around the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0857] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the farther toward the outside of the emotion map 400 an emotion is located, the more visible (expressed in actions) it becomes.
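As an illustration only, the layout of the emotion map 400 described above (left and right sides, upper and lower sides, inside and outside) can be sketched as a point on a plane. The coordinate convention, the threshold inner_radius, and the names EmotionPoint and describe are assumptions made for this sketch and are not defined in this disclosure.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class EmotionPoint:
    """Hypothetical position of one emotion on the map, with the center at (0, 0)."""
    name: str
    x: float  # negative: left side (reaction); positive: right side (situation)
    y: float  # positive: upper side (pleasure); negative: lower side (displeasure)

    @property
    def radius(self) -> float:
        # Distance from the center of the concentric circles.
        return math.hypot(self.x, self.y)

def describe(point: EmotionPoint, inner_radius: float = 1.0) -> dict:
    """Classify a point by side, valence, and layer (inner_radius is an assumed threshold)."""
    if point.x < 0:
        side = "reaction"
    elif point.x > 0:
        side = "situation"
    else:
        side = "both"
    if point.y > 0:
        valence = "pleasure"
    elif point.y < 0:
        valence = "displeasure"
    else:
        valence = "neutral"
    layer = "inner" if point.radius <= inner_radius else "outer"
    return {"side": side, "valence": valence, "layer": layer}
```

For example, a point at (2.0, 0.5) lies on the situation side, on the pleasure side, and in the outer layer, where emotions are expressed in actions.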

[0858] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
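As an illustration only, the idea that approaching the ideal balance results in pleasure and deviating from it results in discomfort can be sketched as the change in total deviation between two states. The functions total_deviation and valence, the use of absolute differences, and the example quantities are assumptions made for this sketch; this disclosure does not specify how the balances are measured or combined.

```python
def total_deviation(state: dict[str, float], ideal: dict[str, float]) -> float:
    """Sum of absolute deviations from the ideal over every balance in `ideal`."""
    return sum(abs(state[key] - ideal[key]) for key in ideal)

def valence(
    previous: dict[str, float], current: dict[str, float], ideal: dict[str, float]
) -> float:
    """Positive (pleasure) when the state moved toward the ideal,
    negative (discomfort) when it moved away, zero when unchanged."""
    return total_deviation(previous, ideal) - total_deviation(current, ideal)

# Hypothetical balances for a robot: posture tilt (ideal 0.0) and battery level (ideal 1.0).
ideal = {"posture": 0.0, "battery": 1.0}
before = {"posture": 0.5, "battery": 0.5}
after = {"posture": 0.25, "battery": 0.75}
recovering = valence(before, after, ideal)   # moved toward the ideal
worsening = valence(after, before, ideal)    # moved away from the ideal
```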

[0859] The emotion map defines two emotions that promote learning. One is the negative emotion located around the midpoint between "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the positive emotion located around "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0860] The emotion identification model 59 feeds user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained on multiple sets of training data, each of which is a combination of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
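As an illustration only, the following sketch shows one way to pick the user's emotion from the emotion values and one way to quantify how similar the values of neighboring emotions are. The functions dominant_emotion and neighbor_penalty, the squared-difference penalty, the neighbor pairs, and the numeric values are all assumptions made for this sketch; this disclosure does not state how the similarity between nearby emotions is enforced during training.

```python
def dominant_emotion(values: dict[str, float]) -> str:
    """Return the emotion whose value is largest (an assumed decision rule)."""
    return max(values, key=values.get)

def neighbor_penalty(
    values: dict[str, float], neighbors: list[tuple[str, str]]
) -> float:
    """Sum of squared differences between emotions that sit close together on the map.
    A smaller penalty means nearby emotions have more similar values."""
    return sum((values[a] - values[b]) ** 2 for a, b in neighbors)

# Hypothetical emotion values, as might be output by the neural network.
emotion_values = {"reassured": 0.75, "calm": 0.5, "confident": 0.5, "anxious": 0.0}

# Hypothetical pairs of emotions treated as neighbors on the map.
neighbor_pairs = [("reassured", "calm"), ("calm", "confident")]

chosen = dominant_emotion(emotion_values)
penalty = neighbor_penalty(emotion_values, neighbor_pairs)
```

A term of this kind could, for instance, be added to a training loss so that neighboring emotions are pushed toward similar values, but that is only one possible design.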

[0861] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0862] In the above embodiment, an example was given in which the specific processing is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and the specific processing may be distributed across multiple computers, including the computer 22. For example, the data generation model 58 may be provided in a device external to the data processing device 12, and the external device may generate data according to the input data.

[0863] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0864] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0865] Furthermore, the entirety of the specific processing program 56 need not be stored in a storage device such as a server connected to the data processing device 12 via the network 54, or in the storage 32; only a portion of the specific processing program 56 may be stored there.

[0866] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0867] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0868] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0869] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0870] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0871] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0872] The following is further disclosed regarding the embodiments described above.

[0873] (Claim 1)

[0874] A means of inputting voice,

[0875] means for converting the aforementioned audio into text data,

[0876] A means for generating diary data by performing natural language generation based on the aforementioned text data,

[0877] A means for inputting an image and analyzing the image to identify a specific event,

[0878] A system including means for analyzing emotions based on the aforementioned voice and adding emotional information to the diary data.

[0879] (Claim 2)

[0880] The system according to claim 1, further comprising means for saving the aforementioned diary data on the cloud and synchronizing it with other devices.

[0881] (Claim 3)

[0882] The system according to claim 1, further comprising means for providing a user with an interface that allows editing of diary data based on the results of the analysis of the aforementioned audio and images.
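As an illustration only, the means listed in claim 1 above (voice input, conversion to text data, diary generation, image analysis, and emotion analysis) can be sketched as a single pipeline. The names DiaryEntry and DiaryPipeline and the four injected callables are hypothetical; each callable stands in for a component (for example, a speech recognizer or the data generation model 58) whose implementation is not specified here.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class DiaryEntry:
    """Hypothetical diary data: generated text, emotion information, identified events."""
    text: str
    emotion: Optional[str] = None
    events: list[str] = field(default_factory=list)

@dataclass
class DiaryPipeline:
    transcribe: Callable[[bytes], str]                 # voice -> text data
    generate: Callable[[str], str]                     # text data -> diary text
    identify_event: Callable[[bytes], Optional[str]]   # image -> event, or None
    analyze_emotion: Callable[[bytes], str]            # voice -> emotion label

    def run(self, voice: bytes, images: list[bytes]) -> DiaryEntry:
        # Convert the voice into text data, then generate the diary text from it.
        entry = DiaryEntry(text=self.generate(self.transcribe(voice)))
        # Analyze each image and keep only the images in which an event was identified.
        for image in images:
            event = self.identify_event(image)
            if event is not None:
                entry.events.append(event)
        # Analyze the emotion from the voice and add it to the diary data.
        entry.emotion = self.analyze_emotion(voice)
        return entry
```

Passing the components in as callables keeps the sketch independent of any particular speech recognition, image analysis, or generative AI service.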

[0883] "Example 1"

[0884] (Claim 1)

[0885] A device for inputting voice,

[0886] A device that converts the aforementioned audio into text information,

[0887] A device that generates text and records data based on the aforementioned text information,

[0888] A device that takes an image as input, analyzes the image, and identifies a specific activity,

[0889] A device for analyzing emotions based on the aforementioned voice and adding emotional information to the recorded data,

[0890] A system including a device for adding emotional nuances to recorded data generated based on the aforementioned emotional information.

[0891] (Claim 2)

[0892] The system according to claim 1, further comprising a function to store the recorded data on a remote storage medium and synchronize it with other information processing devices.

[0893] (Claim 3)

[0894] The system according to claim 1, further comprising a function to provide the user with an operation screen that allows editing of recorded data based on the analysis results of the aforementioned audio and image.

[0895] "Application Example 1"

[0896] (Claim 1)

[0897] A means of inputting voice,

[0898] means for converting the aforementioned audio into text data,

[0899] A means for generating diary data by performing natural language generation based on the aforementioned text data,

[0900] A means for inputting an image and analyzing the image to identify a specific event,

[0901] A means for analyzing emotions based on the aforementioned voice and adding emotional information to the aforementioned diary data,

[0902] A means for a mobile device to patrol a home and record audio and images in real time,

[0903] A means of sending the generated diary data to the cloud,

[0904] A means to make diary data accessible from a remote location,

[0905] A system including the above means.

[0906] (Claim 2)

[0907] The system according to claim 1, further comprising means for synchronizing diary data with other computing devices.

[0908] (Claim 3)

[0909] The system according to claim 1, further comprising means for providing a user with a display device capable of editing diary data based on the analysis results of the aforementioned audio and image.

[0910] "Example 2 of combining an emotion engine"

[0911] (Claim 1)

[0912] A device for inputting voice,

[0913] A device that converts the aforementioned audio into text data,

[0914] A device that generates recorded data by performing natural language generation based on the aforementioned text data,

[0915] A device that takes an image as input, analyzes the image, and identifies a specific event,

[0916] A device for analyzing emotions based on the aforementioned voice and adding emotional information to the recorded data,

[0917] A device that uses emotional information analyzed from speech to add appropriate nuances to recorded data,

[0918] A device that enhances recorded data as a visual element using the aforementioned image,

[0919] A system including the above devices.

[0920] (Claim 2)

[0921] The system according to claim 1, further comprising a device for storing the aforementioned recorded data on an information management infrastructure and synchronizing it with other information processing terminals.

[0922] (Claim 3)

[0923] The system according to claim 1, further comprising a device that provides a user interface capable of editing recorded data based on the analysis results of the aforementioned audio and image.

[0924] "Application example 2 when combining with an emotional engine"

[0925] (Claim 1)

[0926] A means of inputting voice,

[0927] means for converting the aforementioned audio into text data,

[0928] A means for generating diary data by performing natural language generation based on the aforementioned text data,

[0929] A means for inputting an image and analyzing the image to identify a specific event,

[0930] A means for analyzing emotions based on the aforementioned voice and adding emotional information to the aforementioned diary data,

[0931] A means of analyzing the overall nature of the diary data and providing emotional and cognitive support by utilizing a new information processing device,

[0932] A system that includes means for automatically generating records using multiple digital devices and storing childcare records from various perspectives.

[0933] (Claim 2)

[0934] The system according to claim 1, further comprising means for saving the aforementioned diary data on the cloud and synchronizing it with other devices.

[0935] (Claim 3)

[0936] The system according to claim 1, further comprising means for providing a user with an interface that allows editing of diary data based on the results of the analysis of the aforementioned audio and images.

[Explanation of Symbols]

[0937] 10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots

Claims

1. A system comprising:
means for inputting voice;
means for converting the voice into text data;
means for generating diary data by performing natural language generation based on the text data;
means for inputting an image and analyzing the image to identify a specific event;
means for analyzing emotion based on the voice and adding emotion information to the diary data;
means for a mobile device to patrol a home and record audio and images in real time;
means for sending the generated diary data to the cloud; and
means for making the diary data accessible from a remote location.

2. The system according to claim 1, further comprising means for synchronizing diary data with other computing devices.

3. The system according to claim 1, further comprising means for providing a user with a display device capable of editing diary data based on the analysis results of the aforementioned audio and image.