Information processing device, information processing method, program, and monitoring system

The information processing device addresses the impersonality of conventional voice synthesis by storing caregiver voice models and using GPUs and neural networks to deliver personalized messages, enhancing child comprehension and reducing anxiety.

JP2026109523A · Pending · Publication Date: 2026-07-01 · MIXI INC

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
MIXI INC
Filing Date
2025-08-12
Publication Date
2026-07-01

AI Technical Summary

Technical Problem

Conventional voice synthesis technologies in child-oriented monitoring systems produce mechanical and impersonal voices that fail to convey the nuances of a parent's love and concern, causing anxiety in children and hindering effective message delivery.

Method used

An information processing device that stores caregiver account information and a unique voice model, synthesizes non-voice messages into personalized voice messages, and transmits them to the child's terminal, utilizing GPUs for efficient speech synthesis and neural networks to replicate the caregiver's voice quality.

Benefits of technology

Children receive messages in a familiar voice, reducing anxiety and enhancing message understanding, with the system continuously improving voice quality through learning and adapting to the caregiver's voice over time.

✦ Generated by Eureka AI based on patent content.


Abstract

[Problem] To provide an information processing device and a monitoring system that improve the quality of voice messages played back on the terminal used by the monitored person in a monitoring service. [Solution] In the monitoring system, the information processing device 10 includes a storage unit 11 that stores the caregiver's account information and the voice model 50 in association with each other, a receiving unit 12 that receives a non-voice message 52 and account information from the caregiver's terminal 20, a voice message generation unit 13 that synthesizes the non-voice message 52 into voice and generates a first voice message 53 by applying the voice model 50 associated with the account information, and a transmitting unit 14 that transmits the first voice message 53 to the monitored person's terminal 30. The generated voice message reflects the caregiver's unique voice quality. [Effect] Children can receive messages in their parents' familiar voices, eliminating the anxiety associated with traditional robotic voices.

Description

Technical Field

[0001] The present disclosure relates to voice message generation technology in a monitoring service, and particularly to improving the quality of voice messages played on a monitored person's terminal.

Background Art

[0002] Conventionally, child-oriented monitoring GPS terminals are known to provide a function that converts text messages sent from a guardian's smartphone into speech by text-to-speech (TTS) synthesis and plays them on the terminal. For example, Patent Document 1 describes a system in which a message can be selected and transmitted as either a voice message or a text message.

[0003] Conventional monitoring services implement a function that converts a text message sent from a guardian into voice using a general-purpose voice synthesis engine and plays it on the child's terminal. This voice synthesis uses standard TTS technology, in which voice is generated mechanically from a pre-prepared acoustic model.

Prior Art Documents

Patent Documents

[0004]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0005] However, the voices generated by conventional speech synthesis technology are mechanical and impersonal, which can cause anxiety in children receiving messages and prevent the intended meaning of the message from their parents from being properly conveyed. Even if parameters such as speaking speed and pitch are adjusted in conventional speech synthesis technology, the lack of a speaker-like quality means that children still perceive it as a "message from a machine," making it difficult to convey the delicate nuances of a parent's love and concern.

Means for Solving the Problems

[0006] To solve the above problems, the information processing device of the present disclosure includes: a storage unit that stores account information of a caregiver and a voice model unique to the caregiver in association with each other; a receiving unit that receives a non-voice message input on the caregiver terminal and the account information of the caregiver who input the non-voice message; a voice message generation unit that synthesizes the non-voice message into voice by applying the voice model associated with the account information to generate a first voice message; and a transmitting unit that transmits the first voice message to the monitored person's terminal.

Effects of the Invention

[0007] According to this disclosure, children can receive voice messages in a familiar voice of their guardian, thus eliminating the anxiety associated with traditional robotic voices and allowing them to understand the message content with peace of mind.

Brief Description of the Drawings

[0008]
[Figure 1] A diagram showing the overall configuration of the monitoring system according to an embodiment of this disclosure.
[Figure 2] A block diagram showing an example of the hardware configuration of the information processing device according to an embodiment of this disclosure.
[Figure 3] A diagram showing the functional block configuration of the information processing device according to an embodiment of this disclosure.
[Figure 4] A block diagram showing an example of the hardware configuration of the monitoring terminal according to an embodiment of this disclosure.
[Figure 5] A diagram showing the functional block configuration of the monitoring terminal according to an embodiment of this disclosure.
[Figure 6] A diagram showing an example of the main screen of the monitoring terminal according to an embodiment of this disclosure.
[Figure 7] A diagram showing an example of the history confirmation screen of the monitoring terminal according to an embodiment of this disclosure.
[Figure 8] A diagram showing an example of the message sending screen of the monitoring terminal according to an embodiment of this disclosure.
[Figure 9] A diagram showing the voice expression parameter setting screen according to an embodiment of this disclosure.
[Figure 10] A block diagram showing an example of the hardware configuration of the monitored terminal according to an embodiment of this disclosure.
[Figure 11] A diagram showing the functional block configuration of the monitored terminal according to an embodiment of this disclosure.
[Figure 12] A diagram showing an example of the UI screen of the administrator dashboard according to an embodiment of this disclosure.
[Figure 13] A flowchart showing the main processing flow according to an embodiment of this disclosure.
[Figure 14] A flowchart showing the voice learning processing flow according to an embodiment of this disclosure.
[Figure 15] A flowchart showing the initial setup and voice registration processing flow according to an embodiment of this disclosure.
[Figure 16] A flowchart showing the message reception and playback processing flow according to an embodiment of this disclosure.
[Figure 17] A flowchart showing the voice reply and transmission processing flow according to an embodiment of this disclosure.
[Figure 18] A flowchart showing the processing flow of the proactive suggestion function according to an embodiment of this disclosure.
[Figure 19] A flowchart showing the processing flow of the emergency voice enhancement function according to an embodiment of this disclosure.
[Figure 20] A flowchart showing the processing flow of the automatic retransmission function according to an embodiment of this disclosure.
[Figure 21] An operation sequence diagram of the proactive suggestion function.
[Figure 22] A diagram showing a configuration example for multiple guardians.
[Figure 23] A conceptual diagram of environment-adaptive sound adjustment.
[Figure 24] An explanatory diagram of child-profile-dependent processing.
[Figure 25] An operation diagram of the emergency voice emphasis function.
[Figure 26] A sequence diagram of the automatic retransmission function.
[Figure 27] A block diagram of voice quality improvement processing.
[Figure 28] A diagram showing the types and data formats of non-voice messages.
[Figure 29] A diagram showing an example of the account management screen of the guardian terminal according to an embodiment of this disclosure.

Modes for Carrying Out the Invention

[0009] Hereinafter, embodiments of the present disclosure will be described in detail with reference to the drawings. Note that the following embodiments do not limit the present disclosure, and not all of the components described in the embodiments and their combinations are essential.

[0010] First, the overall configuration of the monitoring system according to the present embodiment will be described. FIG. 1 is a diagram showing the overall configuration of a monitoring system 1 according to the first embodiment of the present disclosure. The monitoring system 1 includes an information processing device 10 (server device), a plurality of guardian terminals 20, a plurality of ward terminals 30, and a network 40 (Internet) connecting these.

[0011] The information processing device 10 may be composed of a single server computer, or may be configured as a distributed system in which a plurality of server computers cooperate. It may also be realized in a cloud computing environment. The guardian terminal 20 is a smartphone, a tablet terminal, a personal computer, etc. respectively held by a plurality of guardians such as a guardian P1 (mother), a guardian P2 (father), and a guardian P3 (grandparent). The ward terminal 30 is a dedicated GPS terminal carried by children C1, C2, etc.

[0012] The primary purpose of this monitoring service is to ensure the safety of children when they are going to and from school or when they are out. As a basic operation of the service, the monitored terminal 30 periodically (e.g., every 1-3 minutes) transmits its location information to the information processing device 10. Parents can check their child's current location in real time on a map through a dedicated application on the monitoring terminal 20. Furthermore, notifications are automatically sent to parents when the child arrives at or departs from a pre-set specific area (home, school, cram school, etc.).
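As an illustrative, non-limiting sketch of the arrival and departure notifications described above, the following Python code compares two consecutive location fixes against a circular pre-set area. The circular area shape, the `Area` record, the function names, and the coordinates are assumptions made for illustration; the disclosure does not specify how an area is represented.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Area:
    """A pre-set area (home, school, etc.) modelled as a circle (assumption)."""
    name: str
    lat: float
    lon: float
    radius_m: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def area_event(area, prev_fix, new_fix):
    """Return 'arrived', 'departed', or None for two consecutive (lat, lon) fixes."""
    was_inside = haversine_m(*prev_fix, area.lat, area.lon) <= area.radius_m
    is_inside = haversine_m(*new_fix, area.lat, area.lon) <= area.radius_m
    if is_inside and not was_inside:
        return "arrived"
    if was_inside and not is_inside:
        return "departed"
    return None


# Hypothetical usage: a 100 m area and two fixes taken a few minutes apart.
school = Area("school", 35.6586, 139.7454, 100.0)
print(area_event(school, (35.6700, 139.7454), (35.6587, 139.7455)))
```

With fixes arriving every 1 to 3 minutes, a server could run such a comparison on each new fix and send a notification only when the result is not `None`.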

[0013] The present invention is particularly relevant to the messaging function in this monitoring service. Parents can send messages to their children from the monitor terminal 20, and the children receive these messages on the monitored terminal 30. Conventionally, these messages were played back as mechanical voices using general-purpose speech synthesis technology, but with the present invention, they are played back as natural voices that reflect the parent's own voice quality.

[0014] Figure 2 is a block diagram showing an example of the hardware configuration of the information processing device 10 according to this embodiment. The information processing device 10 is implemented, for example, as a general-purpose server computer.

[0015] The information processing device 10 includes a processor 10a (CPU) that comprehensively controls the operation of the entire device, memory 10b (RAM) which is the main memory, auxiliary storage device 10c (SSD, HDD, etc.) which stores the OS and programs, a communication interface 10d (NIC, etc.) for connecting to the network 40, and a bus 10f which connects each part.

[0016] In particular, in this embodiment, in order to perform high-quality speech synthesis processing at high speed, it is preferable that the information processing device 10 is equipped with one or more GPUs (Graphics Processing Units) 10e specialized for parallel processing. The GPUs 10e are used to efficiently perform calculations on the speech synthesis model using a neural network by the speech message generation unit 13 (described later).

[0017] The auxiliary storage device 10c stores programs for implementing each of the functional units described later, as well as the caregiver's account information, voice model, and various historical data. These programs are loaded into memory 10b by the processor 10a and executed, thereby realizing the various functions described later.

[0018] Figure 3 is a diagram showing the functional block configuration of the information processing device 10 in detail. The information processing device 10 has the function of performing the core processing of the monitoring system 1. The information processing device 10 comprises a storage unit 11, a receiving unit 12, a voice message generation unit 13, a transmission unit 14, a UI control unit 15, a voice learning unit 16, and an analysis unit 17. Each of these units is a functional unit realized by the execution of a program by the processor 10a.

[0019] The storage unit 11 stores the caregiver's account information 111, the voice model 50, the monitored person's profile information 112, the message history 113, and the location information history 114. The account information 111 includes the caregiver's identification ID, name, contact information, and facial photograph.

[0020] The receiving unit 12 receives various data (non-voice messages, voice messages, location information, account information, voice expression parameters, etc.) transmitted from the caregiver terminal 20 and the monitored person terminal 30. The voice message generation unit 13 generates voice messages that reflect the caregiver's voice quality based on the received information. The transmission unit 14 transmits the generated voice messages and various control signals to the corresponding terminals. The UI control unit 15 controls the UI screen displayed on the caregiver terminal 20 and suggests messages according to the situation. The voice learning unit 16 generates and updates the voice model 50 based on the voice data transmitted from the caregiver. The analysis unit 17 analyzes the voice messages transmitted from the child and their surrounding environment. Detailed operation of each functional block will be described later.

[0021] Figure 4 is a block diagram showing an example of the hardware configuration of the monitoring terminal 20 according to this embodiment. The monitoring terminal 20 can be implemented as, for example, a smartphone, a tablet device, or a personal computer.

[0022] The monitoring terminal 20 includes a processor 20a, memory 20b, auxiliary storage device 20c (flash memory, etc.), wireless communication unit 20d (5G / LTE / Wi-Fi / Bluetooth, etc.), display unit 20e (organic EL display, liquid crystal display, etc.), and a touch panel 20f integrated with the display unit 20e.

[0023] Furthermore, the caregiver terminal 20 includes a voice input unit 20g (microphone) for inputting voice messages (third voice message 55) and voice commands from the guardian, a voice output unit 20h (speaker) for outputting various sounds, a location information positioning unit 20i (GPS receiver) for determining the current location, and an imaging unit 20j (camera) used for user authentication, etc.

[0024] The auxiliary storage device 20c has a dedicated application installed for using the monitoring service of this embodiment. The processor 20a executes this dedicated application, thereby realizing the various functions described later.

[0025] Figure 5 shows the functional block configuration realized by the dedicated application installed on the monitor terminal 20.

[0026] The monitoring terminal 20 includes, as functional units, a display control unit 251, an operation reception unit 252, a message construction unit 253, a voice recording unit 254, a communication control unit 255, a location information display unit 256, and a profile management unit 257.

[0027] The display control unit 251 displays various UI screens, described later, on the display unit 20e. The operation reception unit 252 receives user input via the touch panel 20f. The message construction unit 253 constructs a non-voice message 52 to be sent to the information processing device 10 based on the user's input (text, selection of a standard message, selection of a stamp). The voice recording unit 254 records the parent's voice input from the voice input unit 20g and generates a third voice message 55. The communication control unit 255 sends and receives various data to and from the information processing device 10 via the wireless communication unit 20d. The location information display unit 256 displays the child's location information received from the information processing device 10 on a map. The profile management unit 257 sets and manages the parent's own account information and the profile information of the child being monitored.

[0028] Figure 10 is a block diagram showing an example of the hardware configuration of the monitored person terminal 30 according to this embodiment. The monitored person terminal 30 is a dedicated terminal designed to be small, lightweight, and robust so that a child can easily carry it.

[0029] The monitored terminal 30 includes a processor (CPU, MCU, etc.) 30a that controls the entire terminal, an auxiliary storage device (flash memory, etc.) 30b that stores firmware and data, a communication module 30c (e.g., LPWA such as LTE-M or NB-IoT) for connecting to the network 40, and a GPS receiver 30d that determines its own location.

[0030] The device also includes an audio output unit 31 (speaker) for playing back the first voice message 53 received from the information processing device 10, an audio input unit 32 (microphone) for recording the child's reply voice (second voice message 54), physical operation buttons 30e for various operations, and a small display 35 (small LCD or LED indicator, etc.) for simply displaying the time and status. Furthermore, it has a vibration motor 30f for generating vibrations for notifications, and a power supply unit 30g (rechargeable battery) for supplying power to the terminal. The size of the terminal is, for example, about 50mm wide x 50mm high x 21mm thick, and it weighs about 60g.

[0031] Figure 11 shows the functional block configuration of the monitored person terminal 30. The monitored person terminal 30 realizes various functions by having the CPU 30a execute firmware stored in the flash memory 30b.

[0032] The functional units include a control unit 311, a location information positioning unit 312, a communication unit 313, a voice processing unit 314, an input processing unit 315, a notification unit 316, and a power management unit 317.

[0033] The control unit 311 controls the operation of the entire terminal. The location information positioning unit 312 controls the GPS receiver 30d to periodically acquire location information. The communication unit 313 transmits and receives data with the information processing device 10 via the communication module 30c. The voice processing unit 314 processes the playback of received voice data and the recording of voice input from the microphone. The input processing unit 315 detects when the operation button 30e is pressed and requests the control unit 311 to perform the corresponding processing (SOS notification, start of reply recording, etc.). The notification unit 316 controls the display on the small display 35, the lighting and flashing of the LED, and the driving of the vibration motor 30f. The power management unit 317 monitors the battery level and controls the power supply to each functional unit to enable long-term operation.

[0034] When the dedicated application is launched on the monitor terminal 20, the main screen 700 shown in Figure 6 is displayed.

[0035] The main screen 700 displays a child selection tab 701 for selecting the child to be monitored (in this example, "Hanako"), a status display area 702 showing the child's current status (e.g., "On their way home from school"), and a map display area 703 showing the child's current location in real time. At the bottom of the screen are a message send button 704 for sending messages, a history check button 705 for reviewing past interactions, and a settings button 706 for making various settings.

[0036] When a parent taps the history confirmation button 705, the screen transitions to the history confirmation screen 710 shown in Figure 7. The history confirmation screen 710 displays the history of messages sent and received between the parent and child in chronological order. Each message displays the sender icon 711, the message content (in text format) 712, and the time it was sent 713. Voice messages are accompanied by a play button 714, which can be tapped to listen to them again at any time.

[0037] Tapping the message send button 704 on the main screen 700 transitions to the message sending screen 800 shown in Figure 8.

[0038] The message sending screen 800 includes a text input field 801 for entering free-form text, a list of pre-set messages 802 for selecting frequently used messages with a single tap, and a stamp selection area 803 for expressing emotions. A voice recording button 804 is also provided for directly recording and sending voice messages.

[0039] The parent prepares the message to be sent using one of the following methods: (a) Text input: Enter any text, such as "Please contact me when you get home from school," into text input field 801 using the keyboard. (b) Select a pre-set message: Tap to select the desired message, such as "Good luck!", from the list of pre-set messages 802. (c) Stamp selection: From stamp selection area 803, tap to select a stamp such as clapping or a heart.

[0040] After preparing the message content, parents can adjust the voice tone. Tapping the "Set Voice Tone" button 805 on the message sending screen 800 will bring up the voice expression parameter setting screen 900 shown in Figure 9.

[0041] As shown in Figure 9, the voice expression parameter setting screen 900 is equipped with input elements such as an emotion selection button 601, a speech speed adjustment slider 602, and a volume adjustment bar 603.

[0042] Parents can tap emotion selection buttons 601, such as "gentle" or "encouraging," adjust the speaking speed by moving the speaking speed adjustment slider 602 left or right, and adjust the volume using the volume adjustment bar 603. This allows parents to intuitively convey the emotions and nuances they want to communicate through voice.

[0043] Once the settings are complete, tap the "Settings" button 904 to return to the message sending screen 800, and finally tap the "Send" button 806. This sends the entered non-voice message (text, standard message ID, stamp ID), account information, and configured voice expression parameters to the information processing device 10.
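By way of example only, the data sent in this step (non-voice message, account information, and voice expression parameters) might be serialized as sketched below. The JSON layout, all field names, the numeric ranges for speed and volume, and the `build_payload` helper are hypothetical; only the three message kinds and the emotion labels "gentle" and "encouraging" come from the description.

```python
import json
from dataclasses import dataclass, asdict

# Emotion labels named in the description; a real service may offer more.
EMOTIONS = {"gentle", "encouraging"}
# The three non-voice message kinds: free text, standard message ID, stamp ID.
MESSAGE_KINDS = {"text", "standard_message_id", "stamp_id"}


@dataclass
class VoiceExpressionParams:
    """Voice expression parameters 51 (scales and ranges are assumptions)."""
    emotion: str = "gentle"
    speed: float = 1.0   # 1.0 = normal speaking speed
    volume: float = 1.0  # 0.0 (silent) to 1.0 (full)

    def __post_init__(self):
        if self.emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {self.emotion}")
        if not 0.5 <= self.speed <= 2.0:
            raise ValueError("speed out of range")
        if not 0.0 <= self.volume <= 1.0:
            raise ValueError("volume out of range")


def build_payload(account_id, kind, content, params):
    """Serialize one non-voice message 52 plus sender account and parameters."""
    if kind not in MESSAGE_KINDS:
        raise ValueError(f"unknown message kind: {kind}")
    body = {
        "account_id": account_id,
        "message": {"kind": kind, "content": content},
        "voice_expression": asdict(params),
    }
    return json.dumps(body, ensure_ascii=False, sort_keys=True)


payload = build_payload(
    "acct-001", "text",
    "Please contact me when you get home from school",
    VoiceExpressionParams(emotion="gentle", speed=0.9, volume=0.8),
)
print(payload)
```

Validating the parameters on the terminal side, as in `__post_init__`, is one possible design choice; the same checks could equally be performed by the receiving unit 12.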

[0044] The monitored device 30 has a simple interface so that children can operate it intuitively. (a) Power ON / OFF: Turn the power on or off by pressing and holding the power button on the side. (b) Message reception and playback: When a message arrives from a parent, the device vibrates or emits a notification sound and the LED flashes. Pressing the central play button plays the message in the parent's voice. (c) Voice Reply: After the message is played, or at any time, the microphone is activated while the reply button is pressed and held. The child can record a reply message by speaking into the device while holding down the button. When the button is released, the recording ends and is automatically sent to the parent / guardian. (d) Emergency SOS notification: When the emergency SOS button located on the side is pressed and held for more than 3 seconds, an emergency notification is automatically sent to the guardian's monitoring terminal 20 and pre-registered contacts, along with the current location information. At this time, the terminal also has the function of automatically recording ambient sounds and sending them to the information processing device 10.
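The button behavior listed above can be sketched, under stated assumptions, as a function that maps a press and release to an action. The button names, action strings, and timestamps in seconds are hypothetical; the 3-second SOS threshold is taken from the description, while the power button is omitted because its hold time is not specified.

```python
SOS_HOLD_SECONDS = 3.0  # "pressed and held for more than 3 seconds"


def classify_press(button, pressed_at, released_at):
    """Map one press/release pair (timestamps in seconds) to an action name."""
    held = released_at - pressed_at
    if held < 0:
        raise ValueError("release precedes press")
    if button == "sos":
        # Short presses are ignored so that accidental touches do not alert anyone.
        return "send_sos" if held > SOS_HOLD_SECONDS else "ignore"
    if button == "reply":
        # Recording runs while the button is held and is sent on release.
        return "send_reply_recording" if held > 0 else "ignore"
    if button == "play":
        return "play_message"
    return "ignore"


print(classify_press("sos", 0.0, 3.5))     # held longer than the threshold
print(classify_press("reply", 10.0, 12.0))  # two-second reply recording
```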

[0045] The information processing device 10 provides a web browser-based management dashboard for system administrators. Figure 12 shows an example of the UI screen of the administrator dashboard.

[0046] The administrator dashboard 1200 displays various metrics that provide an overview of the system's operating status. For example, it includes a user statistics area 1201 showing the current total number of users and active users, a server resource monitoring area 1202 showing the CPU and GPU usage and memory usage of the information processing device 10, a processing performance area 1203 showing the number of messages processed per unit time and the average response time, and an error monitoring area 1204 showing the error rate and type of errors. Through this screen, administrators can monitor the health of the service and, if necessary, increase server resources or investigate the cause of failures.

[0047] Figure 29 shows an example of the account management screen 1300 displayed on the monitor terminal 20. Parents can access this account management screen 1300 by tapping the settings button 706, etc., located on the main screen 700 (Figure 6).

[0048] The account management screen 1300 displays a list 1301 of all accounts registered as caregivers for a single person being cared for (a child). Each account entry displays the caregiver's icon 1302, name, and relationship (e.g., "Grandma") 1303.

[0049] Furthermore, each account item is provided with a status display section 1304 that shows the registration status of the voice model. For example, if a voice model is registered, the text "Registered" and a checkmark icon are displayed, and if it is not registered, the text "Not Registered" and a warning icon are displayed. This allows the primary caregiver to see at a glance whose messages will be played in their own voice and whose messages will be played in a generic voice. This information is also useful when explaining to a child in advance, for example, "Grandma hasn't registered her voice yet, so her messages will be delivered in a machine voice."

[0050] Next to accounts whose status is "Not Registered," an action button (e.g., "Request Registration" button 1305) is placed to request registration.

[0051] When a parent taps the "Request Registration" button 1305, the caregiver terminal 20 sends the operation information (including the identification information of the target account) to the information processing device 10. Upon receiving this information, the UI control unit 15 of the information processing device 10 instructs the transmission unit 14 to send a notification (push notification, email, etc.) to the contacts associated with the target account (email address or caregiver terminal used by the parent) to encourage registration of the voice model. The notification may include a link to directly transition to the voice registration screen.

[0052] This series of features will encourage the registration of voice models by the entire family and all stakeholders, increasing the likelihood that users will be able to enjoy the core value of this system—receiving messages in a familiar voice—in all forms of communication.

[0053] Figure 13 is a flowchart showing the main processing flow in this embodiment. Each process will be described in detail below.

[0054] The guardian terminal 20 executes a process to receive message input from the guardian (S1301). This input process includes text input, selection of a pre-set message, or selection of a stamp. For example, it is configured to accept a text message such as "Let me know when you get home from school" via keyboard input, a pre-set message such as "Good luck!" selected from a pull-down menu, and a clapping stamp selected by tapping.

[0055] Furthermore, the UI control unit 15 of the information processing device 10 displays a screen for setting voice expression parameters on the caregiver terminal 20 (S1302). This screen display may be performed before, during, or after message input. The caregiver selects an appropriate emotional tone and speaking speed, taking into account the content of the message to be sent and the child's situation.

[0056] Subsequently, the communication unit of the monitor terminal 20 transmits the input non-voice message 52, account information, and voice expression parameters 51 to the information processing device 10 (S1303). The communication unit uses a secure communication protocol such as HTTPS and encrypts the data.

[0057] The receiving unit 12 of the information processing device 10 receives the transmitted data (S1304). The receiving unit 12 is equipped with a checksum verification function to verify the integrity of the data and a retransmission request function in case of an error.
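A minimal sketch of the integrity check and retransmission request follows. The disclosure does not name a checksum algorithm; SHA-256 from the Python standard library is used here purely as an assumption, and the status strings and function names are hypothetical.

```python
import hashlib


def checksum(body: bytes) -> str:
    """Hex digest the sender attaches to the message body (SHA-256 is an assumption)."""
    return hashlib.sha256(body).hexdigest()


def receive(body: bytes, claimed_checksum: str) -> dict:
    """Accept the body if its checksum matches; otherwise ask for retransmission."""
    if checksum(body) == claimed_checksum:
        return {"status": "accepted"}
    return {"status": "retransmit_requested"}


body = "Let me know when you get home from school".encode("utf-8")
print(receive(body, checksum(body)))             # intact data
print(receive(body + b"\x00", checksum(body)))   # corrupted in transit
```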

[0058] The voice message generation unit 13 searches the storage unit 11 based on the received account information and obtains the corresponding voice model 50 (S1305). Even if multiple caregivers are registered, the system is configured to uniquely identify the voice model using the account ID as the key.
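The lookup keyed by account ID can be illustrated as follows. The dictionary-based store, the identifiers, and the fallback constant are assumptions; the fallback to a generic voice for a caregiver without a registered voice model mirrors the behavior described for the account management screen.

```python
from typing import Dict, Tuple

GENERIC_VOICE = "generic-tts"  # hypothetical identifier for the general-purpose voice


def select_voice(voice_models: Dict[str, str], account_id: str) -> Tuple[str, bool]:
    """Return (voice model reference, is_personal) for the sending account."""
    model = voice_models.get(account_id)
    if model is None:
        # No voice model registered for this caregiver: use the generic voice.
        return GENERIC_VOICE, False
    return model, True


# Hypothetical store: mother (P1) and father (P2) registered, grandparent (P3) not.
models = {"P1": "voice-model-P1", "P2": "voice-model-P2"}
print(select_voice(models, "P2"))
print(select_voice(models, "P3"))
```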

[0059] Next, the voice message generation unit 13 applies the acquired voice model 50 and voice expression parameters 51 to generate a first voice message 53 from the non-voice message 52 (S1306). The voice synthesis process includes the stages of text analysis, prosodic generation, acoustic feature generation, and waveform generation. At each stage, the voice model 50 and voice expression parameters 51 are applied to reflect the parent's voice quality and specified emotional expression.
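The four stages named above can be outlined as a chain of functions through which the voice model and the voice expression parameters are passed. This is a toy sketch only: each stage is a trivial placeholder (a single sine tone per word) standing in for the neural-network processing, and every class, field, and function name is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VoiceModel:
    """Stand-in for voice model 50; a real model holds learned parameters."""
    base_f0_hz: float  # toy proxy for the caregiver's typical pitch


@dataclass
class ExpressionParams:
    """Stand-in for voice expression parameters 51."""
    speed: float = 1.0
    volume: float = 1.0


def analyze_text(text: str) -> List[str]:
    """Stage 1, text analysis (toy): split into lower-case word units."""
    return text.lower().split()


def generate_prosody(units, model, params) -> List[Tuple[float, float]]:
    """Stage 2, prosody (toy): (duration_s, f0_hz) per unit; higher speed shortens."""
    return [(0.1 * len(u) / params.speed, model.base_f0_hz) for u in units]


def generate_acoustic_features(prosody, params):
    """Stage 3, acoustic features (toy): attach an amplitude to each unit."""
    return [(dur, f0, params.volume) for dur, f0 in prosody]


def generate_waveform(features, sample_rate=8000) -> List[float]:
    """Stage 4, waveform (toy): render each unit as a sine tone."""
    samples = []
    for dur, f0, amp in features:
        n = int(round(dur * sample_rate))
        samples.extend(amp * math.sin(2 * math.pi * f0 * i / sample_rate) for i in range(n))
    return samples


def synthesize(text, model, params, sample_rate=8000):
    units = analyze_text(text)
    prosody = generate_prosody(units, model, params)
    features = generate_acoustic_features(prosody, params)
    return generate_waveform(features, sample_rate)


wave = synthesize("Let me know", VoiceModel(base_f0_hz=220.0), ExpressionParams(speed=1.0, volume=0.5))
print(len(wave), "samples")
```

The point of the sketch is the data flow: the model influences pitch, the parameters influence duration and loudness, and each stage consumes only the previous stage's output.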

[0060] The transmitting unit 14 transmits the generated first voice message 53 to the monitored person's terminal 30 (S1307). The transmitting unit 14 is equipped with a push notification function to enable real-time message delivery.

[0061] The monitored person's terminal 30 plays the received first voice message 53 from the voice output unit 31 (S1308). The voice output unit 31 is equipped with an acoustic processing function that takes into account the auditory characteristics of children. As a result, children can hear messages in the familiar voice of their guardian, and the psychological resistance to mechanically synthesized voices is eliminated.

[0062] Figure 14 is a flowchart of the voice learning processing flow. The voice learning unit 16 has the function of continuously improving the parent's voice model.

[0063] In step S1401, the caregiver terminal 20 accepts a voice message (third voice message 55) recorded by the guardian. Recording takes place either through the voice recording option used when sending a normal message or through voice registration during initial setup.

[0064] In step S1402, the receiving unit 12 receives the third voice message 55 and forwards it to the voice learning unit 16. The receiving unit 12 checks the quality of the voice data and filters out low-quality data that is unsuitable for learning.
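One possible form of such a quality check is sketched below. The disclosure does not state which criteria are used; minimum duration, minimum RMS level, and a cap on clipped samples are assumptions chosen for illustration, as are the threshold values and the function name. Samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def is_usable_for_learning(samples, sample_rate,
                           min_seconds=1.0, min_rms=0.01, max_clip_ratio=0.01):
    """Return True if a recording passes simple length, level, and clipping checks."""
    if sample_rate <= 0 or not samples:
        return False
    duration = len(samples) / sample_rate
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    clipped = sum(1 for s in samples if abs(s) >= 0.999) / len(samples)
    return duration >= min_seconds and rms >= min_rms and clipped <= max_clip_ratio


# Hypothetical inputs: a 2-second tone and 2 seconds of silence at 8 kHz.
tone = [0.3 * math.sin(2 * math.pi * 200 * i / 8000) for i in range(16000)]
silence = [0.0] * 16000
print(is_usable_for_learning(tone, 8000), is_usable_for_learning(silence, 8000))
```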

[0065] In step S1403, the voice learning unit 16 extracts acoustic features from the received voice data. The extracted features include the time series of mel-frequency cepstral coefficients (MFCCs), the fundamental frequency trajectory, and the power envelope.
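Two of the listed features can be illustrated with standard-library Python. The sketch below computes a frame-wise power envelope and estimates the fundamental frequency of one frame by picking the autocorrelation peak, which is a common textbook method and not necessarily the one used in this embodiment. MFCC extraction is not shown. The frame sizes, the 80 to 400 Hz search range, and the function names are assumptions.

```python
import math


def power_envelope(samples, frame_len, hop):
    """Mean squared amplitude of each frame, stepping by `hop` samples."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, hop)
    ]


def estimate_f0(frame, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate F0 in Hz as sample_rate / (lag of the autocorrelation peak)."""
    lo = int(sample_rate / fmax)
    hi = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_acf = 0, 0.0
    for lag in range(lo, hi + 1):
        acf = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if acf > best_acf:
            best_lag, best_acf = lag, acf
    return sample_rate / best_lag if best_lag else None


# Hypothetical input: a 200 Hz tone sampled at 8 kHz.
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(1600)]
print(estimate_f0(tone[:400], sr))
print(power_envelope(tone, frame_len=400, hop=200)[:2])
```

Applying `estimate_f0` to successive frames yields a fundamental frequency trajectory of the kind referred to in the text.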

[0066] In step S1404, the voice learning unit 16 updates the existing voice model 50 using the extracted features. Transfer learning is used for the update, which enables effective model improvement even with a small amount of additional data.

[0067] In step S1405, the storage unit 11 saves the updated voice model 50. As a result, voice quality improves the longer the system is used, and the generated voice becomes more natural and closer to the parent's own.

[0068] Figure 15 is a flowchart showing the initial setup and voice model registration process when a parent uses the application for the first time on the monitoring terminal 20.

[0069] First, the parent launches the application and registers an account or logs in (S1501). Next, the parent pairs the child's monitored terminal 30 with the monitoring terminal 20 (S1502). Pairing is performed, for example, by reading a 2D barcode displayed on the monitored terminal 30 with the camera of the monitoring terminal 20.

[0070] Once pairing is complete, the system prompts the user to register the voice model (S1503). The parent reads aloud, one after another, several pre-written phrases displayed on the screen (e.g., "Hello, we'll deliver a message in your voice," "Let's play in the park," etc.) (S1504). The voice recording unit 254 records the spoken voice and sends it to the information processing device 10 as a third voice message 55 (S1505).

[0071] The voice learning unit 16 of the information processing device 10 extracts the characteristics of the parent's voice from multiple received voice data and generates an initial voice model 50 (S1506). The generated voice model 50 is stored in the storage unit 11 in association with the parent's account information 111 (S1507). This completes the initial setup (S1508).

[0072] Figure 16 is a flowchart showing the processing flow when the monitored terminal 30 receives and plays an audio message from the information processing device 10.

[0073] The monitored terminal 30 waits for a push notification from the information processing device 10 while in standby mode (S1601). When a push notification is received (S1601: Yes), the device starts processing the reception of the first voice message 53 (S1602).

[0074] As soon as streaming playback of the received audio data begins (S1603), the vibration motor and LED are used to notify the child of the arrival of the message (S1604). Once audio playback is complete (S1605: Yes), the notification stops (S1606) and the device returns to standby mode (S1607).

[0075] Figure 17 is a flowchart showing the processing flow when a child makes a voice reply using the monitored terminal 30.

[0076] The system monitors whether the child has pressed the reply button 30e (S1701). If the button is pressed (S1701: Yes), the voice input unit 32 (microphone) is activated and voice recording begins (S1702).

[0077] Recording continues as long as the child keeps pressing the button (S1703). When the button is released (S1703: No), recording ends (S1704), and the recorded audio data is generated as the second voice message 54.

[0078] The generated second voice message 54 is transmitted to the information processing device 10 via the communication unit 313 (S1705). After transmission is complete, processing is terminated (S1706).

[0079] Figure 18 shows the processing flow for the proactive suggestion function, Figure 19 for the emergency voice enhancement function, and Figure 20 for the automatic retransmission function. These functions are primarily executed by the information processing device 10.

[0080] In the proactive suggestion process (Figure 18), the child's location and time information is periodically acquired (S1801, S1802), and the situation is determined (S1803). If a specific situation (e.g., near school at dismissal time) is met (S1804: Yes), a message and voice expression appropriate to the situation are suggested (S1805) and sent to the monitor terminal 20 (S1806).

[0081] In emergency voice enhancement processing (Figure 19), the urgency level is determined from the content of the message to be generated (S1903). If the urgency level is high (S1904: Yes), voice enhancement processing is applied to adjust the volume and speaking speed (S1905) before generating the voice message (S1906).

[0082] In the automatic resend process (Figure 20), after receiving the child's reply voice (second voice message) (S2002), the content is analyzed (S2003), and if it is determined that the message was not understood (e.g., "Say that again") (S2004: Yes), the previous message is adjusted to be easier to understand in terms of expression and voice (slower, clearer, etc.) (S2005) and automatically resent (S2006).

[0083] In this specification, the term "first voice message" refers to any message synthesized to reflect the voice characteristics of the parent or guardian. Depending on its source, the first voice message mainly takes the following two forms. (1) Messages based on parental input: voice messages generated by the information processing device 10 using the parent's voice model, based on non-voice messages such as text and stamps entered by the parent on the monitoring terminal 20. (2) Messages generated automatically by the system: based on the child's situation information (location information, behavioral data, etc.) obtained from the monitored person's terminal 30 and on linked external information, the information processing device 10 autonomously determines and generates the content of a message to be notified to the child, and then voices that message using the parent's voice model. This expanded concept enables detailed monitoring through the parent's voice, even in situations where the parent is not directly involved.

[0084] The memory unit 11 stores the caregiver's account information 111 and the voice model 50 in association. The "voice model" here refers to a set of parameters or a machine learning model for reproducing the voice quality of a specific speaker, and its internal representation is not limited. As an example, the voice model 50 is composed of at least one of the following: a set of parameters that express the acoustic characteristics of the speaker, information that expresses the speech style and prosodic pattern, or a latent representation acquired by machine learning.

[0085] The aforementioned "parameters that express acoustic characteristics" more specifically include the mean value and variation range of the fundamental frequency (F0), the first to third formant frequencies, spectral slope, vocal tract length parameters, glottal sound source parameters, etc. The aforementioned "information that expresses speech style and prosodic patterns" includes, for example, the frequency of accent occurrence, the distribution of pause lengths, the mean and variance of speech rate, and the characteristics of phoneme durations. The aforementioned "latent representations acquired by machine learning" include, for example, feature vectors in the intermediate layers of a neural network, latent variables of a VAE (variational autoencoder), and speaker embedding vectors.

[0086] The voice model 50 is constructed using deep learning technology. Specifically, it employs the latest neural speech synthesis technologies such as WaveNet, Tacotron2, and FastSpeech2 to accurately learn and reproduce the characteristics of the parent's voice. This makes it possible to reproduce even the subtle voice quality characteristics unique to the speaker, which are difficult to achieve with simple parameter adjustments, resulting in a level of naturalness that allows children to recognize their parent's voice without any sense of incongruity.

[0087] The voice expression parameters 51 are information for controlling the expressiveness of the generated voice message, and their format is not limited. As an example, the voice expression parameters 51 consist of at least one of the following: "information representing the type and intensity of emotion," "numerical information that controls prosodic characteristics," or "information representing the intent of speech or the purpose of communication."

[0088] The aforementioned "information representing the type and intensity of emotion" more specifically includes emotion categories such as "gentle," "stern," "encouraging," "worried," "joyful," and "sad," and their intensity expressed as a continuous value from 0 to 100. The aforementioned "numerical information controlling prosodic characteristics" includes, for example, parameter values such as speech speed (0.5x to 2.0x), volume (-10 dB to +10 dB), and voice pitch (-200 cents to +200 cents). The aforementioned "information representing the intent of speech or the purpose of communication" includes, for example, speech act types such as "persuasion," "agreement," "instruction," "question," and "confirmation."
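
The ranges listed above can be illustrated with a small validation sketch. The class name `VoiceExpressionParams`, its field names, and the default values are hypothetical and chosen only for illustration; the category names and numeric ranges are the ones given in the text.

```python
from dataclasses import dataclass

# Emotion categories and speech act types listed in the text.
EMOTIONS = {"gentle", "stern", "encouraging", "worried", "joyful", "sad"}
SPEECH_ACTS = {"persuasion", "agreement", "instruction", "question", "confirmation"}

# (field name, minimum, maximum) taken from the ranges in the text.
RANGES = (
    ("intensity", 0.0, 100.0),
    ("speed", 0.5, 2.0),
    ("volume_db", -10.0, 10.0),
    ("pitch_cents", -200.0, 200.0),
)


@dataclass(frozen=True)
class VoiceExpressionParams:
    """Hypothetical container for the voice expression parameters 51."""
    emotion: str = "gentle"
    intensity: float = 50.0      # assumed default
    speed: float = 1.0           # multiplier of the normal speech speed
    volume_db: float = 0.0
    pitch_cents: float = 0.0
    speech_act: str = "instruction"

    def __post_init__(self):
        if self.emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {self.emotion!r}")
        if self.speech_act not in SPEECH_ACTS:
            raise ValueError(f"unknown speech act: {self.speech_act!r}")
        for name, low, high in RANGES:
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
```

Rejecting out-of-range values at construction time is one possible design; clamping them to the nearest bound would be an equally plausible choice.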

[0089] As shown in Figure 9, the UI control unit 15 has a function to display an operation screen for setting voice expression parameters 51 on the caregiver terminal 20. The operation screen is arranged with input elements such as emotion selection buttons 601, speech speed adjustment sliders 602, and volume adjustment bars 603. This allows caregivers to intuitively reflect the emotions and nuances they want to convey in their voice, and to communicate even subtle feelings that are difficult to express through text alone.

[0090] Figure 22 shows an example of a configuration that supports multiple guardians. The memory unit 11 has the function of storing the account information of multiple guardians and the corresponding voice models for each.

[0091] For example, if a child C1 has a mother P1, a father P2, and a grandmother P3 registered as caregivers, the memory unit 11 manages voice models 501 for P1, 502 for P2, and 503 for P3 individually. The voice message generation unit 13 selects and applies the corresponding voice model based on the sender account information at the time of message transmission.
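
As a minimal sketch of the selection step, the lookup can be pictured as a mapping from sender account to voice model. The dictionary `VOICE_MODELS`, the string identifiers, and the function name are hypothetical stand-ins for the memory unit 11 and are not taken from the specification.

```python
# Hypothetical in-memory stand-in for the memory unit 11:
# sender account ID -> identifier of that guardian's voice model.
VOICE_MODELS = {
    "P1": "voice_model_501",  # mother
    "P2": "voice_model_502",  # father
    "P3": "voice_model_503",  # grandmother
}


def select_voice_model(sender_account: str) -> str:
    """Return the voice model associated with the sender's account."""
    try:
        return VOICE_MODELS[sender_account]
    except KeyError:
        raise LookupError(
            f"no voice model registered for account {sender_account!r}"
        ) from None
```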

[0092] The display 35 of the monitored terminal 30 has the function of displaying the icon and photo of the message sender. The displayed photo is based on image data contained in the account information 111. This allows the child to identify the sender both visually and aurally, and enables them to communicate appropriately without becoming confused, even when receiving messages from multiple guardians.

[0093] The monitored person's terminal 30 has a function to record the child's voice from the voice input unit 32 and transmit it as a second voice message 54. The voice input unit 32 is equipped with a noise-canceling function to suppress ambient noise.

[0094] The analysis unit 17 has the function of analyzing the acoustic characteristics of the received second voice message 54 and estimating the child's emotional state. The analysis unit 17 estimates emotional categories such as "energetic," "tired," "sad," and "excited" from acoustic parameters such as fundamental frequency fluctuations, speech rate, volume level, and spectral centroid. More specifically, the analysis unit 17 is equipped with a pre-trained emotion recognition model consisting of a classifier using a support vector machine and deep learning. This model is pre-trained using a large-scale voice database with emotion labels, and estimates the emotional category with a predetermined probability from the acoustic characteristics of the input second voice message 54.

[0095] The UI control unit 15 has the function of suggesting recommended values for voice expression parameters when sending the next message, based on the estimated emotional state of the child. For example, if the child's voice is estimated to sound "tired," it recommends sending the message in an "encouraging" tone. This allows the system to detect subtle emotional changes that a parent might directly perceive from the child's voice, conveying the child's psychological state to the parent who is in a remote location and prompting an appropriate response.

[0096] The information processing device 10 has a function to automatically generate and transmit guidance in the guardian's voice when it determines that a warning or guidance is necessary based on the child's situation information transmitted from the monitored person's terminal 30. The situation information here includes location information obtained from the location information positioning unit 312 and behavioral data (running, standing still, falling, etc.) obtained from an acceleration sensor (not shown) built into the terminal.

[0097] For example, the UI control unit 15 or analysis unit 17 detects when a child deviates from a pre-set safety area (geofence) or when the acceleration sensor detects the child moving in an area where danger is anticipated, such as near a road. Upon detection, the UI control unit 15 automatically generates a message such as "Please return to a safe place" or "It's dangerous, let's walk." The voice message generation unit 13 then synthesizes this message using a voice model of the child's guardian and generates it as a first voice message. The generated message is transmitted to the monitored person's terminal 30 via the transmission unit 14 and played back through the speaker.

[0098] This allows children to receive direct warnings in the familiar voice of their guardian, rather than just a warning sound or a machine voice. As a result, they are more likely to intuitively understand dangerous situations and readily follow instructions.

[0099] The information processing device 10 has a function to link with external emergency information servers, such as J-Alert and disaster prevention and crime prevention information distributed by local governments. When the receiving unit 12 receives emergency information related to the current location of the person being monitored (e.g., earthquake warning, suspicious person information), the UI control unit 15 converts that information into a simple expression that even a child can understand. For example, it converts information such as "magnitude 5.0 earthquake occurred" into a specific action instruction such as "A big earthquake is coming! Protect your head!"

[0100] Next, the voice message generation unit 13 synthesizes the converted message using the parent's voice model to generate an emergency first voice message. This message is sent to the monitored person's terminal 30 with the highest priority. As a result, the child receives specific evacuation instructions in the voice of their most trusted parent, rather than an impersonal alarm sound that could induce panic, dramatically increasing the likelihood that they will remain calm and take appropriate initial action.

[0101] The information processing device 10 has the function of understanding the guardian's situation by obtaining permission to access the guardian's calendar and schedule information via the application on the guardian's terminal 20. When a message is sent from the monitored person's terminal 30 while the guardian is in a situation where immediate response is difficult, such as "in a meeting" or "driving," the UI control unit 15 detects this.

[0102] The system then automatically generates a message explaining the parent's situation, such as, "Dad can't answer the phone right now. I'll call you back later." The voice message generation unit 13 synthesizes this message using the parent's voice model and automatically sends it back to the child's device as the first voice message. This allows the child to understand the parent's situation and feel reassured without feeling ignored. The use of the parent's voice creates a level of trust and warmth that is incomparable to a simple automated response from a system, effectively preventing misunderstandings in parent-child communication and reducing anxiety in the child.

[0103] Figure 21 is an operation sequence diagram of the proactive suggestion function. The UI control unit 15 has the function of generating message suggestions appropriate to the situation based on the location information and time information of the monitored person's terminal 30.

[0104] The system estimates the child's current activity from a combination of location and time information. The estimation uses rule-based logic such as the following:

[0105] If the child is more than 500 m away from school at 5:00 PM on a weekday, the system determines that the child is "on their way home from school" and suggests sending the message "It's almost time to go home" in a "gentle" tone. If the child is in a park area at 2:00 PM on a weekend, the system determines that the child is "playing" and suggests sending the message "Are you having fun? Don't forget to stay hydrated" in a "cheerful" tone.
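
The two example rules can be sketched as follows. This is only an illustration: the function name and arguments are hypothetical, and treating "at 5:00 PM" and "at 2:00 PM" as "during that clock hour" is an assumption, since the text does not define the time window.

```python
from datetime import datetime
from typing import Optional, Tuple


def suggest_message(
    now: datetime, distance_from_school_m: float, in_park: bool
) -> Optional[Tuple[str, str, str]]:
    """Return (situation, tone, message), or None when no rule matches."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4

    # Rule 1: weekday, 5 PM hour (assumed window), more than 500 m from school.
    if is_weekday and now.hour == 17 and distance_from_school_m > 500:
        return ("on their way home from school", "gentle",
                "It's almost time to go home.")

    # Rule 2: weekend, 2 PM hour (assumed window), inside a park area.
    if not is_weekday and now.hour == 14 and in_park:
        return ("playing", "cheerful",
                "Are you having fun? Don't forget to stay hydrated.")

    return None
```

For example, `suggest_message(datetime(2025, 8, 12, 17, 0), 800, False)` falls on a Tuesday and matches the first rule.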

[0106] The UI control unit 15 has the function of sending the generated suggestion as a push notification to the caregiver's terminal 20. The notification screen displays the suggestion message, recommended voice expression parameters, and a one-tap send button. After reviewing the suggestion, the caregiver can either make modifications as needed or send it as is.

[0107] The accuracy of the suggestions improves continuously as past message sending history and children's behavior patterns are analyzed using machine learning. This results in more accurate message suggestions that are adapted to each family's daily routine. In particular, simple text notifications or suggestions using machine-generated voices may not be taken seriously by children or may be perceived as mere instructions. In contrast, when children are addressed in a familiar, parent-like voice saying, "It's almost time to go home," they feel the presence of their parent more strongly and are more likely to accept the suggestion. This combination of technologies produces an unexpected and remarkable effect.

[0108] Figure 28 shows the types and data formats of non-voice messages. Non-voice messages 52 include three types: text data, standard message identifiers, and stamp identifiers. All non-voice messages are ultimately converted to audio and played back on the monitored person's terminal.

[0109] The pre-set message feature allows users to register frequently used messages in advance and manage them using identifiers. The system has the following pre-set messages: "Welcome home" (ID:001), "Have you finished your homework?" (ID:002), "I'll be picking you up soon" (ID:003), "Are you hungry?" (ID:004), "Go to bed early" (ID:005).

[0110] Each pre-set message has default voice expression parameters associated with it. For example, "Welcome home" is set to "gentle and slow," and "Go to bed early" is set to "slightly strict and normal speed." Parents can also add and register custom pre-set messages. Since pre-set messages are managed on the server side, management such as multilingual support and updating expressions becomes easier.

[0111] The stamp function converts visual expressions into sound. The correspondence between stamps and sound is defined as follows: clapping stamp (ID:S01) → "Clap clap clap, that's amazing!", heart stamp (ID:S02) → "I love you", smiling stamp (ID:S03) → "Smiling" (with laughing sound effect), worried face stamp (ID:S04) → "What's wrong? Are you okay?".

[0112] The voice message generation unit 13 generates a corresponding voice expression upon receiving a stamp ID. During this process, sound effects, emotional emphasis, and intonation adjustments are automatically applied. This allows for the easy sending of messages with rich emotional expression, even with simple operation.
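
The three kinds of non-voice message described above can all be reduced to a text to be synthesized. The sketch below assumes a hypothetical `(kind, payload)` representation and function name; the identifiers and phrases are the ones listed in the text, and the sound effects and intonation adjustments are left out.

```python
# Pre-set messages and stamp phrases as listed in the text.
PRESET_MESSAGES = {
    "001": "Welcome home",
    "002": "Have you finished your homework?",
    "003": "I'll be picking you up soon",
    "004": "Are you hungry?",
    "005": "Go to bed early",
}
STAMP_PHRASES = {
    "S01": "Clap clap clap, that's amazing!",
    "S02": "I love you",
    "S03": "Smiling",
    "S04": "What's wrong? Are you okay?",
}


def resolve_text(kind: str, payload: str) -> str:
    """Map a non-voice message 52 to the text that will be voiced."""
    if kind == "text":
        return payload                    # free text is used as is
    if kind == "preset":
        return PRESET_MESSAGES[payload]   # payload is a pre-set message ID
    if kind == "stamp":
        return STAMP_PHRASES[payload]     # payload is a stamp ID
    raise ValueError(f"unsupported message kind: {kind!r}")
```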

[0113] Figure 23 is a conceptual diagram of environmentally adaptive acoustic adjustment. The analysis unit 17 has the function of extracting ambient environmental information from the second voice message 54.

[0114] The analysis unit 17 estimates the ambient noise level and the type of noise by spectral analysis of the background sound. The noise level is classified into three levels, for example, quiet (below 40 dB), normal (40-60 dB), and noisy (above 60 dB). The type of noise is classified into steady noise (such as air conditioning noise), fluctuating noise (such as traffic noise), and impact noise (such as construction noise).

[0115] The voice message generation unit 13 has a function to adjust the acoustic characteristics of the first voice message 53 based on estimated environmental information. If the noise level is high, it increases the volume by 3-6 dB and applies spectral enhancement processing to improve the clarity of consonants. In addition, it selectively amplifies frequency bands that are easily masked depending on the type of noise. This ensures that children can reliably hear the message in any environment.
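
A minimal sketch of the level classification and volume adjustment is shown below. The thresholds and the 3-6 dB range come from the text; the linear ramp from 3 dB at 60 dB to 6 dB at 80 dB is an assumption made for illustration, as are the function names. Spectral enhancement and band-selective amplification are not modeled.

```python
def classify_noise_level(level_db: float) -> str:
    """Classify ambient noise into the three levels given in the text."""
    if level_db < 40:
        return "quiet"
    if level_db <= 60:
        return "normal"
    return "noisy"


def volume_boost_db(level_db: float) -> float:
    """Boost in dB: 0 unless noisy, otherwise 3 to 6 dB (assumed linear ramp)."""
    if classify_noise_level(level_db) != "noisy":
        return 0.0
    return min(6.0, 3.0 + 3.0 * (level_db - 60.0) / 20.0)


def db_to_gain(db: float) -> float:
    """Convert a dB change into a linear amplitude multiplier."""
    return 10.0 ** (db / 20.0)
```

A boost of 6 dB corresponds to roughly doubling the amplitude, which is why the conversion uses a factor of 20 rather than 10.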

[0116] Figure 24 is an explanatory diagram of the child profile processing. The memory unit 11 has the function of storing the profile information 112 of the person being monitored. The profile information 112 includes attribute information such as the child's age, grade level, comprehension level, and language development stage.

[0117] The voice message generation unit 13 has a function to automatically adjust the wording and voice expression parameters of the message based on the profile information 112. For example, if the target audience is elementary school children in the lower grades (6-8 years old), the following adjustments are made:

[0118] Regarding word substitutions, difficult expressions are automatically converted into simpler, child-friendly ones: for example, "Please go home" is converted to "Please come home," and a request to make contact is likewise converted to a plainer phrasing. Messages containing kanji are changed to prioritize hiragana readings. In addition, sentences are shortened to limit the amount of information per sentence.

[0119] Regarding adjustments to voice expression, the speaking speed is automatically set to 0.8 times the standard speed, and the pauses between phrases are extended by 20%. In addition, the intonation is raised at the end of words to create a friendly feel, and the pitch changes are exaggerated for important words to attract attention. As a result, easy-to-understand messages adapted to the cognitive development stage of children are generated, ensuring reliable information transmission.
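
The numeric part of this adjustment can be sketched as below. The function name, the base pause length of 0.3 seconds, and the use of age alone as the trigger are assumptions; the 0.8x speed and the 20% pause extension are the values given in the text.

```python
from typing import Tuple


def adjust_for_child(
    age: int, base_speed: float = 1.0, base_pause_s: float = 0.3
) -> Tuple[float, float]:
    """Return (speed multiplier, pause length in seconds) for the child's age."""
    if 6 <= age <= 8:  # lower elementary grades, per the text
        return base_speed * 0.8, base_pause_s * 1.2
    return base_speed, base_pause_s
```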

[0120] Figure 25 is an operational diagram of the emergency voice enhancement function. The voice message generation unit 13 has a function to determine the degree of urgency from the content of the message.

[0121] Urgency assessment is performed by combining three methods: keyword matching, contextual analysis, and sentiment analysis. Keyword matching detects vocabulary indicating urgency, such as "dangerous," "immediately," "urgent," and "right now." Contextual analysis analyzes the patterns of imperative and negative sentences. Sentiment analysis evaluates the degree of emotional urgency estimated from the text.

[0122] If the urgency level exceeds a predetermined threshold, the voice message generation unit 13 applies the following voice enhancement processing: increasing the volume by 6 dB from normal, slowing down the speaking speed of important phrases to 0.7 times, inserting a 0.5-second pause before and after keywords, and adding a warning sound (1 kHz beep) at the beginning of the message. This ensures that the child's attention is reliably captured even in emergencies, and that important instructions are not missed.
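
The following sketch covers only the keyword-matching part of the urgency assessment; contextual analysis and sentiment analysis are omitted. The scoring formula, the threshold value of 0.2, and the dictionary keys are assumptions, while the keywords and enhancement values are taken from the text.

```python
from typing import Dict, Optional

URGENT_KEYWORDS = ("dangerous", "immediately", "urgent", "right now")


def urgency_score(text: str) -> float:
    """Fraction of the urgency keywords that appear in the text (0.0 to 1.0)."""
    lowered = text.lower()
    hits = sum(1 for keyword in URGENT_KEYWORDS if keyword in lowered)
    return hits / len(URGENT_KEYWORDS)


def enhancement_settings(
    text: str, threshold: float = 0.2
) -> Dict[str, Optional[float]]:
    """Return enhancement values when the score exceeds the assumed threshold."""
    if urgency_score(text) > threshold:
        return {
            "volume_db": 6.0,          # +6 dB over normal
            "key_phrase_speed": 0.7,   # slow down important phrases
            "keyword_pause_s": 0.5,    # pause before and after keywords
            "beep_hz": 1000.0,         # 1 kHz warning beep at the start
        }
    return {"volume_db": 0.0, "key_phrase_speed": 1.0,
            "keyword_pause_s": 0.0, "beep_hz": None}
```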

[0123] Figure 26 is a sequence diagram of the automatic retransmission function. The analysis unit 17 has the function of analyzing the content of the second voice message 54 and evaluating its relationship with the preceding first voice message 53.

[0124] Relevance evaluation is achieved by transcribing the second voice message 54 into text using speech recognition and performing semantic analysis. For example, if the first voice message is "Let me know when you've finished your homework," and the response indicates a lack of understanding, such as "Huh? What?" or "Say that again," it is determined that resending is necessary.

[0125] If it is determined that retransmission is necessary, the voice message generation unit 13 performs retransmission with the following adjustments: slowing down the speaking speed to 0.7 times, increasing the volume by 3 dB, emphasizing important keywords, and automatically rephrasing into simpler expressions. The transmission unit 14 retransmits the adjusted voice message to the monitored person's terminal 30. This ensures that the message is reliably transmitted even in noisy environments or situations where the person is distracted.
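
As a rough illustration, the retransmission decision can be approximated with pattern matching on the transcribed reply. This replaces the semantic analysis described in the text with a much simpler technique; the patterns and function names are assumptions, and only the 0.7x speed and +3 dB values come from the text.

```python
import re
from typing import Dict

# Assumed patterns that signal the child did not understand the message.
CONFUSION_PATTERNS = (r"\bhuh\b", r"\bwhat\?", r"say that again")


def needs_resend(reply_transcript: str) -> bool:
    """True when the transcribed reply matches a confusion pattern."""
    lowered = reply_transcript.lower()
    return any(re.search(pattern, lowered) for pattern in CONFUSION_PATTERNS)


def resend_params(speed: float = 1.0, volume_db: float = 0.0) -> Dict[str, float]:
    """Apply the adjustments from the text: 0.7x speed and +3 dB volume."""
    return {"speed": speed * 0.7, "volume_db": volume_db + 3.0}
```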

[0126] The monitored person terminal 30 of this embodiment has a function to suppress power consumption by dynamically controlling the content displayed on the display 35 according to the remaining battery level of the terminal. This function is realized without compromising the quality of communication through "voice that reflects the voice quality of the guardian," which is the core of the present invention. In other words, since the monitored person can auditorily identify the sender of a message by the voice quality of the played voice, even if the display is controlled not to show images such as sender icons on the power-consuming display, the user can recognize who the message is from without any problems, resulting in a power saving effect that extends the operating time of the terminal.

[0127] The specific processing is triggered by the reception of a message. First, when the monitored terminal 30 receives the first voice message 53 from the information processing device 10, the control unit 311 checks the current battery level via the power management unit 317 prior to the voice playback process.

[0128] Next, the control unit 311 compares the confirmed battery level with a preset threshold (for example, 20%).

[0129] In one example of this embodiment, if the battery level is at or above the threshold (battery level ≥ threshold), the control unit 311 performs normal display processing. That is, in parallel with the playback of the first voice message 53, it displays image data, including a photograph of the message sender's face, on the display 35. On the other hand, if the battery level is below the threshold (battery level < threshold), the control unit 311 suppresses (skips) the display of image data on the display 35.

[0130] The threshold determination logic is not limited to the above. For example, the system may be configured to perform normal display processing when the battery level exceeds the threshold (battery level > threshold) and to suppress display processing when the battery level is at or below the threshold (battery level ≤ threshold). In this specification, "control based on battery level" is intended to encompass both of these embodiments.
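
Both threshold variants can be expressed in one small function. The function name and the `inclusive` flag are hypothetical; the 20% default is the example threshold from the text.

```python
def should_display_image(
    battery_percent: float, threshold: float = 20.0, inclusive: bool = True
) -> bool:
    """Decide whether to show the sender image alongside voice playback.

    inclusive=True  -> display when battery level >= threshold
    inclusive=False -> display when battery level >  threshold
    """
    if inclusive:
        return battery_percent >= threshold
    return battery_percent > threshold
```

The two variants differ only when the battery level equals the threshold exactly.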

[0131] In this way, by limiting the display function only when the battery level is low, as part of the processing performed for each received message, the overall operating time of the device can be extended effectively without compromising practicality. This remarkable effect is possible only because the invention reproduces the familiar voice quality of a parent. It would be difficult to achieve with a mechanically synthesized voice, because without an icon display the child would not know who the message was from and would feel anxious. In other words, the invention solves a new problem that conventional monitoring services could not solve: by relying on auditory information, namely the voice quality, it can replace or omit the power-consuming display of visual information and extend the operating time of the device.

[0132] The configurations and operations described above in (I) to (IV) constitute the first embodiment, which forms the basis of this disclosure. That is, the first embodiment is a series of mechanisms that use a voice model unique to the guardian (the caregiver) to generate, from non-voice messages, voice messages that reflect the guardian's voice quality, and that transmit them to the terminal of the child (the person being monitored).

[0133] Figure 27 is a block diagram of the voice quality improvement process. In the second embodiment, a technique for further improving voice quality will be described. The voice message generation unit 13 has a high-quality voice generation function using a neural vocoder.

[0134] Specifically, it employs generative adversarial network (GAN)-based vocoders such as WaveGAN, MelGAN, and HiFi-GAN. These vocoders generate audio waveforms directly from mel spectrograms, achieving a more natural sound quality than conventional signal-processing-based methods such as Griffin-Lim and WORLD.

[0135] The voice message generation unit 13 has a function that faithfully reproduces even high-frequency components by setting the sampling rate to 48kHz. As a result, the clarity of consonants is improved, and the distinguishability of fricative sounds such as "sa," "shi," and "su," and plosive sounds such as "ta," "te," and "to" is improved. This results in the generation of voices that are easy to hear even in noisy environments.

[0136] In the third embodiment, a technology for achieving richer emotional expression will be described. The voice message generation unit 13 has a function for implementing a multidimensional emotional expression model.

[0137] Emotions are represented in a three-dimensional space consisting of Valence (pleasure-displeasure), Arousal (arousal-calmness), and Dominance (dominance-submission). For example, "gentle" is quantified as Valence: +0.7, Arousal: -0.3, Dominance: 0.0. "Encouraging" is represented as Valence: +0.8, Arousal: +0.5, Dominance: +0.3.

[0138] The voice message generation unit 13 has a conversion table that maintains the correspondence between these three-dimensional coordinates and acoustic features. The conversion table is constructed by machine learning from a large-scale emotional voice database. When arbitrary emotional coordinates are input, the corresponding fundamental frequency fluctuation pattern, speech rate, volume change, etc., are determined. This results in the ability to naturally express complex emotions such as "I'm a little worried, but I want to encourage them" (Valence: +0.4, Arousal: 0.0, Dominance: +0.2).
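
To make the coordinate space concrete, the sketch below finds the named emotion closest to an arbitrary (Valence, Arousal, Dominance) point. It does not reproduce the learned conversion table that maps coordinates to acoustic features; the nearest-neighbor lookup and the names used are assumptions for illustration, and only the coordinate values come from the text.

```python
import math
from typing import Tuple

Vad = Tuple[float, float, float]  # (Valence, Arousal, Dominance)

# Coordinates given in the text for two named emotions.
EMOTION_COORDS = {
    "gentle": (0.7, -0.3, 0.0),
    "encouraging": (0.8, 0.5, 0.3),
}


def nearest_emotion(vad: Vad) -> str:
    """Return the named emotion with the smallest Euclidean distance to vad."""
    return min(EMOTION_COORDS, key=lambda name: math.dist(EMOTION_COORDS[name], vad))
```

For the mixed emotion in the text, (+0.4, 0.0, +0.2), the distance to "gentle" is about 0.47 and the distance to "encouraging" is about 0.65, so the lookup returns "gentle".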

[0139] In the fourth embodiment, a technique for minimizing the delay from message transmission to audio playback will be described. The voice message generation unit 13 has an optimization function that keeps the processing delay within 1 second.

[0140] The first optimization method is parallelization of the speech synthesis process. The voice message generation unit 13 divides the input text into phrase units and performs speech synthesis in parallel using multiple GPU cores. Once the synthesis of each phrase is complete, the phrases are combined in order to generate the final speech.

[0141] The second method is streaming synthesis. The transmission unit 14 has the function of sequentially starting to transmit audio data to the monitored terminal 30 as the generation of the audio data is completed. The monitored terminal 30 starts playback sequentially from the received data and starts audio output without waiting for the reception of all data to be completed.
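
The first two methods can be sketched together: phrases are synthesized concurrently and yielded in order as soon as they are ready. In this illustration a thread pool stands in for the GPU cores, `fake_synthesize` is a placeholder for the real synthesizer, and the phrase-splitting rule is an assumption.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List


def split_phrases(text: str) -> List[str]:
    """Split after '.', '!', '?' or ',' (an assumed phrase boundary rule)."""
    return [p.strip() for p in re.split(r"(?<=[.!?,])\s+", text) if p.strip()]


def fake_synthesize(phrase: str) -> bytes:
    """Placeholder synthesizer: returns dummy bytes instead of audio."""
    return phrase.encode("utf-8")


def stream_synthesis(text: str, workers: int = 4) -> Iterator[bytes]:
    """Yield audio chunks in phrase order while later phrases are still running."""
    phrases = split_phrases(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs the calls concurrently but yields results in input order,
        # so the receiver can start playback with the first chunk.
        for chunk in pool.map(fake_synthesize, phrases):
            yield chunk
```

Because the function is a generator, a caller can forward each chunk to the terminal as it arrives instead of waiting for the whole message.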

[0142] A third method is edge caching. Audio data for frequently used voice models and standard messages is cached in the internal storage of the monitored device 30. For such messages, network latency can be completely avoided. These optimizations result in a near real-time response, from the parent's sending operation to the start of audio playback on the child's device.
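
A cache of this kind could be sketched as follows. The class name, the capacity, and the least-recently-used eviction policy are assumptions; the text only states that frequently used audio is cached on the terminal.

```python
from collections import OrderedDict
from typing import Optional


class AudioCache:
    """Hypothetical LRU cache for synthesized audio on the monitored terminal 30."""

    def __init__(self, capacity: int = 32):
        self.capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None  # cache miss: the audio must come over the network
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, audio: bytes) -> None:
        self._items[key] = audio
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
```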

[0143] In the fifth embodiment, an implementation with enhanced privacy protection will be described. Since the voice data is biometric information that can identify an individual, the storage unit 11 has advanced security functions.

[0144] Differential privacy technology is applied to the storage of the voice model 50. The voice learning unit 16 has the function of adding controlled noise to the parameters of the voice model. The noise level is controlled by the privacy parameter ε (epsilon), which is usually set to ε = 1.0. This makes it computationally impossible to reverse-calculate the original voice data from the voice model.
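
The text does not name the noise mechanism. As one possible illustration, the sketch below uses the Laplace mechanism, in which the noise scale is the sensitivity divided by ε. The function name, the sensitivity value of 1.0, and the flat list of parameters are assumptions; a real system would also need to bound the sensitivity, for example by clipping.

```python
import random
from typing import List, Optional


def privatize(
    params: List[float],
    epsilon: float = 1.0,
    sensitivity: float = 1.0,   # assumed value, not given in the text
    seed: Optional[int] = None,
) -> List[float]:
    """Add Laplace noise with scale sensitivity / epsilon to each parameter."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    rate = epsilon / sensitivity  # rate of each exponential = 1 / scale
    # The difference of two independent exponential samples with the same
    # rate follows a Laplace distribution centered at zero.
    return [p + rng.expovariate(rate) - rng.expovariate(rate) for p in params]
```

A smaller ε gives a larger noise scale, which means stronger privacy at the cost of a less accurate model.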

[0145] Communication units 23 and 34 have end-to-end encryption capabilities. Voice messages are encrypted using public-key cryptography on the monitor terminal 20 and can only be decrypted using the private key on the monitored terminal 30. The information processing device 10 only relays the encrypted data and cannot access its contents. This eliminates the risk of eavesdropping by third parties and unauthorized access by server administrators.

[0146] In the sixth embodiment, an extension that supports languages other than Japanese will be described. The voice message generation unit 13 has a multilingual speech synthesis function.

[0147] The voice model 50 is structured as two separate layers: a language-dependent layer and a language-independent layer. The language-independent layer represents physiological characteristics such as the speaker's vocal tract characteristics and glottal source characteristics, and is shared across all languages. The language-dependent layer represents the phonological system, prosody rules, and so on of each language.

[0148] Supported languages include Japanese, English, Chinese (Mandarin and Cantonese), Korean, Spanish, Portuguese, Vietnamese, Tagalog, and others. A dedicated speech synthesis engine and text analyzer are provided for each language, generating natural pronunciation tailored to the characteristics of that language. This allows families formed through international marriage and foreign families residing in Japan to use the monitoring service in their native language.

[0149] In the seventh embodiment, a group function for simultaneously monitoring multiple children will be described. The storage unit 11 has a function for managing group information of the children being monitored.

[0150] Groups are created based on units such as family groups (siblings), after-school groups, and extracurricular activity groups. The monitoring terminal 20 has a function to send messages to the entire group. However, the voice message generation unit 13 generates individually optimized voice messages based on each child's profile information.

[0151] For example, when a message such as "It's snack time" is sent to two brothers simultaneously, the voice is generated at normal speed for the 8-year-old and at a slower speed for the 5-year-old. Furthermore, the playback timing on each child's terminal is adjusted: the message is sent slightly earlier to the younger child to account for differences in preparation time. This allows multiple children to be monitored efficiently while still providing personalized care that takes their individual characteristics into consideration.
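The per-child adjustment in [0150] and [0151] can be sketched as a planning step that turns one group message into individual deliveries. The age cutoff, the speed factor of 0.8, and the 60-second lead time are hypothetical values chosen only to mirror the sibling example; `ChildProfile`, `Delivery`, and `plan_group_message` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class ChildProfile:
    name: str
    age: int


@dataclass
class Delivery:
    child: str
    text: str
    speed: float       # 1.0 = normal speaking speed
    lead_seconds: int  # how much earlier than the group time to send


def plan_group_message(text, children, young_age=5):
    """Create one individually tuned delivery per child in the group."""
    plans = []
    for child in children:
        is_young = child.age <= young_age
        plans.append(
            Delivery(
                child=child.name,
                text=text,
                speed=0.8 if is_young else 1.0,       # assumed slow-down
                lead_seconds=60 if is_young else 0,   # assumed head start
            )
        )
    return plans
```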

[0152] Next, as an example of how this disclosure can be applied, we will describe the case where this disclosure is applied to a device used by a specific user.

[0153] For example, let's assume the user of the monitored device 30 is a child in the lower grades of elementary school (6-8 years old). Children in this age group are still developing their reading comprehension skills, and often have difficulty understanding kanji and katakana in particular. They also have short attention spans and find it difficult to use devices that require complex operations.

[0154] Considering these characteristics, the monitored terminal 30 in this application example should preferably have the following physical and functional features. Physically, the terminal should be miniaturized to dimensions of 50 mm (width) x 50 mm (height) x 21 mm (thickness) or less, and its weight should be reduced to 60 g or less. The housing should be covered with a silicone shock-absorbing cover and have water resistance equivalent to IPX7.

[0155] The control interface is limited to three or fewer physical buttons, each with a large design of 15 mm or more in diameter. The large central button functions as an emergency button that sends an SOS notification to a parent with a single operation. To prevent accidental activation, the SOS button is activated only when it is pressed and held for 3 seconds or more.

[0156] The display is a small screen of 2 inches or less, and the displayed content is limited to the time, battery level, and message notifications. It does not display text; information is conveyed through icons and voice prompts. This enables intuitive operation that does not rely on reading ability.

[0157] The audio output unit 31 incorporates an acoustic filter that appropriately suppresses high-frequency components (4 kHz and above), taking into consideration the hearing characteristics of children. Furthermore, the maximum volume is limited to 85 dB or less, providing a hearing protection function.

[0158] These features distinguish the terminal in this application from a mere general-purpose communication terminal, allowing it to function as a dedicated device optimized for the physical and cognitive characteristics of children. Under these specific constraints, the voice message generation function that reflects the parent's voice quality is particularly important for children who cannot rely on text information.

[0159] This section describes a modified example in which a wearable device such as a smartwatch is used as the monitored terminal 30. Because the wearable device is worn constantly, it enables continuous monitoring.

[0160] Due to screen size limitations, the voice interface is the primary means of communication on the wearable device. The device also incorporates a vibration motor and has a notification function that combines voice and vibration. Messages from the mother are identified by two short vibrations, messages from the father by three short vibrations, and urgent messages by continuous vibration. This ensures that the user can reliably recognize incoming messages even while active.
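The sender-specific vibration patterns in [0160] amount to a small lookup. The sketch below encodes each pattern as alternating on/off durations in milliseconds; the pulse lengths and the names `Sender` and `vibration_pattern` are hypothetical, since the disclosure gives only the number of vibrations.

```python
from enum import Enum


class Sender(Enum):
    MOTHER = "mother"
    FATHER = "father"


SHORT_MS = 200        # assumed length of one short pulse
GAP_MS = 150          # assumed pause between pulses
CONTINUOUS_MS = 3000  # assumed length of the urgent vibration

PULSES = {Sender.MOTHER: 2, Sender.FATHER: 3}


def vibration_pattern(sender: Sender, urgent: bool = False) -> list:
    """Return alternating on/off durations in ms, starting with 'on'."""
    if urgent:
        return [CONTINUOUS_MS]
    pattern = []
    for i in range(PULSES[sender]):
        pattern.append(SHORT_MS)
        if i < PULSES[sender] - 1:
            pattern.append(GAP_MS)
    return pattern
```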

[0161] This section describes a modified version that interacts with a smart speaker in the home. The information processing device 10 has the function of connecting to a smart speaker API.

[0162] When a child returns home, the arrival is detected from the location information of the monitored terminal 30, and a message from the parent is automatically played from the smart speaker. The smart speaker's high-quality audio output reproduces the parent's voice with greater realism. Furthermore, children can reply using voice commands such as "Alexa, send a message to Mom." This results in smoother parent-child communication within the home.

[0163] This section describes future integration with AR (Augmented Reality) and VR (Virtual Reality) technologies. The information processing device 10 has a 3D avatar generation function.

[0164] A 3D avatar is created from a parent's facial photograph and voice model, generating real-time mouth movements. When the child puts on AR glasses, the parent's avatar appears before them and speaks to them, moving its mouth in sync with the voice message. The integration of audio and visual information creates a sense of presence, as if the parent were right there. This has the effect of creating a deeper emotional connection that transcends physical distance.

[0165] This section explains the application of this technology to the medical and nursing care fields. It is anticipated to be used in elderly care facilities and hospitals.

[0166] This system can be applied to dementia patients to provide medication reminders and lifestyle guidance in the voices of family members. The storage unit 11 manages the speaking speed and the number of repetitions according to the patient's cognitive function level. Instructions given in the voices of family members are more readily accepted than those given by nurses, improving medication compliance. Encouraging messages from family members to hospitalized patients also contribute to the patient's emotional stability. As a result, the quality of medical and nursing care improves while the emotional burden on families is reduced.

[0167] This section describes the application of this technology to the field of education. It is envisioned for use as a home learning support system.

[0168] The voice message generation unit 13 has the function of generating instructions and encouragement for learning tasks in the voice of a parent. Messages such as "Today's homework is page 20 of the math drill" and "You did a great job!" are automatically generated in the parent's voice. In addition, the picture book reading function creates a voice model from a voice recording made by the parent, making it possible to read other picture books in the parent's voice as well. As a result, even when the parent is absent, warm encouragement and learning support can be continued, which helps to maintain and improve the child's motivation to learn.

[0169] Figure 29 is a comparison chart of the effects of system implementation. Comparing the periods before and after the implementation of this system confirmed the following quantitative and qualitative improvements.

[0170] In terms of children's psychological indicators, the average anxiety score when receiving messages decreased from 4.2 to 1.8 (on a 5-point scale). Furthermore, the message response rate improved from 45% to 78%, and the number of spontaneous replies increased 2.5 times.

[0171] In the message comprehension test, the accuracy rate improved from 68% to 89% compared to conventional machine-generated voice messages. Particularly significant improvement was observed in understanding messages containing emotions (such as "I'm worried" or "I'm happy").

[0172] In a survey of parental satisfaction, 92% of respondents said their communication with their children had deepened, and 88% said they felt connected even when apart. Furthermore, 95% expressed an extremely high intention to continue using the system. These results demonstrate that the system strengthens the emotional bond between parents and children and enhances the essential value of the monitoring service.

[0173] This section explains the technical considerations when implementing this system.

[0174] To ensure scalability, the information processing device 10 is built on a cloud infrastructure with automatic scaling capabilities. GPU instances are dynamically added and removed according to the load of speech synthesis processing. Processing capacity is automatically increased during peak times (when children travel to school in the morning and return home in the evening).
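The load-based scaling in [0174] can be reduced to a simple sizing rule. The per-instance capacity and the minimum and maximum instance counts below are hypothetical parameters, and `desired_gpu_instances` is an illustrative name; an actual deployment would rely on the cloud provider's own autoscaling policy.

```python
import math


def desired_gpu_instances(
    pending_requests: int,
    per_instance_capacity: int = 50,
    min_instances: int = 1,
    max_instances: int = 20,
) -> int:
    """Return how many GPU instances are needed for the current backlog."""
    needed = math.ceil(pending_requests / per_instance_capacity)
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, needed))
```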

[0175] To optimize latency, we utilize a CDN (Content Delivery Network) and perform processing on geographically distributed edge servers. In Japan, edge servers are located in three locations: Tokyo, Osaka, and Fukuoka, and speech synthesis is performed on the nearest server.
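Selecting the nearest of the three edge locations in [0175] can be sketched with a great-circle distance. The coordinates below are approximate city-center values used as stand-ins for the actual server sites, which the disclosure does not give, and `nearest_edge` is a hypothetical helper. A production system would more likely route by measured network latency than by geographic distance.

```python
import math

# Approximate (latitude, longitude) of each city; assumed server locations.
EDGE_SERVERS = {
    "tokyo": (35.68, 139.77),
    "osaka": (34.69, 135.50),
    "fukuoka": (33.59, 130.40),
}


def haversine_km(a, b) -> float:
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_edge(lat: float, lon: float) -> str:
    """Return the name of the edge server closest to the given position."""
    return min(EDGE_SERVERS, key=lambda name: haversine_km((lat, lon), EDGE_SERVERS[name]))
```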

[0176] As a failover function, a redundant configuration is employed that automatically switches to the secondary system if the primary system fails. If speech synthesis fails, it switches to a fallback speech that prioritizes real-time performance, even if it is of lower quality. These implementations achieve an availability of over 99.9%, enabling the provision of stable services.
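The two-stage behavior in [0176], switching from the primary to the secondary system and then to a lower-quality fallback voice, can be sketched as an ordered chain of attempts. All function names here are hypothetical, and the three demo engines merely simulate a failure, a successful synthesis, and a generic fallback voice.

```python
from typing import Callable, Sequence


def synthesize_with_failover(
    text: str,
    engines: Sequence[Callable[[str], bytes]],
    fallback: Callable[[str], bytes],
):
    """Try each engine in order; use the fallback voice if all of them fail."""
    for index, engine in enumerate(engines):
        try:
            return engine(text), f"engine-{index}"
        except Exception:
            continue  # this engine failed, so move on to the next one
    return fallback(text), "fallback"


def failing_engine(text: str) -> bytes:
    raise RuntimeError("simulated primary failure")


def working_engine(text: str) -> bytes:
    return b"hq:" + text.encode("utf-8")


def fallback_voice(text: str) -> bytes:
    return b"lq:" + text.encode("utf-8")
```

Returning the source label alongside the audio lets the caller log how often the fallback path is taken.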

[0177] As explained above, the voice message generation system disclosed herein, by using a voice model unique to each parent, can generate warm and natural voice messages that cannot be achieved with machine-generated speech. This significantly improves the quality of parent-child communication in the monitoring service, fostering a sense of security in children and contributing to the peace of mind of parents.

[0178] This system is realized by integrating cutting-edge technologies such as speech synthesis, machine learning, and natural language processing, and further improvements are expected as these technologies continue to advance. We hope that this system will be widely used as one solution that technology can provide to the important social issue of child safety monitoring.

[0179] It should be noted that the present invention is not limited to the embodiments and applications described above, and various modifications are possible without departing from the spirit of the invention.

[0180] (Summary) (General problem) One of the objectives of the present invention is to improve the quality of messages, and how well they are conveyed, in remote communication such as monitoring services, thereby facilitating smooth communication and deepening emotional connections between users.

[0181] Issues related to (Appendix 1) In conventional child monitoring services, the mechanical voices generated by general-purpose speech synthesis engines presented a technical challenge: children receiving messages often felt anxious, and the emotional nuances of parental love and concern were not adequately conveyed. This problem could not be solved simply by adjusting voice parameters; voice generation that reflected the speaker's personality itself was necessary. (Note 1) An information processing device comprising: a storage unit that stores account information of a caregiver and a voice model unique to the caregiver in association with each other; a receiving unit that receives a non-voice message input on the caregiver's terminal and the account information of the caregiver who input the non-voice message; a voice message generation unit that synthesizes the non-voice message into voice by applying the voice model associated with the account information to generate a first voice message; and a transmission unit that transmits the first voice message to the monitored person's terminal. This configuration allows children to receive voice messages in the familiar voice of their guardians, eliminating the anxiety associated with conventional, robotic voices and allowing them to understand the message content with confidence. This significantly improves the quality of parent-child communication in the monitoring service and ensures that the intended message is properly conveyed.

[0182] Issues related to (Appendix 2) This system solves the problem of not being able to clearly identify who a message is from when multiple caregivers are watching over one person being cared for. (Note 2) The storage unit stores the account information of multiple caregivers and the corresponding voice model for each, and the voice message generation unit identifies the corresponding voice model based on the account information and generates the first voice message. This allows messages from multiple guardians (mother, father, grandparents, etc.) to be clearly distinguished and played back based on their respective voices, enabling children to instantly recognize who the message is from and resulting in more reassuring communication.

[0183] Issues related to (Appendix 3) This solution addresses the problem of unclear methods for generating and updating voice models, and the difficulty in adequately reproducing voice quality with only the limited voice data available during initial registration. (Note 3) The receiving unit includes a voice learning unit that collects the third voice message received from the caregiver terminal and generates or updates the voice model. This means that the voice quality improves the longer you use the system, resulting in more natural and parent-like voices being generated. Through continuous learning, a high-quality voice model is gradually built, even starting with the limited data from the initial registration.

[0184] Issues related to (Appendix 4) This solution addresses the challenges of inefficient audio data collection for voice learning and the burden it places on users by requiring special recording sessions. (Note 4) The third voice message is one or more recorded voice messages previously sent from the monitoring terminal to the monitored terminal. This allows the voice model to be improved naturally through normal use, without requiring any special voice registration, by utilizing voice recordings made during regular message sending. Continuous quality improvement is achieved without burdening the user.

[0185] Issues related to (Appendix 5) This solves the problem of being unable to properly learn the individual differences in speaking styles, resulting in the generation of only uniform speech models. (Note 5) The voice learning unit learns the speaking characteristics of each caregiver's account based on the third voice message received by the receiving unit and reflects them in the voice model. This accurately reproduces each parent's unique speaking style, intonation, speaking speed, and vocal pitch, resulting in more personalized voice messages. For the child, it provides a natural experience, as if the parent themselves were speaking to them.

[0186] Issues related to (Appendix 6) This addresses the problem of generated audio lacking emotional expression, which prevents the message's intent and the parents' feelings from being properly conveyed. (Note 6) The receiving unit further receives the voice expression parameters transmitted from the monitoring terminal, and the voice message generation unit applies the voice expression parameters together with the voice model to generate the first voice message. This allows for precise control over emotional expressions such as "gentle," "encouraging," and "concerned," as well as vocal expressions such as speaking speed and volume, enabling parents to more accurately convey their intentions and feelings to their children. Even subtle nuances that are difficult to express through text alone can be conveyed through voice.

[0187] Issues related to (Appendix 7) This solves the problem that setting voice expression parameters is technically difficult and beyond the reach of the average parent. (Note 7) The system includes a UI control unit that displays, on the caregiver's terminal, an operation screen for specifying voice expression parameters. This allows even parents without technical knowledge to easily adjust emotional expression and voice through an intuitive user interface. The visual interface, including sliders and buttons, enables anyone to achieve the desired voice expression.

[0188] Issues related to (Appendix 8) This solves the problem of unclear timing for setting voice expression parameters, which hinders the message creation process. (Note 8) The UI control unit places an input element for setting voice expression parameters on either the message input or selection screen, or the message sending screen for an input or selected message. This allows users to set voice expressions within the natural flow of message creation, improving usability and reducing the burden on parents. Voice expressions can be adjusted at the appropriate time depending on the message content.

[0189] Issues related to (Appendix 9) This solution addresses the challenge of parents having to constantly think of appropriate messages and verbal expressions for each situation, making quick responses difficult. (Note 9) The receiving unit further receives location information of the monitored person's terminal, and the UI control unit displays, on the caregiver's terminal, suggested message content and voice expression parameters appropriate to the situation based on the location information and time information. This allows the system to estimate the situation from the child's current location and the time, and to automatically suggest appropriate messages and voice expressions, such as "It's time to go home from school, so speak to them gently" or "This is a dangerous area, so warn them." This enables parents to communicate quickly and appropriately.

[0190] Issues related to (Appendix 10) This solution addresses the challenge of requiring parents to manually send suggested messages, which makes timely responses difficult. (Note 10) When the proposal is selected on the monitor terminal, the UI control unit controls the information processing device to generate and transmit the first voice message in accordance with that selection. This allows parents to send appropriate messages with a single tap, significantly reducing their workload and enabling quick responses to their children. It also ensures swift and appropriate communication even in emergencies.

[0191] Issues related to (Appendix 11) This system solves the problem of generating uniform voice messages regardless of the age or comprehension level of the person being monitored, making it difficult for them to understand the message. (Note 11) The storage unit further stores the profile information of the person being monitored, and the voice message generation unit generates the first voice message by adjusting at least one of the wording, volume, and speed of the message based on the profile information. This automatically adjusts the reading style (prioritizing hiragana), speaking speed, and wording to be easy to understand, according to the child's age, grade level, and comprehension level. It generates easy-to-understand messages adapted to the child's stage of cognitive development, ensuring reliable information transmission.

[0192] Issues related to (Appendix 12) This solution addresses the problem of emergency messages being conveyed using the same language as normal conversations, which can lead to an inadequate understanding of the urgency of the situation. (Note 12) The receiving unit further comprises an analysis unit that receives a second voice message from the monitored person's terminal and analyzes the content of the received second voice message, and the voice message generation unit generates the first voice message with emphasis on at least one of the wording, volume, or speed of the first voice message based on the urgency of the analyzed message. This automatically increases volume, slows down important phrases, and adds warning sounds during emergencies, ensuring children's attention is captured and they don't miss important instructions. This improves the success rate of avoiding danger and responding to emergencies.

[0193] Issues related to (Appendix 13) This addresses the problem of being unable to understand a child's emotional state and respond appropriately. (Note 13) The receiving unit further includes an analysis unit that receives a second voice message from the monitored person's terminal and analyzes the acoustic characteristics of the received second voice message, and the voice message generation unit generates the first voice message based on the analyzed acoustic characteristics of the second voice message. This allows the system to estimate a child's emotional state, such as "energetic," "tired," or "sad," from their voice and generate a message in an appropriate tone accordingly. This enables more empathetic communication that is attentive to the child's psychological state.

[0194] Issues related to (Appendix 14) This solves the problem of communication breakdowns when there is no way to resend a message if the child does not understand it. (Note 14) The receiving unit further includes an analysis unit that receives a second voice message from the monitored person's terminal and analyzes the content of the received second voice message. The analysis unit determines whether retransmission is necessary based on the content of the second voice message, and if it determines that retransmission is necessary, the transmitting unit retransmits the first voice message that was transmitted before the second voice message. This automatically detects responses indicating a lack of understanding, such as "Say that again," and automatically resends the message using clearer language and adjusted audio. This ensures reliable message delivery even in noisy environments or situations where the recipient is easily distracted.
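The retransmission decision in Note 14 can be sketched as a check on the transcribed reply. This assumes the second voice message has already been converted to text by some speech recognizer, which is outside the sketch; the phrase list and the names `needs_retransmission` and `handle_reply` are hypothetical, and simple substring matching stands in for whatever content analysis the analysis unit performs.

```python
# Assumed phrases that signal the child did not understand the message.
RETRY_PHRASES = ("say that again", "one more time", "what did you say", "もう一回")


def needs_retransmission(reply_transcript: str) -> bool:
    """Return True if the reply asks for the message to be repeated."""
    normalized = reply_transcript.strip().lower()
    return any(phrase in normalized for phrase in RETRY_PHRASES)


def handle_reply(reply_transcript, last_sent, send) -> bool:
    """Resend the previously transmitted first voice message when requested."""
    if last_sent is not None and needs_retransmission(reply_transcript):
        send(last_sent)
        return True
    return False
```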

[0195] Issues related to (Appendix 15) This solves the problem of situations where audio is difficult to hear because it is played at the same volume and sound quality regardless of the surrounding environment. (Note 15) The receiving unit receives a second voice message from the monitored person's terminal and further comprises an analysis unit that analyzes surrounding environment information from the received second voice message, and the voice message generation unit generates the first voice message based on the surrounding environment information. This allows for volume adjustment and frequency response optimization based on noise levels and types, ensuring children can reliably hear messages in any environment. Clear voice communication is possible even outdoors or in noisy places.

[0196] Issues related to (Appendix 16) This addresses the challenge of limited non-voice messaging formats that cannot meet diverse communication needs. (Note 16) Non-voice messages include at least one of the following: text data, identification information corresponding to a predefined template message, and identification information corresponding to a stamp or button for expressing a reaction. This allows for diverse input methods, including free text input, selection of pre-set messages, and expression of emotions through stamps. This enables flexible communication tailored to the parent's situation and preferences, making it easy for parents who are not comfortable with typing to send messages that include rich emotional expression.

[0197] Issues related to (Appendix 17) In the information processing methods used in child monitoring services, there were technical challenges such as children feeling anxious with machine-generated speech and the parents' intentions not being properly conveyed. (Note 17) An information processing method in a monitoring system that transmits location information of a monitored terminal to a monitor terminal, wherein a processor stores the monitor's account information and a voice model unique to the monitor in association with each other; the processor receives a non-voice message input to the monitor terminal and the account information of the monitor who input the non-voice message; the processor applies the voice model associated with the account information to synthesize the non-voice message into speech to generate a first voice message; and the processor transmits the first voice message to the monitored terminal. This method allows children to receive voice messages in a familiar voice tone of their guardian, eliminating the anxiety associated with robotic voices. As a result, the quality of parent-child communication in the monitoring service improves, and the intent of the message is conveyed appropriately.

[0198] Issues related to (Appendix 18) In the program that implements the child monitoring service, there was a technical challenge: using general-purpose speech synthesis caused anxiety in children and failed to convey the emotions of the parents. (Note 18) A program that causes a processor of an information processing device, in a monitoring system that transmits location information of a monitored terminal to a monitor terminal, to: store the monitor's account information and a voice model unique to the monitor in association with each other; receive a non-voice message entered on the monitor terminal and the account information of the monitor who entered the non-voice message; apply the voice model associated with the account information to synthesize the non-voice message into speech and generate a first voice message; and transmit the first voice message to the monitored terminal. This program enables the creation of voice messages that reflect the voice characteristics of parents, using existing hardware. This significantly improves the quality of monitoring services without requiring large-scale capital investment.

[0199] (Notes related to system claims) Issues related to (Appendix 19) In the entire monitoring system, including the monitor's terminal and the monitored person's terminal, there was a problem in that mechanical voices weakened the emotional connection between parent and child. (Note 19) A monitoring system comprising a monitor terminal, a monitored terminal, and an information processing device, wherein the system transmits location information of the monitored terminal to the monitor terminal, the information processing device comprising: a storage unit that stores the monitor's account information and the monitor's unique voice model in association with each other; a receiving unit that receives a non-voice message input to the monitor terminal and the account information of the monitor who input the non-voice message; a voice message generation unit that synthesizes the non-voice message by applying the voice model associated with the account information to generate a first voice message; and a transmission unit that transmits the first voice message to the monitored terminal. This system enables not only location-based monitoring but also warm communication through the voices of parents. It allows for the maintenance and strengthening of the emotional bond between parents and children, transcending physical distance.

[0200] Issues related to (Appendix 20) This addresses the challenge of reducing battery consumption in the monitored person's terminal so that monitoring can continue for long periods. (Note 20) The monitored person's terminal includes a power management unit that detects the remaining battery level, and a control unit that suppresses the display of images on the display when the first voice message is played if the remaining battery level is below a predetermined threshold. Because the sender can be identified by ear from the parent's voice quality, image display can be omitted when the battery level is low, significantly extending the terminal's operating time. This is a remarkable effect unique to this invention that is difficult to achieve with mechanical voices.
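The battery-dependent display control in Note 20 comes down to a threshold comparison. The 15% threshold and the name `playback_plan` are hypothetical; the disclosure says only that the threshold is predetermined.

```python
LOW_BATTERY_THRESHOLD = 0.15  # assumed value; 1.0 means fully charged


def playback_plan(battery_level: float, threshold: float = LOW_BATTERY_THRESHOLD) -> dict:
    """Decide what to output when the first voice message is played."""
    return {
        "play_audio": True,  # the voice message is always played
        # Images are suppressed only when the level is below the threshold.
        "show_image": battery_level >= threshold,
    }
```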

[0201] This disclosure is not limited to the embodiments described above, and various modifications and applications are possible. For example, technical elements such as programming languages, development frameworks, database management systems, and communication protocols can be appropriately selected according to specific implementation requirements. Furthermore, regarding the implementation method of the voice model, the most suitable one can be selected from various neural speech synthesis technologies such as WaveNet, Tacotron 2, and FastSpeech 2, depending on the implementation environment and required performance.

[0202] Furthermore, this disclosure can be applied to fields other than monitoring services. For example, it can be used in various situations such as delivering voice messages from family members to elderly care facilities, sending encouraging messages to patients in hospitals, and providing learning support messages from parents in the education sector. In these applications as well, the essential effect of the present invention, which is to provide a sense of security through a familiar voice, is maintained.

[Explanation of Symbols]

[0203]
1 Monitoring system
10 Information processing device
11 Storage unit
12 Receiving unit
13 Voice message generation unit
14 Transmitting unit
15 UI control unit
16 Voice learning unit
17 Analysis unit
20 Monitor terminal
30 Monitored terminal
40 Network
50 Voice model
51 Voice expression parameters
52 Non-voice message
53 First voice message
54 Second voice message
55 Third voice message
(Detailed explanations of the symbols are omitted below.)

Claims

1. An information processing device in a monitoring system that transmits location information of a monitored person's terminal to a caregiver's terminal, the information processing device comprising: a storage unit that stores account information of a caregiver and a voice model unique to that caregiver in association with each other; a receiving unit that receives a non-voice message entered on the caregiver's terminal and the account information of the caregiver who entered the non-voice message; a voice message generation unit that generates a first voice message by synthesizing the non-voice message into speech using the voice model associated with the account information; and a transmitting unit that transmits the first voice message to the monitored person's terminal.

2. The information processing device according to claim 1, wherein the storage unit stores account information of a plurality of caregivers and the voice model corresponding to each caregiver in association with each other, and the voice message generation unit identifies the corresponding voice model based on the account information and generates the first voice message.

3. The information processing device according to claim 1, further comprising a voice learning unit that collects a third voice message received by the receiving unit from the monitoring terminal and generates or updates the voice model.

4. The information processing device according to claim 3, wherein the third voice message is one or more recorded voice messages previously transmitted from the monitoring terminal to the monitored person's terminal.

5. The information processing device according to claim 3, wherein the voice learning unit learns, for each caregiver account, the caregiver's speaking characteristics based on the third voice message received by the receiving unit and reflects them in the voice model.

6. The information processing device according to claim 1, wherein the receiving unit further receives voice expression parameters transmitted from the monitoring terminal, and the voice message generation unit generates the first voice message by applying the voice expression parameters together with the voice model.

7. The information processing device according to claim 6, further comprising a UI control unit that displays, on the monitoring terminal, an operation screen for specifying the voice expression parameters.

8. The information processing device according to claim 7, wherein the UI control unit places an input element for setting the voice expression parameters on either a screen for entering or selecting a message or a screen for transmitting the entered or selected message.

9. The information processing device according to claim 7, wherein the receiving unit further receives location information of the monitored person's terminal, and the UI control unit displays, on the monitoring terminal, a suggestion of message content and voice expression parameters suited to the situation, based on the location information and time information.

10. The information processing device according to claim 9, wherein, when the suggestion is selected on the monitoring terminal, the UI control unit controls the information processing device to generate and transmit the first voice message in accordance with the selection.

11. The information processing device according to claim 1, wherein the storage unit further stores profile information of the monitored person, and the voice message generation unit generates the first voice message by adjusting at least one of the wording, volume, and speed of the message based on the profile information.

12. The information processing device according to claim 1, wherein the receiving unit receives a second voice message from the monitored person's terminal, the information processing device further comprises an analysis unit that analyzes the content of the received second voice message, and the voice message generation unit generates the first voice message with emphasis applied to at least one of wording, volume, and speed, based on the urgency of the analyzed content.

13. The information processing device according to claim 1, wherein the receiving unit receives a second voice message from the monitored person's terminal, the information processing device further comprises an analysis unit that analyzes acoustic characteristics of the received second voice message, and the voice message generation unit generates the first voice message based on the analyzed acoustic characteristics of the second voice message.

14. The information processing device according to claim 1, wherein the receiving unit receives a second voice message from the monitored person's terminal, the information processing device further comprises an analysis unit that analyzes the content of the received second voice message, the analysis unit determines whether retransmission is necessary based on the content of the second voice message, and, when retransmission is determined to be necessary, the transmitting unit retransmits the first voice message that was transmitted before the second voice message.

15. The information processing device according to claim 1, wherein the receiving unit receives a second voice message from the monitored person's terminal, the information processing device further comprises an analysis unit that analyzes surrounding environment information from the received second voice message, and the voice message generation unit generates the first voice message based on the surrounding environment information.

16. The information processing device according to claim 1, wherein the non-voice message includes at least one of text data, identification information corresponding to a predetermined standard message, and identification information corresponding to a stamp or button for expressing a reaction.

17. An information processing method in a monitoring system that transmits location information of a monitored person's terminal to a monitoring terminal, the method comprising: storing, by a processor, account information of a caregiver and a voice model unique to that caregiver in association with each other; receiving, by the processor, a non-voice message entered on the monitoring terminal and the account information of the caregiver who entered the non-voice message; generating, by the processor, a first voice message by applying the voice model associated with the account information to synthesize the non-voice message into speech; and transmitting, by the processor, the first voice message to the monitored person's terminal.

18. A program that causes a processor of an information processing device, in a monitoring system that transmits location information of a monitored person's terminal to a monitoring terminal, to execute processing comprising: storing account information of a caregiver and a voice model unique to that caregiver in association with each other; receiving a non-voice message entered on the monitoring terminal and the account information of the caregiver who entered the non-voice message; generating a first voice message by applying the voice model associated with the account information to synthesize the non-voice message into speech; and transmitting the first voice message to the monitored person's terminal.

19. A monitoring system comprising a monitoring terminal, a monitored person's terminal, and an information processing device, the monitoring system transmitting location information of the monitored person's terminal to the monitoring terminal, wherein the information processing device comprises: a storage unit that stores account information of a caregiver and a voice model unique to that caregiver in association with each other; a receiving unit that receives a non-voice message entered on the monitoring terminal and the account information of the caregiver who entered the non-voice message; a voice message generation unit that generates a first voice message by synthesizing the non-voice message into speech using the voice model associated with the account information; and a transmitting unit that transmits the first voice message to the monitored person's terminal.

20. The monitoring system according to claim 19, wherein the monitored person's terminal comprises: a power management unit that detects the remaining battery level; and a control unit that, when the remaining battery level is below a predetermined threshold, suppresses the display of an image on the display while the first voice message is played.
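For illustration only, and without limiting the claims, the flow recited in claims 1 and 2 might be sketched in Python as follows. The class name, method names, account identifiers, and the `fake_synthesize` stand-in are all hypothetical; an actual device would invoke a neural speech synthesis engine and a network transport in their place.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


def fake_synthesize(text: str, voice_model: str) -> bytes:
    """Stand-in for the synthesis engine: tags the text with the model ID."""
    return f"{voice_model}:{text}".encode("utf-8")


@dataclass
class InformationProcessingDevice:
    # Hypothetical engine hook: (text, voice_model) -> audio bytes.
    synthesize: Callable[[str, str], bytes]
    # Storage unit: caregiver account ID -> that caregiver's voice model ID.
    voice_models: Dict[str, str] = field(default_factory=dict)
    # Stand-in for the network: (destination terminal, audio) pairs sent so far.
    outbox: List[Tuple[str, bytes]] = field(default_factory=list)

    def register(self, account_id: str, voice_model: str) -> None:
        """Store the account information and voice model in association."""
        self.voice_models[account_id] = voice_model

    def handle_message(self, account_id: str, text: str, terminal: str) -> bytes:
        """Receive a non-voice message, synthesize it, and transmit it."""
        # Identify the voice model from the account information (claim 2).
        voice_model = self.voice_models.get(account_id)
        if voice_model is None:
            raise KeyError(f"no voice model registered for {account_id!r}")
        # Voice message generation unit: produce the first voice message.
        audio = self.synthesize(text, voice_model)
        # Transmitting unit: deliver to the monitored person's terminal.
        self.outbox.append((terminal, audio))
        return audio


device = InformationProcessingDevice(synthesize=fake_synthesize)
device.register("mom-account", "voice-mom")
device.register("dad-account", "voice-dad")
device.handle_message("dad-account", "On my way home", "terminal-30")
```

Keeping the synthesis engine behind a plain callable is a design choice of this sketch: it lets the account-to-model lookup be exercised without any audio hardware or trained model.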