Terminal device, voice message processing method, program, and voice message system
The terminal device continues recording a voice message even after the recording operation is canceled, as long as speech is detected. This addresses premature termination of recording and improves message integrity and reliability.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- MIXI INC
- Filing Date
- 2025-05-22
- Publication Date
- 2026-07-02
Abstract
Description
Technical Field
[0001] The present invention relates to a terminal device, a voice message processing method, a program, and a voice message system.
Background Art
[0002] In recent years, a monitoring service has become widespread in which position information and messages of monitored persons such as children and the elderly are transmitted to other terminals (for example, smartphones) possessed by guardians using portable terminal devices (hereinafter also simply referred to as "terminal devices") equipped with a GPS (Global Positioning System) function or the like. In such a terminal device, a function that allows a monitored person to easily transmit a voice message to a guardian is required.
[0003] As a general method for transmitting a voice message, a push-to-talk (PTT) method is known. In this method, voice is recorded only while the user presses a specific button on the terminal device, and when the finger is lifted from the button, the recording ends and the recorded voice message is transmitted.
[0004] However, with a conventional terminal device using the PTT method, many users have difficulty keeping the button pressed continuously and accurately. This applies not only to children with weak finger strength, but also to users affected by gloves, the working environment, aging, or temporary circumstances. For such users, there is a problem that the finger accidentally leaves the button during the message and the recording ends prematurely. As a result, the content to be conveyed may not be completely recorded, and the guardian may fail to accurately understand the situation of the child or may misunderstand it.
[0005] For example, Patent Document 1 discloses a media message recording technology having two recording modes: a PTT communication mode and a Tap-to-Start (TTS) communication mode. This document acknowledges the problem that, in PTT mode, the user's finger may unintentionally slip off the button, causing the recording to end abruptly. However, the solution it proposes is to provide TTS mode (a mode in which recording starts when the button is tapped and ends when voice activity detection (VAD) or the like detects the end of speech) as an alternative for long messages. It does not disclose the concept of continuing the same recording session if speech continues after the PTT operation is released.
Prior Art Documents
Patent Documents
[0006] [Patent Document 1] U.S. Patent Application Publication No. 2015/0141072
Summary of the Invention
Problems to Be Solved by the Invention
[0007] The present invention has been made in view of the above circumstances, and aims to provide a terminal device, a voice message processing method, a program, and a voice message system that can improve the reliability of communication in a mobile terminal having a voice message function.
Means for Solving the Problems
[0008] To solve the above problems, a terminal device according to one aspect of the present invention includes: a detection unit that detects the start and cancellation of a user's voice message recording operation; a voice input unit that receives voice input; a speech determination unit that determines, based on the voice input from the voice input unit, whether or not the user has spoken after the cancellation of the recording operation; a recording control unit that, when the speech determination unit determines that the user has spoken, continues the recording that began with the start of the recording operation and generates a recorded voice message; a positioning unit that measures the user's location information; and a transmission unit that transmits each of the generated voice message and the measured location information to a predetermined server.
Effects of the Invention
[0009] According to the present invention, unintended interruptions in recording can be suppressed, and the integrity of the message can be enhanced. This improves the reliability of communication.
Brief Description of the Drawings
[0010]
[Figure 1] A schematic diagram showing the overall configuration of a voice message system according to one embodiment of the present invention.
[Figure 2] A block diagram showing an example of the hardware configuration of the terminal device.
[Figure 3] A block diagram showing an example of the functional configuration of the terminal device.
[Figure 4] A diagram showing an example of a data structure used in this embodiment.
[Figure 5] A flowchart showing an example of the voice message recording and transmission process in the terminal device according to this embodiment.
[Figure 6] A sequence diagram showing an example of the main processing sequence in the voice message system according to this embodiment.
[Figure 7A] A front view showing an example of the external appearance of the terminal device according to this embodiment.
[Figure 7B] A diagram showing an example of the display on the terminal device shown in Figure 7A.
[Figure 7C] A diagram showing an example of a first layout variation of the appearance of the terminal device according to this embodiment.
[Figure 7D] A diagram showing an example of a second layout variation of the appearance of the terminal device according to this embodiment.
[Figure 7E] A diagram showing an example of a third layout variation of the appearance of the terminal device according to this embodiment.
[Figure 7F] A diagram showing an example of a fourth layout variation of the appearance of the terminal device according to this embodiment.
[Figure 8] A time chart showing an example of the operation of recording continuation control in the terminal device according to this embodiment.
[Figure 9] A diagram showing an example of a conceptual configuration of the machine learning VAD in the speech determination unit of the terminal device according to this embodiment.
[Figure 10] A block diagram showing an example of the functional configuration of the server device according to this embodiment.
[Figure 11] A block diagram showing an example of the functional configuration of the guardian terminal according to this embodiment.
[Figure 12A] A diagram showing an example of a received message list display screen in the application of the guardian terminal according to this embodiment.
[Figure 12B] A diagram showing an example of a message playback / position information display screen in the application of the guardian terminal according to this embodiment.
Mode for Carrying Out the Invention
[0011] Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In each figure, the same or corresponding components are denoted by the same reference numerals, and duplicate explanations will be omitted as appropriate.
[0012] In this specification, a "voice message" refers to voice information that is recorded, encoded, and generated in the terminal device 100 in a form suitable for transmission, mainly for the purpose of a user conveying information, intentions, or emotions to others (such as guardians or other family members). Voice memos that are only stored inside the terminal as the user's own reminders, and lifelog data that records the surrounding acoustic environment without the user's active intention to transmit, can be distinguished from voice messages in terms of their generation purpose and usage form. The recording control unit 123 generates such voice information with a transmission purpose as a "voice message".
[0013] Also, in the embodiments described below, a terminal device mainly used by a child who is the person under guardianship will be taken as an example for explanation, but the scope of application of the present invention is not limited to this. The technology of continuing recording as long as the speech continues even after the user's recording operation is canceled, which is the core of the present invention, can be applied to, for example, the elderly, users who have difficulty with precise button operations due to physical constraints, or workers in factories, delivery staff, security guards, etc., who need to communicate while wearing gloves or in situations where their hands are occupied during work, and also to adult users who use the voice message function through a specific application on a general smartphone. In a wider range of usage scenarios and terminal devices, it can bring the common advantages of preventing unintended recording interruptions and improving the integrity of messages. Therefore, the "user" or "person under guardianship" in this specification shall include such a wide range of entities unless clearly limited to children in the context.
[0014] (Overview of the entire system) FIG. 1 is a schematic diagram showing the overall configuration of a voice message system 1 according to an embodiment of the present invention. The voice message system 1 mainly includes a terminal device 100 carried by a person under guardianship (such as a child), a guardian terminal 300 used by a guardian or the like, and a server device 200 connected to these via a communication network NW.
[0015] The terminal device 100 has the function of recording a voice message from the person under guardianship and transmitting the voice message, together with location information indicating that person's current location, to the server device 200. The server device 200 stores the voice message and location information received from the terminal device 100 and transmits this information to the guardian terminal 300 (for example, via push notification or in-application display) in response to a request from the guardian terminal 300 or based on predetermined conditions. The guardian terminal 300 is, for example, a smartphone or tablet device, and performs voice message playback and location information display via dedicated application software (hereinafter simply referred to as the "app").
[0016] In this specification, the "predetermined server" to which the terminal device 100 transmits voice messages and / or location information refers to one or more network computer devices configured to receive, store, process, or relay this information to the guardian terminal 300 or the like. This "predetermined server" may be, for example, a specific server device 200 operated by the service provider of the voice message system 1, another server device designated based on the user's settings or contract, or part of a distributed system consisting of a group of server devices. The transmission unit 125 transmits the information to one or more appropriate server devices according to the type of information or based on pre-configured routing information. Importantly, the transmitted information is processed so that it can ultimately be accessed or viewed by the intended recipient (e.g., a guardian).
[0017] One of the core features of this embodiment is that, in the terminal device 100, even after the user starts recording a voice message (for example, by long-pressing a specific button) and then cancels the recording operation (for example, by releasing their finger from the button), the recording automatically continues as long as the user continues speaking. This prevents situations where the recording is interrupted mid-message, especially if a child fails to press the button correctly, thereby improving the completeness of the message.
[0018] (Description of hardware configuration) Next, with reference to Figure 2, an example of the hardware configuration of the terminal device 100 according to this embodiment will be described. Figure 2 is a block diagram showing the hardware configuration of the terminal device 100. The terminal device 100 includes a CPU (Central Processing Unit) 101 that controls the entire device, a RAM (Random Access Memory) 102 that temporarily stores programs executed by the CPU 101 and data being processed, and a non-volatile storage 103 (for example, flash memory) that stores programs, various setting information, generated voice message data, etc.
[0019] The terminal device 100 also includes operation buttons 104 for receiving user input (for example, physical buttons used for recording start / stop / cancel operations), a microphone 105 for voice input, and a speaker 106 for outputting voice messages, notification sounds, etc. A display 107 (for example, a small LCD or organic EL display) for displaying simple information may be provided as needed.
[0020] Figure 7A is a front view showing an example of the appearance of a terminal device 100 equipped with such operation buttons 104 and a display 107. For example, the operation buttons 104 and the display 107 are arranged on the surface of the housing. As shown in Figure 7B, the display 107 may show an icon 701 indicating that recording is in progress, an icon 702 indicating signal strength, an icon 703 indicating battery level, the current time 704, and so on.
[0021] Figures 7C to 7F show examples of arrangement variations for the appearance of the terminal device 100. Figure 7C shows an example where circular operation buttons 104 (shown by dotted lines) are placed on the back of a rectangular display 107; Figure 7D shows an example where a frame-shaped operation button 104 surrounds a circular display 107; Figure 7E shows an example where the circular display 107 and circular operation buttons 104 are arranged separately, one above the other; and Figure 7F shows an example where the rectangular display 107 and circular operation buttons 104 are arranged separately, one above the other. These variations can be selected as appropriate according to the user's preferences and ease of operation.
[0022] Furthermore, the terminal device 100 includes a GPS receiver 108 for receiving radio waves from GPS satellites to determine its current location, and a communication interface 109 (for example, a communication module such as LTE (Long Term Evolution) or Wi-Fi (Wireless Fidelity)) for wireless communication with the server device 200 and other devices. These components are interconnected via a bus 110. The server device 200 is configured with a CPU, RAM, storage, communication interface, etc., similar to a typical computer system. Figure 10 is a block diagram showing an example of the main functional configuration of the server device 200, which will be described later. The guardian terminal 300 also has a hardware configuration similar to that of a typical smartphone.
[0023] (Explanation of functional block configuration) Next, with reference to Figure 3, an example of the functional configuration of the terminal device 100 according to this embodiment will be described. Figure 3 is a diagram showing the main functional blocks of the terminal device 100. These functional blocks are mainly realized by the CPU 101 executing programs stored in the storage 103 and cooperating with various hardware components.
[0024] The functional blocks shown in Figure 3 (such as the detection unit 120, voice input unit 121, speech determination unit 122, recording control unit 123, positioning unit 124, and transmission unit 125) represent logical divisions for realizing the functions they are responsible for, and do not necessarily mean that they are implemented as physically separated components. For example, the functions of the speech determination unit 122 and the recording control unit 123 may be realized by a single software module or integrated processing routine executed by a single processor (such as the CPU 101). What is important in the present invention is that each of the functions described in each claim is executed in the terminal device 100.
[0025] Furthermore, the terminal device 100 of this embodiment may further include a log recording unit (not shown) that records information regarding its operating status and processing results as logs in its internal storage 103. For example, the log recording unit records the detection time of the start and cancellation of the recording operation by the detection unit 120, the result of the speech determination unit 122's determination of whether or not speech was uttered and the time thereof, the start, continuation, and end times of recording by the recording control unit 123, the positioning result and time by the positioning unit 124, and the success or failure and time of data transmission by the transmission unit 125. Such log data can be used for verifying the operation of the device and debugging, and can also be referred to as indirect evidence that each function of the present invention has been executed, if necessary.
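As a rough illustration of the log recording unit described in [0025], the following sketch appends timestamped event records to an in-memory list. The specification does not define a log format, so the `LogEntry` fields, the `EventLog` class, and all event names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogEntry:
    time_s: float     # time of the event in seconds (hypothetical clock)
    source: str       # functional block that produced the event
    event: str        # hypothetical event name, e.g. "record_start"
    detail: str = ""  # optional free-form detail, e.g. "success"


@dataclass
class EventLog:
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, time_s: float, source: str, event: str, detail: str = "") -> None:
        """Append one event to the log."""
        self.entries.append(LogEntry(time_s, source, event, detail))

    def by_source(self, source: str) -> List[LogEntry]:
        """Return the events produced by one functional block, in order."""
        return [e for e in self.entries if e.source == source]


# Example session: press, release while speaking, end of recording, transmission.
log = EventLog()
log.record(0.0, "detection_unit", "record_start")
log.record(1.2, "detection_unit", "record_cancel")
log.record(1.2, "speech_determination_unit", "speech_present")
log.record(3.5, "recording_control_unit", "record_end")
log.record(4.0, "transmission_unit", "send", "success")
```

Filtering with `log.by_source("detection_unit")` then returns only the start and cancellation events, which is the kind of trace that could be used for operation verification.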
[0026] The terminal device 100 includes a detection unit 120 that detects user operations, a voice input unit 121 that receives voice input via a microphone 105, a speech determination unit 122 that determines whether or not the user is speaking, a recording control unit 123 that controls the recording process of voice messages, a positioning unit 124 that determines the current location using a GPS receiver 108, and a transmission unit 125 that transmits voice messages and location information to the server device 200.
[0027] The detection unit 120 detects user actions on the operation button 104 (e.g., long press, short press, tap, etc.). Specifically, it detects an operation to start recording a voice message (hereinafter referred to as "recording start operation") and the cancellation of that recording start operation (hereinafter referred to as "recording operation cancellation"). It also detects the recording cancellation operation described later (corresponding to Appendix 9). For example, the recording start operation is a long press of the operation button 104, and the recording operation cancellation is the end of that long press.
[0028] Here, "cancellation of recording operation" means that the active operation state for continuing recording by the user has ended. This is not limited to, for example, when the physical operation button 104 is no longer pressed (e.g., releasing after a long press), but may also include detecting a change in state that suggests the user no longer intends to continue recording, such as when the user's finger pressure falls below a predetermined threshold, or when voice input is interrupted for a predetermined period of time, even if the operation button 104 is still pressed. When the detection unit 120 detects any of these states, it outputs a signal to the recording control unit 123 that triggers a decision to continue recording.
[0029] In this specification, "start of voice message recording operation" is not limited to the long press operation of the physical operation button 104 as described above, but may include various operations that instruct the user to intentionally start recording a voice message, such as the user tapping a specific soft key on the display 107, issuing a specific voice command (e.g., "start recording"), or a predetermined gesture operation. Similarly, "cancellation of the recording operation" is not limited to releasing the physical operation button 104, but may include detecting various state changes that indicate the user has ended active control of the recording, such as the user lifting their finger from a soft key, uttering a voice command to instruct recording to end or send, a gesture operation that corresponds to the operation that instructed recording to start, or even the user's active input to continue recording being interrupted for a predetermined period of time (timeout). The detection unit 120 detects the start and cancellation (or end of active control) of these various operations and uses them as triggers for the recording control unit 123 to decide to continue recording.
[0030] The voice input unit 121 converts the analog voice signal input from the microphone 105 into digital voice data and outputs it to the speech determination unit 122 and the recording control unit 123.
[0031] The speech determination unit 122 determines in real time whether or not the user is speaking (presence or absence of speech) based on the voice data from the voice input unit 121 while the recording control unit 123 is recording. In this specification, "speech" refers to sounds (including words, interjections, calls, etc.) made by the user for the purpose of communication, and is distinguished from mere environmental noise or accidental sounds. Furthermore, "determining whether or not the user is speaking based on voice input" means that the voice data obtained from the voice input unit 121 is used as essential information for determining whether or not speech is being spoken and has a major influence on the determination result. Even if other sensor information (for example, body movement detection information from an acceleration sensor) is referred to as supplementary, as long as the voice data is used as essential and primary information as described above, the determination is made "based on voice input". Based on these determination criteria, the speech determination unit 122 determines whether or not speech is being spoken, for example, by whether or not acoustic characteristics specific to human speech are detected within a predetermined time.
[0032] This speech determination may be based, for example, on whether the volume level of the voice data exceeds a predetermined threshold, or it may use more advanced techniques. As an example of this embodiment, the speech determination unit 122 determines the presence or absence of speech by using a machine learning model for voice activity detection (VAD) (corresponding to Appendix 2). Figure 9 shows an example of a conceptual configuration of a machine learning VAD in such a speech determination unit 122. Voice data from the voice input unit 121 is first input to the audio feature extraction unit 901, where audio features such as MFCCs (Mel-Frequency Cepstral Coefficients) and spectrograms are extracted. The extracted audio features are input to the trained VAD model 902. This trained VAD model 902 is trained in advance on a database of many children's voice samples 903 and a database of various environmental noises 904, enabling it to detect children's speech with higher accuracy (corresponding to Appendix 3). The statement that the trained VAD model 902 has "learned from children's voice samples" may mean, for example, that part or all of its training data consists of voice data containing various content and emotional states collected from a large number of speakers in an age group generally recognized as children (e.g., around 3 to 12 years old). Furthermore, as a result of such training, or as a result of the design and tuning of the model, it is preferable that the speech determination unit 122 is configured to exhibit higher detection accuracy (e.g., higher recall and precision, or a higher F-score, which is their harmonic mean) for speech uttered by children (e.g., speech with specific acoustic characteristics such as a higher fundamental frequency, formant frequencies within a specific range, and a tendency for shorter utterance lengths compared to adults) under predetermined conditions (e.g., under specific types of environmental noise or in situations where multiple speakers are present) compared to a model primarily trained on typical adult speech. Such performance characteristics can be objectively evaluated using specific benchmark tests and evaluation datasets. Therefore, if the speech determination unit 122 actually exhibits such excellent performance characteristics with respect to children's voices, this can serve as indirect evidence that it has "learned from children's voice samples," or it can itself constitute an advantageous embodiment of the present invention. The trained VAD model 902 determines whether the current state is a speech interval or a silence (or noise) interval based on the input audio features, and outputs the speech presence / absence determination result 905 to the recording control unit 123. This makes it less likely to miss a child's quiet voice even in a noisy environment, and reduces the possibility of mistakenly identifying sounds other than speech as speech.
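The simplest determination mentioned at the start of [0032], a comparison of the volume level against a predetermined threshold, can be sketched as follows. This is not the machine learning VAD of Figure 9. The frame length, the threshold value, and the function names are assumptions chosen only for illustration.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in the range [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(samples: Sequence[float], frame_len: int = 160,
               threshold: float = 0.05) -> List[bool]:
    """Return one speech / no-speech flag per complete frame.

    A frame is flagged as speech when its RMS level exceeds the threshold.
    Trailing samples that do not fill a whole frame are ignored.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        flags.append(frame_rms(samples[start:start + frame_len]) > threshold)
    return flags


# Example: one quiet frame followed by one loud frame.
quiet = [0.001] * 160
loud = [0.5 if i % 2 == 0 else -0.5 for i in range(160)]
flags = energy_vad(quiet + loud)  # [False, True]
```

A pure level threshold cannot tell speech from loud noise, which is the limitation the trained VAD model 902 is meant to address.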
[0033] When the detection unit 120 detects a recording start operation, the recording control unit 123 starts recording voice data from the voice input unit 121. An important feature of this embodiment is that even after the detection unit 120 detects that the recording operation has been canceled, the recording control unit 123 does not immediately terminate the recording, but instead refers to the determination result of the speech determination unit 122. As long as the speech determination unit 122 determines that "speech is present," the recording control unit 123 continues recording. When the speech determination unit 122 determines that "no speech is present" (or after the speech termination delay described later has elapsed), the recording control unit 123 terminates the recording and generates the voice data recorded up to that point as a single voice message. Figure 8 is a time chart showing an example of the operation of such recording continuation control. In Figure 8, the horizontal axis represents time, and the lanes show, from top to bottom, the state changes of "button operation state," "audio input level," "VAD speech determination," and "recording state." For example, the chart shows that recording begins when the user presses the button at time t1, and that even after the button is released at time t2, recording continues as long as the VAD is detecting speech (until time t3).
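The basic behavior in [0033] can be sketched as a loop over fixed-length frames. The sketch assumes each frame already carries a button state and a VAD result; the tuple layout and the function name are hypothetical, and the timing parameters introduced later (Tmin, Tdelay, Tbuf) are deliberately left out.

```python
from typing import List, Sequence, Tuple


def recorded_frames(frames: Sequence[Tuple[bool, bool]]) -> List[int]:
    """Return the indices of the frames that end up in a single recording.

    Each frame is (button_pressed, speech_present).
    Recording starts on the first frame with the button pressed. After the
    button is released, recording continues while speech is present and ends
    at the first frame that has neither a button press nor speech.
    """
    recorded = []
    recording = False
    for i, (pressed, speech) in enumerate(frames):
        if not recording:
            if not pressed:
                continue          # still idle: nothing to record yet
            recording = True      # recording start operation detected
        elif not pressed and not speech:
            break                 # released and silent: end of the message
        recorded.append(i)
    return recorded


# Button held for frames 1-2, released at frame 3 while the user keeps talking.
timeline = [
    (False, False),
    (True, True), (True, True),    # t1: button pressed
    (False, True), (False, True),  # t2: released, speech continues
    (False, False),                # t3: speech ends
    (False, True),
]
kept = recorded_frames(timeline)  # [1, 2, 3, 4]
```

Frames 3 and 4 would be lost under a plain PTT scheme; here they remain part of the same message.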
[0034] In this specification, when the recording control unit 123 "continues recording," it means that, after the user cancels the recording operation, if the speech determination unit 122 determines that there is speech, it will continue recording the series of voice messages intended by the user without substantial interruption until the natural end of the speech. Here, "without substantial interruption" means that even if buffering of a very short time in milliseconds occurs for internal processing, for example, a continuity is maintained that the user does not perceive as a break in the voice message. Furthermore, "until the natural end of the speech" means, for example, until a predetermined speech end delay time (for example, the period Tdelay in Figure 8) has elapsed after the speech has ended, or until the user performs an explicit recording end operation (for example, the cancellation operation described in Appendices 8 and 9), thereby enabling the recording of the content the user wants to convey to the very end. Configurations that merely extend the recording for a fixed short time after the recording operation is canceled, or configurations in which the recording is interrupted in a way that is perceptible to the user, are intended to be distinguished from the manner in which the recording is "continued" as described herein.
[0035] In the above configuration in which the recording control unit 123 "continues recording," the condition "when the speech determination unit determines that there is speech" is not necessarily limited to a sequence in which a final determination of whether or not there is speech is made in real time immediately after the recording operation is canceled, and the decision on whether or not to continue recording is made based solely on that binary determination result. For example, as a modified example, after the recording operation is canceled, the recording control unit 123 may first unconditionally continue recording for a predetermined short period (for example, about 1 to 2 seconds; hereinafter referred to as the "provisional extension period"), and at the end of this provisional extension period, or during the provisional extension period, the speech determination unit 122 may evaluate whether or not meaningful speech by the user was detected throughout the entire recording period, which includes the recording period while the button was pressed and the provisional extension period. If it is evaluated that meaningful speech was detected throughout the entire recording period, the recording data including the provisional extension period is generated as part of a valid voice message. Conversely, if it is evaluated that no meaningful speech was detected at all, or that there was very little, the recording data may be discarded or transmission may be suppressed. Even with this configuration, the presence or absence of user utterances (or the resulting validity of the recorded content) is evaluated after the recording operation is canceled, and the voice message (or a part thereof) ultimately provided to the user is determined based on that evaluation. This falls within the scope of the present invention's technical idea of continuing recording to ensure message integrity if utterances are detected after the recording operation is canceled. 
The important point is to provide an opportunity for subsequent utterances to be recorded and used as a voice message without being substantially lost, even if the user unintentionally cancels the recording operation prematurely.
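The modified example in [0035] can be illustrated as follows: recording is extended unconditionally for a provisional period after release, and the whole session is then evaluated. The use of a speech-frame ratio as the measure of "meaningful speech", the 0.2 threshold, and the function names are assumptions; the specification does not fix an evaluation criterion.

```python
from typing import List, Sequence


def provisional_session(pressed_flags: Sequence[bool], vad_flags: Sequence[bool],
                        extension_frames: int) -> List[bool]:
    """Return the VAD flags of the whole session.

    The session spans the frames from the first button press to the last one,
    plus `extension_frames` frames recorded unconditionally after release.
    """
    pressed = list(pressed_flags)
    if True not in pressed:
        return []
    first = pressed.index(True)
    last = max(i for i, p in enumerate(pressed) if p)
    end = min(len(vad_flags), last + 1 + extension_frames)
    return list(vad_flags[first:end])


def keep_message(session_flags: Sequence[bool], min_speech_ratio: float = 0.2) -> bool:
    """Keep the recording only if enough of the session contained speech."""
    if not session_flags:
        return False
    return sum(session_flags) / len(session_flags) >= min_speech_ratio


pressed = [True, True, False, False, False, False]
spoken = [True, True, True, False, False, False]   # user talks past the release
silent = [False, False, False, False, False, False]

session = provisional_session(pressed, spoken, extension_frames=2)
decision_spoken = keep_message(session)                                   # True
decision_silent = keep_message(provisional_session(pressed, silent, 2))   # False
```

In the second case the recording data would be discarded or its transmission suppressed, as the paragraph describes.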
[0036] Furthermore, in the process flow in which the speech determination unit 122 determines whether or not the user has spoken after the recording operation is canceled, and the recording control unit 123 continues recording based on the result, the "determination of whether or not the user has spoken" by the speech determination unit 122 does not necessarily have to be an independent determination step that starts after the recording operation cancellation is completed. For example, the speech determination unit 122 may continuously perform voice activity detection (VAD) based on voice input from the voice input unit 121 from the time the user starts the recording operation, or throughout the recording operation, and monitor the user's speaking state in real time. In this case, when the detection unit 120 detects that the recording operation has been canceled, the recording control unit 123 can directly refer to the speaking state determined by the speech determination unit 122 (VAD) at the time of cancellation or for a predetermined period immediately preceding it (i.e., whether or not the user was speaking at the time of cancellation), and immediately continue recording if the user was speaking, or terminate recording if the user was not speaking. In this configuration, the event of canceling the recording operation, the evaluation of the speech state at that time, and the decision to continue or terminate recording occur in close proximity in time, or substantially in conjunction. This, too, is included in the technical concept of the present invention, because it ensures message integrity by deciding, after the user unintentionally cancels the recording operation, whether to continue recording in light of the speech state at that time.
What is important is that if the recording operation ends earlier than the user intended, the presence or continuation of speech should be evaluated in some way, and an opportunity to continue recording should be provided so that subsequent speech is not lost.
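The variant in [0036], in which VAD runs continuously and the decision at release looks at a short period just before the cancellation, might be sketched with a bounded history buffer. The window length, the class name, and the "any speech in the window" rule are assumptions made for this sketch.

```python
from collections import deque


class SpeechHistory:
    """Keeps the most recent VAD results so they can be consulted at release time."""

    def __init__(self, window_frames: int = 5) -> None:
        # deque with maxlen drops the oldest result automatically.
        self._recent = deque(maxlen=window_frames)

    def push(self, speech_present: bool) -> None:
        self._recent.append(speech_present)

    def was_speaking(self) -> bool:
        """True if any frame in the look-back window contained speech."""
        return any(self._recent)


def decide_on_release(history: SpeechHistory) -> str:
    """Decision taken at the moment the recording operation is canceled."""
    return "continue" if history.was_speaking() else "terminate"


history = SpeechHistory(window_frames=3)
for flag in [True, True, False, False]:
    history.push(flag)
first_decision = decide_on_release(history)   # window is [True, False, False]

history.push(False)
second_decision = decide_on_release(history)  # window is [False, False, False]
```

Because the history is already available, the decision needs no additional waiting at the time of release.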
[0037] In this specification, the statement that the recording control unit 123 "generates a recorded voice message" does not necessarily mean only that it writes the final message object as a self-contained voice file on the internal storage 103 of the terminal device 100. The recording control unit 123 also includes the mode in which it processes the recording data during button press and the voice data recorded by subsequent speech continuation as voice information that should be treated as a continuous or logically unified unit, and prepares or outputs this as a data stream, data packet group, or temporary buffer data for transmission to the server device 200. For example, the recording control unit 123 can encode a series of recorded voice data with an appropriate codec, add metadata such as timestamps and speaker information as needed, and transmit it to the server device 200 via the transmission unit 125 via streaming or sequential transmission in chunks. In this case, even if the server device 200 assembles the final voice message file or playable voice content from this received data, the recording control unit 123 of the terminal device 100 can be interpreted as having "generated" substantial voice information content intended for communication to others in a form suitable for transmission. The important thing is that the recording control unit 123 proactively performs a series of processes that contribute to ensuring that the user's speech is recorded without interruption, transmitted to the server as meaningful units of information, and ultimately transformed into a format that the intended recipient can recognize as a message.
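The chunked transmission described in [0037] can be sketched as splitting already-encoded voice data into numbered chunks that the server side reassembles. The `Chunk` fields, the chunk size, and both function names are hypothetical; no real codec or network call is involved.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Chunk:
    session_id: str  # identifies the logically unified recording
    seq: int         # position of this chunk within the session
    payload: bytes   # slice of the encoded voice data
    is_last: bool    # marks the final chunk of the message


def split_into_chunks(session_id: str, encoded: bytes, chunk_size: int = 4) -> List[Chunk]:
    """Split encoded voice data into sequentially numbered chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    parts = [encoded[i:i + chunk_size] for i in range(0, len(encoded), chunk_size)] or [b""]
    return [Chunk(session_id, n, part, n == len(parts) - 1) for n, part in enumerate(parts)]


def reassemble(chunks: Iterable[Chunk]) -> bytes:
    """Server-side view: restore the original byte order from the sequence numbers."""
    ordered = sorted(chunks, key=lambda c: c.seq)
    return b"".join(c.payload for c in ordered)


chunks = split_into_chunks("session-1", b"0123456789", chunk_size=4)
restored = reassemble(reversed(chunks))  # arrival order does not matter
```

The data recorded while the button was pressed and the data recorded afterwards share one `session_id`, which is one way to treat them as a single logical unit.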
[0038] Furthermore, the recording control unit 123 aims to improve usability and reduce errors by performing the following specific controls. First, when the speech determination unit 122 determines that speech is present, recording always continues until the preset minimum guaranteed recording time (for example, the period Tmin in Figure 8) has elapsed (corresponding to Appendix 4). This ensures that even very short first words uttered by a child (for example, a first cry for help) are reliably recorded.
[0039] Furthermore, even after the speech determination unit 122 determines that there is "no speech" (for example, silence continues for a certain period of time) (for example, from time t3 in Figure 8 onwards), the recording control unit 123 does not immediately terminate the recording, but continues recording until a preset speech termination delay time (for example, the period Tdelay in Figure 8) has elapsed (corresponding to Appendix 5). If the speech determination unit 122 determines again that there is "speech" within this delay time, the recording control unit 123 maintains the recording. This prevents the message from being cut off even if the child pauses to think or stumbles over their words in the middle of speaking. In the example in Figure 8, the delay time ends at time t4 and the recording stops.
[0040] Furthermore, even if the speech determination unit 122 does not immediately determine that there is speech after the detection unit 120 detects that the recording operation has been canceled (for example, at time t2 in Figure 8), the recording control unit 123 maintains a recording standby state for a preset buffer time (for example, the period Tbuf in Figure 8). If the speech determination unit 122 determines that there is speech within this buffer time, the recording control unit 123 starts (or continues) recording (corresponding to Appendix 6). This allows the terminal device 100 to handle situations where a child pauses for a moment after releasing the button before starting to speak.
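The interplay of the three timers in paragraphs [0038] to [0040] can be sketched as follows. The sketch works on per-frame VAD results observed after the recording operation is canceled and expresses Tbuf, Tmin, and Tdelay as frame counts. It assumes, for illustration only, that Tmin is measured from the first speech detected after the cancellation; the specification does not fix the starting point of Tmin, and the function name and defaults are hypothetical.

```python
from typing import Sequence


def continuation_stop_frame(vad_frames: Sequence[bool],
                            buf_frames: int = 10,
                            min_frames: int = 20,
                            delay_frames: int = 15) -> int:
    """Return the frame index (counted from button release) at which recording stops."""
    speech_started_at = None  # first frame with speech after release
    last_speech_at = None     # most recent frame with speech
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            if speech_started_at is None:
                speech_started_at = i
            last_speech_at = i
        if speech_started_at is None:
            # Standby state: wait up to Tbuf for speech to begin.
            if i >= buf_frames:
                return i
            continue
        silence = i - last_speech_at
        guaranteed = i - speech_started_at >= min_frames  # Tmin has elapsed
        if silence >= delay_frames and guaranteed:        # Tdelay has elapsed
            return i
    return len(vad_frames)  # input ended before a stop condition was met
```

For example, with the defaults above, a pause of 10 frames in the middle of speech does not end the recording, because speech resumes before the 15-frame delay expires.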
[0041] In this specification, various parameters such as minimum guaranteed recording time, speech termination delay time, buffer time, and recording time threshold may be referred to as "pre-set" values. This can encompass a variety of setting methods, including being set as default values during the manufacturing or initial startup of the terminal device 100, being explicitly set and changed by the user through a setting menu, or being remotely set and updated based on instructions from the server device 200. Furthermore, these parameters do not necessarily have to be fixed values. The recording control unit 123 or speech determination unit 122 of the terminal device 100 may dynamically and adaptively adjust them based on past user speech tendencies, the current ambient noise level, the battery level of the terminal device 100, communication status, or statistical information and control signals obtained from the server device 200. For example, the speech termination delay time may be set longer in environments with high noise levels, or the minimum guaranteed recording time may be shortened when the battery level is low. Such dynamic and adaptive parameter adjustments enable optimal recording continuation control in various usage situations, further improving usability and reliability.
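The two adjustments given as examples in paragraph [0041] can be sketched as follows. The scaling factors, the noise and battery thresholds, and all names are assumptions chosen for illustration; the specification states only the direction of each adjustment.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TimerParams:
    t_min: float = 2.0    # minimum guaranteed recording time (s), assumed default
    t_delay: float = 1.5  # speech termination delay time (s), assumed default
    t_buf: float = 1.0    # buffer time (s), assumed default


def adapt(base: TimerParams, noise_db: float, battery_pct: float,
          noisy_db: float = 70.0, low_battery_pct: float = 15.0) -> TimerParams:
    """Return adjusted parameters without modifying the preset base values."""
    params = base
    if noise_db >= noisy_db:
        # Noisy environment: wait longer before concluding that speech has ended.
        params = replace(params, t_delay=base.t_delay * 1.5)
    if battery_pct <= low_battery_pct:
        # Low battery: shorten the guaranteed recording time.
        params = replace(params, t_min=base.t_min * 0.5)
    return params
```

Keeping the preset values immutable and deriving adjusted values from them makes it straightforward to return to the defaults when conditions change.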
[0042] If the generated voice message is shorter than a preset recording time threshold (for example, less than 0.5 seconds), or if the speech determination unit 122 determines that no significant speech interval was detected throughout the entire recording period (i.e., the recording is almost silent or contains only ambient noise), the recording control unit 123 either suppresses the transmission of the voice message to the server device 200, or adds warning information to be displayed on the parent terminal 300 (for example, "short voice" or "no voice") before passing the message to the transmission unit 125 (corresponding to Appendix 7). This reduces unnecessary data transmission and the burden of verification on parents.
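The filtering in paragraph [0042] can be sketched as a small decision function. The 0.5 second threshold and the warning strings come from the examples in the text; the Verdict enum, the speech_sec input (total duration of detected speech intervals), and the suppress switch are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Optional, Tuple


class Verdict(Enum):
    SEND = "send"
    SEND_WITH_WARNING = "send_with_warning"
    SUPPRESS = "suppress"


def filter_message(duration_sec: float, speech_sec: float,
                   min_duration_sec: float = 0.5,
                   suppress: bool = True) -> Tuple[Verdict, Optional[str]]:
    """Classify a finished recording as sendable, suppressed, or sendable with a warning."""
    too_short = duration_sec < min_duration_sec
    no_speech = speech_sec <= 0.0  # no significant speech interval was detected
    if not (too_short or no_speech):
        return Verdict.SEND, None
    if suppress:
        return Verdict.SUPPRESS, None
    warning = "short voice" if too_short else "no voice"
    return Verdict.SEND_WITH_WARNING, warning
```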
[0043] If the detection unit 120 detects a recording cancel operation by the user (for example, a short press or tap of the operation button 104) during recording (including while recording continues after the recording operation has been canceled), the recording control unit 123 immediately interrupts the ongoing recording, discards the data recorded up to that point, and controls the transmission unit 125 so that nothing is transmitted to the server device 200 (corresponding to Appendix 8). The recording start operation (e.g., long press) and the recording cancel operation (e.g., short press) can be assigned as different types of operations (first operation and second operation) to a single physical operation button 104 (corresponding to Appendix 9). This allows even children to intuitively correct their mistakes.
[0044] The recording control unit 123 can apply noise reduction processing to the voice input received by the voice input unit 121 in real time, or to the generated voice messages in batches (corresponding to Appendix 10). This noise reduction processing suppresses, for example, constant background noise in the surroundings, and when combined with the speech determination unit 122 (VAD) specialized for children's voices, it is expected to improve both the accuracy of detecting children's speech in noisy environments and the clarity of recorded voice messages.
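The assignment of two operations to the single operation button 104 can be sketched as a classification by press duration. The 0.8 second long-press threshold, the returned action names, and the choice to ignore a long press while recording is already in progress are all assumptions for illustration and are not specified in the text.

```python
def handle_press(recording: bool, held_sec: float,
                 long_press_sec: float = 0.8) -> str:
    """Map one button press to an action, given whether recording is in progress."""
    if held_sec < 0:
        raise ValueError("held_sec must be non-negative")
    if held_sec >= long_press_sec:
        # First operation (long press): start recording if not already recording.
        return "start_recording" if not recording else "ignore"
    # Second operation (short press): cancel and discard an ongoing recording.
    return "cancel_and_discard" if recording else "ignore"
```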
[0045] Furthermore, in this embodiment, by appropriately combining voice activity detection in the speech determination unit 122 using a machine learning model trained on children's voice samples and environmental noise (corresponding to Appendix 3), the minimum guaranteed recording time set by the recording control unit 123 (corresponding to Appendix 4), the speech termination delay time (corresponding to Appendix 5), and noise reduction processing (corresponding to Appendix 10), it is possible to synergistically improve the detection accuracy, recording completeness, and listening clarity of children's voice messages, which are difficult to achieve with any single function, particularly under harsh conditions (for example, in a play facility where users speak softly and ambient noise is loud) (configuration corresponding to Appendix 12). For example, a high-precision VAD first detects a section that appears to be a child's voice, noise reduction processing improves the signal-to-noise ratio of that section, and the various timer controls compensate for the frequently interrupted speech characteristic of children, thereby increasing the possibility that even faint voices or halting speech that would have been lost with conventional technology can be recorded and transmitted as meaningful messages. This is an extremely effective combination for achieving the objective of the present invention in the monitoring of children.
[0046] The positioning unit 124 uses the GPS receiver 108 to periodically determine location information (latitude, longitude, altitude, positioning time, positioning accuracy, etc.) indicating the current location of the terminal device 100, or to determine it based on instructions from the recording control unit 123. In this specification, when the positioning unit 124 "determines the location information of the user," it is not limited to cases where the terminal device 100 has the function to ultimately calculate its own position coordinates. In addition to a configuration that calculates position coordinates by receiving signals from a satellite positioning system, such as the GPS receiver 108, the positioning unit 124 may broadly include functional components that collect or receive information necessary to determine the location of the terminal device 100 and make it available for transmission to the server device 200, such as a Wi-Fi scan module that scans for information on surrounding Wi-Fi access points and outputs that information, part of a cellular communication module that acquires information on nearby mobile phone base stations, or a communication module for receiving location information from an external device (e.g., a parent's smartphone) paired via short-range wireless communication (e.g., Bluetooth®).
[0047] The transmission unit 125 transmits the voice message generated by the recording control unit 123 and the location information determined by the positioning unit 124 (or collected by the positioning unit 124 for location identification) to a predetermined server device 200 via the communication interface 109. In this case, the transmission unit 125 can compress the voice data into an appropriate format (e.g., AAC or Opus) before transmission in order to reduce the data size of the voice message (corresponding to Appendix 11). The location information and the voice message do not necessarily have to be transmitted at the same time and may be transmitted at timings appropriate to their respective characteristics. For example, the location information may be transmitted periodically and the voice message after recording is complete.
[0048] In this embodiment, when the transmission unit 125 transmits the voice message generated by the recording control unit 123 to the server device 200, it is preferable to also transmit location information that is highly relevant to the voice message. Here, "highly relevant location information" refers, for example, to the latest location information determined by the positioning unit 124 at the time the recording of the voice message is completed or at the time the transmission process is started, or to the latest location identification information collected by the positioning unit 124. In particular, voice messages that are recorded to the end using the recording continuation function of the present invention are likely to contain important information that the user wants to convey. Therefore, by associating such voice messages with accurate location information (or equivalent information) at the time the message was generated and transmitting it to the server device 200, parents can gain a more detailed understanding of the situation. The recording control unit 123 may be configured to instruct the positioning unit 124 to acquire (or collect) the latest location information when it has finished generating the voice message, and the transmission unit 125 may transmit the voice message and the acquired (collected) latest location information (or information useful for location identification) to the server device 200 as a pair, or at close intervals.
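One way to select "highly relevant location information" is to pick the most recent positioning result obtained no later than the completion of the recording. The sketch below assumes that positioning results are kept as a list with timestamps; the Fix type, the function name, and the 60 second staleness limit are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Fix:
    lat: float
    lon: float
    fixed_at: float  # positioning time (seconds on the same clock as completed_at)


def latest_fix_for(fixes: Sequence[Fix], completed_at: float,
                   max_age_sec: float = 60.0) -> Optional[Fix]:
    """Return the newest fix taken at or before completed_at, or None if all are stale."""
    candidates = [f for f in fixes
                  if f.fixed_at <= completed_at
                  and completed_at - f.fixed_at <= max_age_sec]
    return max(candidates, key=lambda f: f.fixed_at, default=None)
```

When the function returns None, a terminal could instead request a fresh fix from the positioning unit 124 at the end of recording, as paragraph [0048] describes.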
[0049] The "voice message" transmitted to the server device 200 by the transmission unit 125 preferably includes voice data encoded using a common voice coding scheme such as AAC (Advanced Audio Coding), Opus, MP3 (MPEG-1 Audio Layer III), or AMR (Adaptive Multi-Rate). The voice message may also include metadata such as recording date and time, recording duration, and speaker identification information (e.g., terminal ID). Similarly, the transmitted "location information" preferably includes geographic coordinate information such as latitude, longitude, and altitude calculated directly by the terminal device 100, as well as related information such as positioning time, positioning accuracy, speed, and direction of movement. Alternatively, it may include information that allows the server device 200 to estimate or identify the final location based on this information, such as Wi-Fi access point information, base station information itself, or location information received from a paired external device. This information can be transmitted in a standard format such as NMEA (National Marine Electronics Association) format, or in a structured data format such as XML (Extensible Markup Language) or JSON (JavaScript Object Notation). Thus, the voice messages and location information transmitted from the terminal device 100 are intended to have a format and structure that is substantially usable by the server device 200 after it receives them, for interpretation, storage in a database, distribution to the parent terminal 300, or further information processing (e.g., voice recognition, behavioral analysis). This is distinct from simply transmitting raw sensor data or data in a proprietary format that requires special confidential information for decoding.
[0050] According to this embodiment, the terminal device 100 significantly simplifies the previously cumbersome process of sending voice messages by children and enhances the integrity of messages by having the processor (CPU 101) execute a specific program, thereby concretely improving the computer's functionality. Furthermore, the intuitive single-button operation system (see Appendix 9, Figure 7), recording continuation function, various parameter control (see Figure 8), and the ability to accommodate diverse operating methods (see Appendix 1, not shown) contribute to an improved user interface and realize information processing optimized for the specific operating characteristics of children. These points can be particularly advantageous technical features when considering foreign patent applications.
[0051] (Explanation of data structure) Next, with reference to Figure 4, an example of a data structure used in this embodiment will be described. Figure 4 shows an example of data that can be stored in the storage 103 of the terminal device 100 or the database of the server device 200. The voice message data 401 includes a message ID that uniquely identifies each voice message, the recorded voice data itself (or a pointer indicating its storage location), the date and time of recording, the ID of the sending terminal device, and the ID of the recipient (e.g., parent). The location information data 402 includes a location ID that uniquely identifies each positioning event, latitude, longitude, altitude, positioning date and time, positioning accuracy, and the ID of the associated terminal device. Alternatively, it may include source information for server-side location estimation (such as a Wi-Fi access point information list and base station information). These data may be managed in conjunction with other information as needed (e.g., terminal device battery level, communication status, etc.).
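The data items listed for Figure 4 can be sketched as two record types serialized to JSON, one of the structured formats mentioned in paragraph [0049]. The field names, the use of ISO 8601 strings for dates, and the JSON layout are assumptions; Figure 4 itself is not reproduced here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class VoiceMessageData:  # corresponds to voice message data 401
    message_id: str
    audio_ref: str           # pointer to where the recorded voice data is stored
    recorded_at: str         # date and time of recording
    sender_terminal_id: str
    recipient_id: str


@dataclass
class LocationData:  # corresponds to location information data 402
    location_id: str
    terminal_id: str
    measured_at: str         # positioning date and time
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    altitude_m: Optional[float] = None
    accuracy_m: Optional[float] = None
    # Source information for server-side location estimation.
    wifi_access_points: List[str] = field(default_factory=list)


def to_json(message: VoiceMessageData, location: LocationData) -> str:
    """Serialize a message and its associated location as one JSON document."""
    return json.dumps({"voice_message": asdict(message),
                       "location": asdict(location)}, sort_keys=True)
```

Leaving the coordinates optional allows the same record to carry only Wi-Fi access point information when the position is to be estimated on the server side.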
[0052] (Explanation of the processing flow) Next, an example of the processing flow according to this embodiment will be described with reference to Figures 5 and 6. Figure 5 is a flowchart showing an example of the voice message recording and transmission process in the terminal device 100. First, when the user performs a recording start operation, such as pressing and holding the operation button 104 (step S501: Yes), the detection unit 120 detects this and the recording control unit 123 starts recording the audio (step S502).
[0053] During recording, if the user releases their finger from the button or otherwise cancels the recording operation (step S503: Yes), the detection unit 120 detects this. At this point, the recording control unit 123 does not immediately terminate the recording. First, it starts the buffer time timer (not shown in Figure 5; see the concept of Appendix 6 and Tbuf in Figure 8). Next, the speech determination unit 122 determines whether the user's speech is continuing based on the voice input (step S504). This determination is made using the machine learning VAD or the like mentioned above.
[0054] If it is determined that speech is continuing (step S504: Yes), the recording control unit 123 continues recording (step S505). In this case, if a minimum guaranteed recording time is set, recording will continue until that time has elapsed (see Appendix 4, Tmin in Figure 8). Then, the process returns to step S504 to monitor whether speech is continuing or not.
[0055] On the other hand, if it is determined in step S504 that the speech is not continuing (has been interrupted) (step S504: No), the recording control unit 123 starts a speech termination delay timer (not shown in Figure 5; see the concept of Appendix 5 and Tdelay in Figure 8). If speech is detected again within this delay time (step S506: Yes), the process returns to step S505 and recording continues. If speech does not resume within the delay time (step S506: No), the recording ends (step S507). Furthermore, if speech begins within the buffer time immediately after the recording operation is canceled, the process proceeds to step S505 and recording continues.
[0056] When recording is finished (step S507), the recording control unit 123 generates a voice message based on the recorded voice data. At this time, erroneous recording filtering is performed (step S508, Appendix 7). For example, if the recording time is extremely short or there is no meaningful speech segment, transmission is canceled (flow branching) or a special flag is added. If the filtering result indicates that the data is suitable for transmission, noise reduction processing (step S509, Appendix 10) and audio compression processing (step S510, Appendix 11) are applied as needed.
[0057] Then, the positioning unit 124 determines the current location (or collects information useful for location identification) (step S511), and the transmission unit 125 transmits the generated (processed) voice message and location information (or information useful for location identification) to the server device 200 (step S512).
[0058] The series of processing steps shown in Figure 5 above (corresponding to Appendix 13) is a preferred embodiment of a voice message processing method executed by the processor (CPU 101) of the terminal device 100. This processing method not only allows the terminal device 100 to function standalone, but also provides core client-side (terminal device-side) processing for realizing a cloud-distributed service that works in conjunction with the server device 200 in a voice message system 1 as shown in Figure 1. Specifically, with this processing method, voice messages and highly relevant location information (or information useful for location identification) that have been appropriately generated, filtered, and compressed by the terminal device 100 are transmitted to the server device 200, enabling the server device 200 to provide cloud service-specific functions such as efficient data management, reliable notification to the parent terminal 300, and provision of useful information. In this way, this processing method plays an important role on the terminal side of a cloud-distributed system, contributing to the smooth operation and added value of the entire system. Furthermore, if the user performs the recording cancel operation at any point between the start and end of the recording (step S513: Yes, see Appendices 8 and 9), the recording process is interrupted and the message is not sent (step S514).
[0059] Figure 6 is a sequence diagram showing an example of the main processing sequence in the voice message system 1. (1) The person being monitored (user) performs the recording start operation on the terminal device 100. (2) The recording control unit 123 of the terminal device 100 starts recording. (3) The user cancels the recording operation. (4) The speech determination unit 122 and the recording control unit 123 of the terminal device 100 perform speech continuation determination and recording continuation processing (see S504 to S506 in Figure 5 and Figure 8). (5) After the speech is finished, the terminal device 100 generates a voice message and the positioning unit 124 determines (or collects) location information. (6) The transmission unit 125 of the terminal device 100 transmits the voice message and location information (or information useful for location identification) to the server device 200. (7) The server device 200 stores the received voice message and location information (or information useful for location identification) in a database or the like. Figure 10 shows an example of the functional configuration of such a server device 200, which includes a communication unit 1001, a data receiving unit 1002, a data storage unit 1003 (e.g., a database), a data processing unit 1004, an authentication unit 1005, a notification control unit 1006, a data transmission unit 1007, etc. The data receiving unit 1002 receives data from the terminal device 100 and stores it in the data storage unit 1003. Preferably, the server device 200 not only receives voice messages and location information (or information useful for location identification) from the terminal device 100, but also has the function of securely storing and managing this information in the data storage unit 1003 in association with the identification information of the person being monitored or the guardian.
Furthermore, the notification control unit 1006 can proactively notify the associated parent terminal 300 using various notification methods such as push notifications, email, and SMS (Short Message Service) when a new voice message is received from the terminal device 100, or when location information (or a location estimated therefrom) meets specific conditions (e.g., entering or leaving a preset geofence area). The data processing unit 1004 may also have functions to process received voice messages, such as speech recognition processing to convert them into text, detection of specific keywords (e.g., "help," "it hurts") to determine urgency and raise the notification priority, or sentiment analysis. Similarly, it can perform data processing based on accumulated location information (or a location estimated therefrom) to provide the parent terminal 300 with functions such as analysis of the monitored person's behavior patterns and display of movement history. The data transmission unit 1007, in response to a request from the parent terminal 300 and after appropriate authentication processing by the authentication unit 1005, transmits voice message data, location information (real-time, historical, or estimated location information), and analysis results from the data processing unit 1004 in a format that can be displayed by the application on the parent terminal 300. Through this processing on the server device 200 side, the voice message system 1 can provide a more advanced monitoring service that goes beyond mere information transmission and significantly improves the convenience and peace of mind of parents. (8) The notification control unit 1006 of the server device 200 notifies the parent terminal 300 of the arrival of a new message (for example, by a push notification). (9) The user of the parent terminal 300 operates the parent app and requests message confirmation.
(10) The data transmission unit 1007 of the server device 200 transmits the stored voice message and location information to the parent terminal 300. (11) The parent terminal 300 plays received voice messages and displays location information on a map, etc., using the functions of the parent app. The functions of this parent terminal 300 will be described later with reference to Figures 11 and 12.
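The geofence condition mentioned for the notification control unit 1006 can be sketched as an enter/exit check between two successive positions. The sketch assumes a circular geofence and uses the haversine formula on a spherical Earth of radius 6,371 km; the specification does not define the geofence shape or the distance calculation, and the function names are hypothetical.

```python
import math
from typing import Optional, Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points on a spherical Earth."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def geofence_event(prev: LatLon, curr: LatLon,
                   center: LatLon, radius_m: float) -> Optional[str]:
    """Return "enter" or "exit" when the boundary is crossed, otherwise None."""
    was_inside = haversine_m(*prev, *center) <= radius_m
    is_inside = haversine_m(*curr, *center) <= radius_m
    if is_inside and not was_inside:
        return "enter"
    if was_inside and not is_inside:
        return "exit"
    return None
```

A server could evaluate such a check each time new location information is stored and issue a notification only when an event is returned.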
[0060] (Examples of parent terminal functions and UI) Next, with reference to Figures 11 and 12, the functional configuration of the parent terminal 300 and an example of the UI (User Interface) screen of the parent app will be described.
[0061] Figure 11 shows the main functional blocks of the parent terminal 300. The parent terminal 300 includes a communication unit 1101 for communicating with the server device 200, a data processing unit 1102 for processing received data and application data, an audio playback unit 1103 for playing voice messages, a display control unit 1104 for displaying information on the display, an operation input unit 1105 for receiving touch operations from the user, and a control unit 1106 for comprehensively controlling these operations. These functions are realized by the parent terminal 300's processor executing a parent application (program).
[0062] The control unit 1106 receives voice messages generated by the terminal device 100 and location information of the person being monitored from the server device 200 via the communication unit 1101. The data processing unit 1102 processes the received voice messages into a format that can be played back by the audio playback unit 1103, and processes the received location information into a format that can be displayed on a map or the like by the display control unit 1104.
[0063] Figures 12A and 12B show examples of the UI screen of the parent app. Figure 12A is an example of the list display screen 1201 for received voice messages, where each message may display the sender (such as the name and icon of the person being monitored), the date and time of receipt, and the length of the message. When the user selects a specific message, the screen transitions to the message playback and location information display screen 1202, as shown in Figure 12B. On this screen, a play button 1203 for playing the voice message, a seek bar 1204 indicating the playback status, etc., are displayed, and the location information of the person being monitored associated with the voice message is displayed on the map 1205 with a marker 1206, etc. The parent can intuitively understand the location at the time the message was sent while listening to the audio.
[0064] Although embodiments of the present invention have been described above, the present invention is not limited to the above embodiments, and various modifications are possible without departing from the spirit of the invention. For example, although the above embodiments mainly described a case in which the terminal device is carried by a child, the invention can be applied to various uses, such as for the elderly or other people who need assistance, or for communication in specific work environments. Furthermore, the specific algorithms and thresholds for speech determination, the time settings for the various timers, the criteria for filtering erroneous recordings, and the specific methods for noise reduction and audio compression can be optimized as appropriate depending on the application, target users, and hardware performance. Furthermore, some of the components and processing steps described in the above embodiments can be omitted, combined with other elements or steps, or reordered. For example, location information may be acquired periodically, independently of the recording of the voice message, or the latest location information may be acquired at the time of voice message transmission. The functions of each unit described in the above embodiments may be implemented by a single hardware entity or distributed across multiple hardware entities. Furthermore, the components of the various embodiments described herein can be appropriately combined and implemented by those skilled in the art.
[0065] [General tasks] To improve the reliability of communication in mobile devices equipped with voice messaging capabilities. This invention aims to prevent unintended interruptions in voice message recording due to user actions on mobile devices with voice messaging capabilities, thereby preventing the transmission of incomplete information.
[0066] [Issues corresponding to Appendix 1] To provide a basic configuration that enhances the reliability of communication in monitoring applications by continuing recording if speech continues even after the user has canceled the recording operation, thereby improving the integrity of the message, and by transmitting the message along with location information. [Appendix 1] A terminal device comprising: a detection unit that detects the start of a recording operation for a voice message by a user and the cancellation of the recording operation; a voice input unit that accepts voice input; a speech determination unit that determines, based on the voice input from the voice input unit, whether or not the user has spoken after the recording operation has been canceled; a recording control unit that, if the speech determination unit determines that there is speech, continues the recording that began with the start of the recording operation and generates a recorded voice message; a positioning unit that determines the location information of the user; and a transmission unit that transmits the generated voice message and the determined location information to a predetermined server. (Effects of Appendix 1) Even after the user cancels the recording operation, recording continues as long as speech is ongoing, which suppresses unintended interruptions to the recording and enhances the integrity of the message. Furthermore, by transmitting location information along with the message, guardians can accurately understand the situation and location of the person being monitored. Here, the positioning unit of the terminal device is not necessarily limited to one that can calculate location coordinates independently, but may also include a function to collect and provide the information necessary to ultimately identify the user's location in cooperation with a server or the like, thereby providing a practical location identification function while suppressing the cost and power consumption of the terminal device.
In addition, this process of "determining whether speech is being spoken after the recording operation is canceled...and continuing the recording" is not necessarily limited to cases where the recording operation cancellation event, the speech presence / absence determination event, and the recording continuation control event are executed sequentially and completely separately in time. For example, it also includes a mode in which the decision to continue recording and execution are performed substantially without interruption based on the speech state at the time the recording operation is canceled, thereby ensuring message integrity while achieving highly responsive recording continuation control. Furthermore, the term "generates" a voice message by this terminal device does not only mean generating it as a complete file on the terminal, but also includes outputting it as a voice data stream or data segment suitable for transmission, with the understanding that it will be assembled into a final message on the server. This can contribute to improved real-time performance and efficient use of terminal resources.
[0067] [Issues corresponding to Appendix 2] To improve the accuracy of determining whether or not speech has been uttered. [Note 2] The terminal device described in Appendix 1, The speech determination unit determines whether or not the user is speaking by detecting voice activity using a machine learning model. Terminal device. (Effects due to Appendix 2) By using machine learning models for speech activity detection, the presence or absence of speech can be determined with higher accuracy than simple volume-based judgments, reducing false detections in noisy environments and improving the reliability of continued recording.
[0068] [Issues corresponding to Appendix 3] In particular, we aim to further improve the accuracy of determining whether or not a child is speaking. [Note 3] The terminal device described in Appendix 1, The speech determination unit determines whether or not the user is speaking by detecting speech activity using a machine learning model that has learned from children's voice samples and environmental noise. Terminal device. (Effects of Appendix 3) By using machine learning models specifically tailored to children's speech characteristics and usage environments, we can more reliably detect children's speech, improve their ability to distinguish it from adult speech and other sounds, and enhance the performance of the device as a child-friendly device.
[0069] [Issues corresponding to Appendix 4] Be sure to record even the shortest, most important remarks your child makes. [Note 4] The terminal device described in Appendix 1, If the recording control unit determines that speech has been heard, it will continue recording for a predetermined minimum guaranteed recording time. Terminal device. (Effects of Appendix 4) It ensures that even the first few words a child hesitantly utters, or fragments of SOS signals, are recorded, preventing any messages from being missed.
[0070] [Issues corresponding to Appendix 5] To prevent the recording from being interrupted when a child stumbles over words or pauses briefly while speaking. [Note 5] The terminal device according to Appendix 1, wherein the recording control unit continues the recording until a preset utterance termination delay time has elapsed after the speech determination unit determines that the user's utterance has been interrupted, and maintains the recording if the speech determination unit detects speech again within the delay time. (Effects of Appendix 5) This accommodates the way children characteristically speak, prevents recordings from ending abruptly at natural pauses, and improves the completeness of the message.
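The utterance termination delay (Tdelay) behaves like a hangover timer that restarts whenever speech resumes. The sketch below assumes the speech determination result arrives as time-ordered (timestamp, is_speech) pairs; that input format, the function name, and the 1.5-second default are assumptions.

```python
from typing import Iterable, Optional, Tuple


def recording_end_time(
    frames: Iterable[Tuple[float, bool]], t_delay_s: float = 1.5
) -> Optional[float]:
    """Return the timestamp at which recording stops, or None if it never does.

    frames: time-ordered (timestamp_s, is_speech) results from a
    hypothetical speech determination step.
    """
    silence_started: Optional[float] = None
    for timestamp, is_speech in frames:
        if is_speech:
            silence_started = None  # speech resumed: cancel the pending stop
        elif silence_started is None:
            silence_started = timestamp  # utterance interrupted: start timer
        elif timestamp - silence_started >= t_delay_s:
            return timestamp  # silence lasted for Tdelay: end the recording
    return None
```

A pause shorter than Tdelay leaves the recording running, so a child who hesitates mid-sentence still produces a single message.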
[0071] [Issues corresponding to Appendix 6] To handle cases where the timing of a child's button operation and the start of speech do not coincide. [Note 6] The terminal device according to Appendix 1, wherein, after the detection unit detects cancellation of the recording operation, the recording control unit continues the recording if the speech determination unit determines that the user has started speaking within a predetermined buffer time. (Effects of Appendix 6) Tolerating a short time lag between releasing the button and starting to speak prevents the beginning of a message from being missed and improves usability.
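The buffer time (Tbuf) can be sketched as a window that opens when the recording operation is canceled. The sketch assumes time-ordered (timestamp, is_speech) results; the function name, input format, and 1-second default are hypothetical.

```python
from typing import Iterable, Tuple


def should_continue_after_release(
    release_time_s: float,
    frames: Iterable[Tuple[float, bool]],
    t_buf_s: float = 1.0,
) -> bool:
    """Return True if speech starts within Tbuf after the button release.

    frames: time-ordered (timestamp_s, is_speech) results (an assumption).
    """
    for timestamp, is_speech in frames:
        if timestamp - release_time_s > t_buf_s:
            break  # the buffer window has closed without speech
        if is_speech and timestamp >= release_time_s:
            return True
    return False
```

Speech that begins after the window has closed does not extend the recording, which bounds how long the device keeps listening after release.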
[0072] [Issues corresponding to Appendix 7] To prevent accidentally recorded, unnecessary voice messages from being sent, thereby reducing the checking burden on parents and avoiding wasted communication resources. [Note 7] The terminal device according to Appendix 1, wherein, if the generated voice message is shorter than a preset recording time threshold, or if the speech determination unit determines that no significant speech interval was detected, the recording control unit either suppresses transmission to the server or adds warning information and causes the transmission unit to transmit the message. (Effects of Appendix 7) This prevents the transmission of essentially meaningless voice messages and noise-only recordings, improving system efficiency and reliability.
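The transmission decision in Appendix 7 can be sketched as a small rule. The type names, field names, the 1-second threshold, and the flag selecting between suppression and warning are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SEND = "send"
    SEND_WITH_WARNING = "send_with_warning"
    SUPPRESS = "suppress"


@dataclass
class Recording:
    duration_s: float
    has_significant_speech: bool


def decide_transmission(
    recording: Recording,
    min_duration_s: float = 1.0,
    warn_instead_of_suppress: bool = False,
) -> Action:
    """Pick what to do with a finished recording (hypothetical policy)."""
    questionable = (
        recording.duration_s < min_duration_s
        or not recording.has_significant_speech
    )
    if not questionable:
        return Action.SEND
    # Either alternative named in the text is selectable via the flag.
    if warn_instead_of_suppress:
        return Action.SEND_WITH_WARNING
    return Action.SUPPRESS
```

Sending with a warning rather than suppressing keeps a possibly important short message available to the parent while still marking it as doubtful.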
[0073] [Issues corresponding to Appendix 8] To make it easy to stop a recording that a child has started by mistake. [Note 8] The terminal device according to Appendix 1, wherein the detection unit further detects a recording cancellation operation by the user, and, when the recording cancellation operation is detected, the recording control unit interrupts the ongoing recording and controls the transmission unit so that the generated voice message is not sent to the server. (Effects of Appendix 8) This gives users a way to cancel a recording themselves after an accidental operation, improving peace of mind and ease of use.
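The cancellation behavior can be sketched as a session object that discards its audio and yields nothing to transmit once canceled. The class and method names are hypothetical, and discarding the frames immediately is one possible design, not something the text requires.

```python
from typing import List, Optional


class RecordingSession:
    """Holds the audio of one recording; all names here are hypothetical."""

    def __init__(self) -> None:
        self.frames: List[bytes] = []
        self.recording = False
        self.cancelled = False

    def start(self) -> None:
        self.frames.clear()
        self.recording = True
        self.cancelled = False

    def add_frame(self, frame: bytes) -> None:
        if self.recording:  # frames arriving after a stop are ignored
            self.frames.append(frame)

    def cancel(self) -> None:
        """Interrupt the recording and drop what was captured."""
        self.recording = False
        self.cancelled = True
        self.frames.clear()

    def finish(self) -> Optional[bytes]:
        """Return the message to transmit, or None if nothing should be sent."""
        self.recording = False
        if self.cancelled or not self.frames:
            return None
        return b"".join(self.frames)
```

Returning None from finish gives the transmission step a single condition to check before sending anything to the server.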
[0074] [Issues corresponding to Appendix 9] To provide a simple user interface that allows even children to intuitively start and cancel recording. [Note 9] The terminal device according to Appendix 1, wherein the start of the recording operation detected by the detection unit is a first operation on a specific physical control, and the recording cancellation operation detected by the detection unit is a second operation on the same physical control that differs from the first operation. (Effects of Appendix 9) The main voice functions can be operated with one or a small number of physical buttons, improving ease of use for children.
[0075] [Issues corresponding to Appendix 10] To improve the clarity of voice messages recorded in noisy environments. [Note 10] The terminal device according to Appendix 1, wherein the recording control unit applies noise reduction processing to the voice input received by the voice input unit or to the generated voice message. (Effects of Appendix 10) Reducing the influence of ambient noise lets parents hear their children's voices more clearly, which improves the quality of communication.
[0076] [Issues corresponding to Appendix 11] To reduce the amount of data used when sending voice messages and improve communication efficiency. [Note 11] The terminal device according to Appendix 1, wherein the transmission unit compresses the generated voice message before transmitting it to the server. (Effects of Appendix 11) Reducing the amount of transmitted data increases the transmission success rate even where the communication environment is unstable, and reduces communication costs and server storage load.
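The text does not name a compression scheme. As a dependency-free stand-in, the sketch below uses the general-purpose `zlib` module from the Python standard library. A real device might well use a speech codec instead, which is an assumption and not something the specification states. The function names are hypothetical.

```python
import zlib


def compress_message(pcm: bytes, level: int = 6) -> bytes:
    """Losslessly compress raw audio bytes before transmission (stand-in)."""
    return zlib.compress(pcm, level)


def decompress_message(payload: bytes) -> bytes:
    """Server-side inverse of compress_message."""
    return zlib.decompress(payload)
```

Lossless compression guarantees that the server recovers exactly what was recorded, at the cost of a lower compression ratio than a lossy speech codec would typically give.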
[0077] [Issues corresponding to Appendix 12] To synergistically improve the detection accuracy, recording integrity, and listening clarity of children's voice messages, especially under difficult conditions. [Note 12] The terminal device according to Appendix 1, wherein the speech determination unit determines whether or not the user is speaking by performing voice activity detection (VAD) using a machine learning model trained on children's voice samples and environmental noise, and the recording control unit, when it is determined that speech is present, continues the recording based on a preset minimum guaranteed recording time and a preset utterance termination delay time, and applies noise reduction processing to the voice input received by the voice input unit or to the generated voice message. (Effects of Appendix 12) Combining child-specific VAD, timer control based on the minimum guaranteed recording time and the utterance termination delay time, and noise reduction processing significantly improves the reliability and quality of message transmission under harsh conditions, which would be difficult to achieve with any one of these functions alone.
[0078] [Issues corresponding to Appendix 13] (Same issues as in Appendix 1) [Note 13] A voice message processing method in which a processor executes: a step of detecting the start of a recording operation for a voice message by a user and cancellation of the recording operation; a step of accepting voice input; a step of determining, based on the accepted voice input, whether or not the user is speaking after the recording operation has been canceled; a step of, when the determining step determines that speech is present, continuing the recording that followed the start of the recording operation and generating a recorded voice message; a step of measuring location information of the user; and a step of transmitting each of the generated voice message and the measured location information to a predetermined server. (Effects of Appendix 13) (Same effects as in Appendix 1)
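The sequence of steps in the method can be sketched end to end with the hardware-dependent parts injected as callables. This is a simplified illustration under stated assumptions: it stops at the first non-speech frame after release (no Tmin, Tdelay, or Tbuf handling), it sends the message and location in one call although the text allows them to be sent separately, and every name in it is hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Location = Tuple[float, float]  # (latitude, longitude), an assumed format


def process_voice_message(
    frames_while_pressed: Iterable[bytes],
    frames_after_release: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    get_location: Callable[[], Location],
    send: Callable[[bytes, Location], None],
) -> bytes:
    """Record, extend past the release while speech continues, then send."""
    # Recording from the start of the operation until it is canceled.
    recorded: List[bytes] = list(frames_while_pressed)
    # After cancellation, keep recording only while speech is determined.
    for frame in frames_after_release:
        if not is_speech(frame):
            break
        recorded.append(frame)
    message = b"".join(recorded)
    # Measure the location and transmit both pieces of data.
    send(message, get_location())
    return message
```

Injecting `is_speech`, `get_location`, and `send` keeps the control flow testable without a microphone, a GPS receiver, or a network connection.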
[0079] [Issues corresponding to Appendix 14] (Addresses the same issues as Appendix 1, in the form of a program.) [Note 14] A program that causes a processor to execute processing of: detecting the start of a recording operation for a voice message by a user and cancellation of the recording operation; accepting voice input; determining, based on the accepted voice input, whether or not the user is speaking after the recording operation has been canceled; when it is determined that speech is present, continuing the recording that followed the start of the recording operation and generating a recorded voice message; measuring location information of the user; and transmitting each of the generated voice message and the measured location information to a predetermined server. (Effects of Appendix 14) (Provides the same effects as Appendix 1, in the form of a program.)
[0080] [Issues corresponding to Appendix 15] (Addresses the same issues as Appendix 1, in the form of a system.) [Note 15] A voice message system comprising: a terminal device including a detection unit that detects the start of a recording operation for a voice message by a user and cancellation of the recording operation, a voice input unit that accepts voice input, a speech determination unit that determines, based on the voice input from the voice input unit, whether or not the user is speaking after the recording operation has been canceled, a recording control unit that, when the speech determination unit determines that speech is present, continues the recording that followed the start of the recording operation and generates a recorded voice message, a positioning unit that measures location information of the user, and a transmission unit that transmits each of the generated voice message and the measured location information; and a server device that receives the voice message and the location information from the terminal device. (Effects of Appendix 15) (Provides the same effects as Appendix 1, in the form of a system.)
[0081] [Issues corresponding to Appendix 16] To have the parent's device properly receive the voice message, whose reliability has been improved on the terminal device, together with the related location information, and present them in a form that is easy for the parent to understand. [Note 16] A program that causes a processor to execute processing of: receiving, from a predetermined server device, a voice message that was generated by continuing the recording because a user was still speaking after canceling the voice message recording operation on a terminal device carried by a monitored person, together with location information of the monitored person; making the received voice message playable; and displaying the received location information on a display unit. (Effects of Appendix 16) The parent's device receives the uninterrupted voice message and the associated location information from the server device, plays the voice message, and displays the location information, allowing parents to understand the situation of the monitored person more accurately and with greater confidence.
[Explanation of symbols]
[0082]
1 Voice message system
100 Terminal device
101 CPU
102 RAM
103 Storage
104 Operation button
105 Microphone
106 Speaker
107 Display
108 GPS receiver
109 Communication interface
110 Bus
120 Detection unit
121 Voice input unit
122 Speech determination unit
123 Recording control unit
124 Positioning unit
125 Transmission unit
200 Server device
300 Parental device
401 Voice message data
402 Location data
701 Recording icon
702 Signal strength icon
703 Battery level icon
704 Time display
901 Speech feature extraction unit
902 Pre-trained VAD model
903 Children's voice database
904 Environmental noise database
905 Speech presence/absence determination result
1001 Communication unit (server device)
1002 Data receiving unit (server device)
1003 Data storage unit (server device)
1004 Data processing unit (server device)
1005 Authentication unit (server device)
1006 Notification control unit (server device)
1007 Data transmission unit (server device)
1101 Communication unit (parental device)
1102 Data processing unit (parental device)
1103 Audio playback unit (parental device)
1104 Display control unit (parental device)
1105 Operation input unit (parental device)
1106 Control unit (parental device)
1201 Message list display screen
1202 Message playback / location information display screen
1203 Play button
1204 Seek bar
1205 Map
1206 Marker
NW Network (communication network)
S501 to S514 Steps
Tmin Minimum guaranteed recording time
Tdelay Utterance termination delay time
Tbuf Buffer time
t1 to t4 Times
Claims
1. A terminal device comprising: a detection unit that detects the start of a recording operation for a voice message by a user and cancellation of the recording operation; a voice input unit that accepts voice input; a speech determination unit that determines, based on the voice input from the voice input unit, whether or not the user is speaking after the recording operation has been canceled; a recording control unit that, when the speech determination unit determines that speech is present, continues the recording that followed the start of the recording operation and generates a recorded voice message; a positioning unit that measures location information of the user; and a transmission unit that transmits the generated voice message and the measured location information to a predetermined server.
2. The terminal device according to claim 1, wherein the speech determination unit determines whether or not the user is speaking by performing voice activity detection using a machine learning model.
3. The terminal device according to claim 1, wherein the speech determination unit determines whether or not the user is speaking by performing voice activity detection using a machine learning model trained on children's voice samples and environmental noise.
4. The terminal device according to claim 1, wherein, when it is determined that speech is present, the recording control unit continues the recording for a predetermined minimum guaranteed recording time.
5. The terminal device according to claim 1, wherein the recording control unit continues the recording until a preset utterance termination delay time has elapsed after the speech determination unit determines that the user's utterance has been interrupted, and maintains the recording if the speech determination unit detects speech again within the delay time.
6. The terminal device according to claim 1, wherein, after the detection unit detects cancellation of the recording operation, the recording control unit continues the recording if the speech determination unit determines that the user has started speaking within a predetermined buffer time.
7. The terminal device according to claim 1, wherein, if the generated voice message is shorter than a preset recording time threshold, or if the speech determination unit determines that no significant speech interval was detected, the recording control unit either suppresses transmission to the server or adds warning information and causes the transmission unit to transmit the message.
8. The terminal device according to claim 1, wherein the detection unit further detects a recording cancellation operation by the user, and, when the recording cancellation operation is detected, the recording control unit interrupts the ongoing recording and controls the transmission unit so that the generated voice message is not sent to the server.
9. The terminal device according to claim 1, wherein the start of the recording operation detected by the detection unit is a first operation on a specific physical control, and the recording cancellation operation detected by the detection unit is a second operation on the same physical control that differs from the first operation.
10. The terminal device according to claim 1, wherein the recording control unit applies noise reduction processing to the voice input received by the voice input unit or to the generated voice message.
11. The terminal device according to claim 1, wherein the transmission unit compresses the generated voice message before transmitting it to the server.
12. The terminal device according to claim 1, wherein the speech determination unit determines whether or not the user is speaking by performing voice activity detection using a machine learning model trained on children's voice samples and environmental noise, and the recording control unit, when it is determined that speech is present, continues the recording based on a preset minimum guaranteed recording time and a preset utterance termination delay time, and applies noise reduction processing to the voice input received by the voice input unit or to the generated voice message.
13. A voice message processing method in which a processor executes: a step of detecting the start of a recording operation for a voice message by a user and cancellation of the recording operation; a step of accepting voice input; a step of determining, based on the accepted voice input, whether or not the user is speaking after the recording operation has been canceled; a step of, when the determining step determines that speech is present, continuing the recording that followed the start of the recording operation and generating a recorded voice message; a step of measuring location information of the user; and a step of transmitting each of the generated voice message and the measured location information to a predetermined server.
14. A program that causes a processor to execute processing of: detecting the start of a recording operation for a voice message by a user and cancellation of the recording operation; accepting voice input; determining, based on the accepted voice input, whether or not the user is speaking after the recording operation has been canceled; when it is determined that speech is present, continuing the recording that followed the start of the recording operation and generating a recorded voice message; measuring location information of the user; and transmitting each of the generated voice message and the measured location information to a predetermined server.
15. A voice message system comprising: a terminal device including a detection unit that detects the start of a recording operation for a voice message by a user and cancellation of the recording operation, a voice input unit that accepts voice input, a speech determination unit that determines, based on the voice input from the voice input unit, whether or not the user is speaking after the recording operation has been canceled, a recording control unit that, when the speech determination unit determines that speech is present, continues the recording that followed the start of the recording operation and generates a recorded voice message, a positioning unit that measures location information of the user, and a transmission unit that transmits each of the generated voice message and the measured location information; and a server device that receives the voice message and the location information from the terminal device.
16. A program that causes a processor to execute processing of: receiving, from a predetermined server device, a voice message that was generated by continuing the recording because a user was still speaking after canceling the voice message recording operation on a terminal device carried by a monitored person, together with location information of the monitored person; making the received voice message playable; and displaying the received location information on a display unit.