Information processing device, voice message processing method, program, and information processing system

The information processing device addresses the challenge of managing voice messages across devices with different interfaces by enabling seamless transfer and storage, ensuring efficient and user-friendly management of voice messages.

JP2026109511A · Pending · Publication Date: 2026-07-01 · MIXI INC

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
MIXI INC
Filing Date
2025-07-01
Publication Date
2026-07-01

Abstract

[Problem] To provide a systematic mechanism for permanently saving and utilizing voice messages sent from simple devices such as GPS terminals, without requiring extra effort from users. With conventional technologies, voice messages were lost once the server's storage period expired, or users had to save them manually. [Solution] An information processing device receives voice messages transmitted from a GPS terminal and forwards them to a display terminal. The information processing device extracts voice characteristics, keywords, contextual information, and the like from each voice message and scores the message's value for saving. For messages with high scores, it sends a notification to the display terminal recommending saving. In addition, upon request from the display terminal, it sends the voice message in a data format that can be saved. This prevents users from missing valuable messages and allows them to easily save and utilize the data.

Description

Technical Field

[0001] The present invention relates to an information processing apparatus, particularly an information processing apparatus that processes voice messages transmitted and received between user terminals, as well as related voice message processing methods and programs. More specifically, the present invention relates to the fields of communication systems, multimedia processing, natural language processing, machine learning, and artificial intelligence.

Background Art

[0002] In recent years, mobile terminals such as smartphones and dedicated terminals equipped with GPS (Global Positioning System) functions (hereinafter referred to as "GPS terminals") have become widely popular. Using these terminals, users can easily transmit and receive voice messages. For example, a communication service is known in which a guardian gives a GPS terminal to a child and the guardian's smartphone receives the voice messages sent by the child.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In such services, transmitted and received voice messages are usually stored temporarily on a server (cloud server) on the network, and users can play back the messages for a certain period of time. However, for reasons of cloud server management, voice messages may be deleted after a predetermined data retention period has elapsed. In that case, important messages that users want to listen to again later may be lost. Furthermore, even when users save voice messages to their own devices, they must explicitly perform the saving operation, which is cumbersome. In particular, there has been no systematic mechanism for permanently and easily saving and utilizing, on devices with rich interfaces such as smartphones, voice messages sent from devices with limited input interfaces such as GPS terminals. The present invention has been made in view of the above problems of the prior art, and aims to provide a novel information processing device, voice message processing method, and program that smoothly link voice message data between devices with different characteristics, enable users to easily save and utilize such data in a desired format, and thereby improve user convenience.

Means for Solving the Problem

[0005] To solve the above problems, an information processing device according to one aspect of the present invention is an information processing device that communicates with user terminals, including a GPS terminal and a display terminal having a display unit, and processes voice messages transmitted and received between the user terminals, comprising: a transfer means for transferring a voice message transmitted from the GPS terminal to the display terminal; and a data transmission means for transmitting the voice message, in response to a request from the display terminal, in a data format that can be stored on the display terminal.

Brief Description of the Drawings

[0006]
[Figure 1] This is an overview diagram showing the overall configuration of the voice message processing system according to this embodiment.
[Figure 2] This is a block diagram showing an example of the hardware configuration of the information processing device according to this embodiment.
[Figure 3] This is a block diagram showing the functional configuration of the information processing device according to this embodiment.
[Figure 4] This is a sequence diagram showing the processing flow in this embodiment.
[Figure 5] This is a flowchart of the value evaluation and notification processes performed by the information processing device according to this embodiment.
[Figure 6] This figure shows an example of the audio playback screen of the display terminal according to this embodiment.
[Figure 7] This figure shows an example of a notification screen with an action button according to this embodiment.
[Figure 8] This figure shows an example of a data structure used in this embodiment.
[Figure 9] This diagram shows the flow of the video generation process in this embodiment.
[Figure 10] This figure shows an example of a message list screen according to this embodiment.
[Figure 11] This figure shows an example of the video generation settings screen according to this embodiment.
[Figure 12] This figure shows an example of the storage policy settings screen according to this embodiment.

Modes for Carrying Out the Invention

[0007] Embodiments of the present invention will be described in detail below with reference to the drawings.

[0008] (System Configuration) Figure 1 is an overview diagram showing the overall configuration of the voice message processing system 1 according to this embodiment. As shown in Figure 1, the system 1 is configured such that an information processing device 100 and a user terminal 200 are connected to each other via a network NW so that they can communicate with one another.

[0009] The user terminal 200 includes a GPS terminal 210 used by users such as children, and a display terminal 220 used by users such as guardians. The GPS terminal 210 is a small terminal equipped with GPS functionality, a microphone, and communication functionality, and is primarily used for recording and transmitting voice messages. The display terminal 220 is, for example, a smartphone or tablet terminal, and is equipped with a display unit 221, a speaker, a microphone, and communication functionality, and has dedicated application software (hereinafter referred to as "app") installed.

[0010] The information processing device 100 is, for example, a cloud server composed of one or more server computers, and functions as the core of this system 1. Figure 2 is a block diagram showing an example of the hardware configuration of the information processing device 100. The information processing device 100 is a known computer equipped with a CPU (Central Processing Unit) 101, RAM (Random Access Memory) 102, ROM (Read Only Memory) 103, a communication interface 104, and storage 105. The storage 105 stores a program for executing the processing according to the present invention and various data (keyword dictionary, standard model, etc., described later). The CPU 101 reads this program into RAM 102 and executes it, thereby realizing each of the functional units described later.

[0011] (Functional Configuration) The voice message processing system of the present invention achieves its objectives as a whole through the collaborative operation of the information processing device 100 and the user terminal 200. While the functional units of the information processing device 100 described below (e.g., message analysis unit 130, scoring unit 140, etc.) are described in this embodiment using the example of implementation on the information processing device 100, the technical concept of the present invention is not limited to this. Importantly, a configuration in which some or all of these functions are performed on the display terminal 220 communicating with the information processing device 100 is also within the scope of the present invention. For example, a distributed processing configuration in which the information processing device 100 handles basic data transfer, while applications on the display terminal 220 handle more advanced processing such as analysis and scoring to determine the value of voice messages, or video generation, is also one embodiment of the present invention. Even with such a distributed processing configuration, the information processing device 100 and the display terminal 220 cooperate via a network NW, and the entire system analyzes and evaluates voice messages transmitted from the GPS terminal, encouraging users to save and utilize them. This solves the problem that the present invention aims to address and produces the same effects. Therefore, the processing described herein as being performed by the "information processing device" may include cases where a portion of the processing is performed on the display terminal side, as long as the processing is performed by the entire system.

[0012] Figure 3 is a block diagram showing the functional configuration of the information processing device 100. The information processing device 100 includes a transfer unit 110, a data transmission unit 120, a message analysis unit 130, a scoring unit 140, a notification unit 150, a standard model generation unit 160, a video generation unit 170, and an external linkage unit 180. These functional units are realized by the CPU 101 executing a predetermined program or by dedicated hardware circuits. Details of each functional unit will be described below.

[0013] <Transfer unit 110 (transfer means) and data transmission unit 120 (data transmission means)> The transfer unit 110 and the data transmission unit 120 are responsible for the basic data relay function in this system.

[0014] The transfer unit 110 receives the voice message transmitted from the GPS terminal 210 via an HTTP(S) request or the like and transfers it to the destination display terminal 220. Specifically, the received voice data is stored in the storage 105, and the metadata of the voice message (sender information, receiver information, timestamp, etc.) is recorded in the message table 802 shown in Figure 8. Then, using a push notification service such as Apple Push Notification Service (APNs) or Firebase Cloud Messaging (FCM), a payload indicating that a new message has been received is transmitted to the app on the display terminal 220. This lets the user learn of the message's arrival in real time.

[0015] Note that the payload of the push notification transmitted by the transfer unit 110 to the display terminal 220 can include not only information indicating the existence of a new message, but also a summary or part of the accompanying information received together with the voice message from the GPS terminal 210 (for example, location information, a timestamp, and status information such as the remaining battery level of the GPS terminal). As a result, the user of the display terminal 220 can grasp the general situation on the notification screen before opening the app, which enables smoother information sharing that serves the purposes of this system, such as confirming the safety of the child using the GPS terminal.

[0016] In addition, "transfer" in this embodiment refers to a series of processes that make the existence of a voice message received by the information processing device from the GPS terminal, or the message itself, recognizable to the display terminal, and is not limited to a specific implementation method. For example, the above describes a form in which the information processing device sends a push notification to the display terminal (push type), but the invention is not limited to this. As a modification, an application on the display terminal may periodically query the information processing device for new messages (polling), and the information processing device may respond to the query by reporting the existence of the message (pull type). Even in such a pull-type configuration, the voice message transmitted from the GPS terminal is ultimately delivered to the display terminal via the information processing device, so the configuration falls within the technical idea of the present invention of "transferring" a message from the GPS terminal to the display terminal. In either form, the problem of smoothly linking voice message data between terminals with different characteristics is solved, and the same operational effects are obtained.

[0017] Note that in this specification, "transfer" is not limited to the mode in which the information processing device directly relays the data body of the voice message received from the GPS terminal and transmits it to the display terminal. It includes all processes in which the information processing device notifies the display terminal, in a recognizable form, of the existence of the voice message transmitted from the GPS terminal or of where its data can be retrieved (e.g., a URL for data acquisition). Furthermore, a configuration in which the information processing device mediates and manages direct communication (P2P communication) between user terminals, with the result that the voice message is delivered to the display terminal, may also be regarded as one aspect of "transfer" in that the information processing device substantially controls the arrival of the message.

[0018] The data transmission unit 120, in response to a request from the display terminal 220, sends the voice message stored in the storage 105 to the display terminal 220 in a data format that can be stored on the display terminal 220. For example, as shown in the example screen in Figure 6, when a user taps the download button 601 on the voice playback screen 600 of the app on the display terminal 220, the app sends a download request including the target message ID to the API endpoint (e.g., /api/v1/messages/{message_id}/download).

[0019] More specifically, the download request transmitted from the display terminal 220 may include parameters specifying the desired data format and bitrate, depending on the storage purpose the user selects in the app (e.g., "Save in high quality," "Share on social media") and the current communication environment. The data transmission unit 120 interprets these parameters and dynamically encodes the single original audio data source stored in the storage 105 as requested, generating and transmitting a variety of data formats, such as uncompressed WAV for permanent storage or MP3 with a smaller file size for social media sharing. This dynamic format conversion function allows users to obtain audio data in the form best suited to their purpose and circumstances, achieving the objective of the present invention, which is to "save and utilize the data in a desired format and in a convenient manner."

[0020] Furthermore, in this embodiment, "in response to a request from the display terminal" broadly refers to the system reflecting the user's intention to listen to or save a message, and is not necessarily limited to a sequence in which data transmission occurs after an explicit operation by the user. As a further modification, when the transfer unit 110 notifies the display terminal of the existence of a voice message, the data transmission unit 120 may include all or part of the voice data in the notification (e.g., the payload of a push notification) and send it to the display terminal in advance, where it is temporarily cached. In this case, the act of the application reading the cached data on the display terminal side is regarded as a substantial "request" that reflects the user's intention to listen to or save the message, and the entire series of processes that deliver the data in advance to fulfill that request falls within the scope of the technical idea of the present invention of "transmitting in response to a request." This configuration is beneficial in that it reduces the user's waiting time and provides a more comfortable playback experience.

[0021] When the data transmission unit 120 receives the download request, it reads the audio data corresponding to the message ID from the storage 105, encodes it in the data format specified in the request or in the default format (for example, uncompressed WAV, or compressed MP3 or AAC), and returns it to the display terminal 220 as the body of the HTTP response. This allows the user to save the audio file to their device.
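
The format selection described in [0019] and [0021] can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of this embodiment: the parameter names `format` and `bitrate_kbps`, the 128 kbps default, the choice of WAV as the default format, and the fallback to the default for unknown formats are all hypothetical, and the actual encoding step is omitted.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Commonly used MIME types for the formats named in the text.
SUPPORTED_FORMATS = {"wav": "audio/wav", "mp3": "audio/mpeg", "aac": "audio/aac"}
DEFAULT_FORMAT = "wav"        # assumption: default is uncompressed WAV
DEFAULT_BITRATE_KBPS = 128    # assumption: default for compressed formats


@dataclass(frozen=True)
class EncodingPlan:
    audio_format: str
    mime_type: str
    bitrate_kbps: Optional[int]  # None for uncompressed WAV


def plan_encoding(params: Mapping[str, str]) -> EncodingPlan:
    """Decide how to encode the stored audio for one download request."""
    fmt = params.get("format", DEFAULT_FORMAT).lower()
    if fmt not in SUPPORTED_FORMATS:
        fmt = DEFAULT_FORMAT  # assumption: unknown formats fall back to the default
    bitrate = None
    if fmt != "wav":
        raw = params.get("bitrate_kbps", "")
        bitrate = int(raw) if raw.isdigit() else DEFAULT_BITRATE_KBPS
    return EncodingPlan(fmt, SUPPORTED_FORMATS[fmt], bitrate)
```

A request for social media sharing might pass `{"format": "mp3", "bitrate_kbps": "64"}`, while a request with no parameters would receive the default format.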

[0022] As a variation, the data transmission unit 120 can also transmit not only audio data, but also text data and various feature quantities generated by the message analysis unit 130 (described later) in JSON format.

[0023] Here, "in response to a request from the display terminal" is not limited to requests arising from explicit user actions (e.g., tapping the download button 601). It refers to a broader concept in which, within a series of actions by the user interacting with the application, the system detects or infers the user's potential intention to listen or save, and in response initiates data transmission. For example, when the user views a list of new messages or focuses on a specific message and this triggers the application to request the corresponding audio data from the information processing device 100 in the background, that action is also considered a form of "request" that reflects the user's potential intention.

[0024] Furthermore, in this specification, "storable data format" refers to a non-exclusive or commonly used file format (e.g., WAV, MP3, AAC format, etc.) that can be used by the user in other applications or copied externally via a file management system (so-called file manager or gallery app, etc.) provided as standard by the operating system of the display terminal. Even if the data transmitted from the information processing device is temporarily protected by technical protection measures such as encryption, if the user can ultimately obtain the data permanently in the above-mentioned general-purpose file format through a legitimate function of the application on the display terminal (e.g., export function), this shall be included in "transmitting in a storable data format" of the present invention.

[0025] <Message analysis unit 130 (message analysis means)> The message analysis unit 130 forms the core of the intelligence function in this embodiment, extracting multifaceted features from received voice messages to evaluate their value. This processing is preferably triggered when a voice message is received and executed asynchronously via a queuing system (e.g., RabbitMQ or AWS SQS). This allows the system to remain responsive even when a large number of messages are received simultaneously.

[0026] The processing of the message analysis unit 130 consists of the following three main steps.

[0027] (1) Text Conversion (Keyword Extraction): First, the audio data is converted into text data using a speech recognition engine. As the engine, a Transformer-based E2E (End-to-End) model (for example, OpenAI's Whisper) or a cloud service (for example, Google Cloud Speech-to-Text API, Amazon Transcribe) can be used. When performing speech recognition, the accuracy of recognition may be improved by providing attribute information such as the speaker's (child's) age as a hint. In addition, it is possible to retain not only the text with the highest confidence as a recognition result, but also multiple candidates (N-best list) and use them in subsequent processing.

[0028] The obtained text data is then segmented into words using a morphological analyzer (e.g., MeCab or Sudachi). These words are then compared against a pre-prepared keyword dictionary. Ideally, this dictionary should have a hierarchical structure, such as "emotion (broad category) - gratitude (medium category) - thank you (word)," rather than simply being a list of words. This allows for more flexible weighting during the scoring process.
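
The hierarchical lookup described in [0028] can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a list of tokens that a morphological analyzer is assumed to have already produced, and every category label and dictionary entry other than "emotion", "gratitude", and "thank you" is a hypothetical example.

```python
from typing import Dict, List, Set, Tuple

# Hierarchy: broad category -> medium category -> words.
# Only "emotion" / "gratitude" / "thank you" comes from the text; the rest is illustrative.
KEYWORD_DICTIONARY: Dict[str, Dict[str, Set[str]]] = {
    "emotion": {
        "gratitude": {"thank you"},
        "affection": {"I love you"},
    },
    "incident": {
        "injury": {"painful", "fell down"},
    },
}


def match_keywords(
    tokens: List[str],
    dictionary: Dict[str, Dict[str, Set[str]]] = KEYWORD_DICTIONARY,
) -> List[Tuple[str, str, str]]:
    """Return (broad, medium, word) for every token found in the dictionary."""
    hits = []
    for token in tokens:
        for broad, mediums in dictionary.items():
            for medium, words in mediums.items():
                if token in words:
                    hits.append((broad, medium, token))
    return hits
```

Because each hit carries its broad and medium category, a later scoring step can weight by category rather than by individual word.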

[0029] (2) Speech Feature Extraction: Using open-source speech analysis libraries (e.g., Librosa, Praat, openSMILE), acoustic and prosodic features are extracted directly from the speech data. The features to be extracted include, but are not limited to, the following:
  • Fundamental frequency (pitch, F0): The pitch of the voice. Its average, maximum, minimum, standard deviation, and rate of variation (jitter) are extracted. A significantly higher pitch than normal may suggest joy or excitement, while a lower pitch may suggest sadness or disappointment.
  • Power (energy, loudness): The volume of the voice. Its average, maximum, and magnitude of fluctuation (shimmer) are extracted. A loud voice may suggest emphasis or anger, while a quiet voice may suggest whispering or lack of confidence.
  • Speech rate: The number of morae or syllables per unit of time. A fast speech rate may suggest excitement or impatience, while a slow speech rate may suggest calmness or deep thought.
  • Timbre-related features: MFCCs (Mel-Frequency Cepstral Coefficients), spectral centroid, spectral spread, etc. These capture the characteristics of the voice quality itself.
These features are used for comparison with the standard model described later.
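
As one illustration of the fundamental-frequency statistics listed above, the following sketch summarizes a per-frame F0 contour. It assumes the contour has already been estimated by a pitch tracker and that unvoiced frames are marked with 0 or NaN; both conventions are assumptions, and jitter is left out because it requires cycle-level measurements.

```python
import statistics
from typing import Dict, Sequence


def summarize_pitch(f0_hz: Sequence[float]) -> Dict[str, float]:
    """Summarize an F0 contour in Hz, ignoring unvoiced frames."""
    # Assumption: unvoiced frames are 0 or NaN; "f > 0" is False for both.
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames in contour")
    return {
        "mean": statistics.fmean(voiced),
        "max": max(voiced),
        "min": min(voiced),
        "std": statistics.pstdev(voiced),  # population standard deviation
    }
```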

[0030] (3) Contextual information extraction: Information about the circumstances under which the message was sent is extracted. This includes the time the message was sent, the day of the week, and the interval between messages (sending frequency). For example, a message sent late at night when messages are not usually received, or a message sent on the recipient's birthday, may have special significance.

[0031] Furthermore, location information (latitude and longitude) transmitted from the GPS terminal 210 along with the voice message is also extracted as important contextual information. In the scoring process described later, this location information is used to give a message a higher score if it was sent from within a pre-registered specific area or from a non-stationary location that does not appear in past behavioral patterns. A specific area is set, for example, by the user drawing circles or polygons on the map interface of the application on the display terminal 220 and assigning labels such as "home" or "school". A location is judged to be non-stationary, for example, when clustering analysis of past GPS logs shows that it does not belong to any major activity cluster, or when it is more than a predetermined distance (e.g., 5 kilometers) away from the registered specific areas.
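
The distance-based variant of this check can be sketched as follows. This is a minimal illustration that handles circular areas only and uses the haversine great-circle distance; measuring the 5-kilometer distance from the edge of each circle, the `"non_stationary"` label, and the tuple layout of an area are assumptions. Polygon areas and the clustering-based determination are not covered.

```python
import math
from typing import List, Optional, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

# (label, center_lat, center_lon, radius_km); hypothetical layout
Area = Tuple[str, float, float, float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def classify_location(lat: float, lon: float, areas: List[Area],
                      far_km: float = 5.0) -> Optional[str]:
    """Return an area label, "non_stationary", or None for an ordinary location."""
    nearest_edge_km = math.inf
    for label, c_lat, c_lon, radius_km in areas:
        d = haversine_km(lat, lon, c_lat, c_lon)
        if d <= radius_km:
            return label
        nearest_edge_km = min(nearest_edge_km, d - radius_km)
    # Only flag a location as non-stationary when areas exist and all are far away.
    if areas and nearest_edge_km > far_km:
        return "non_stationary"
    return None
```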

[0032] As a variation, if a GPS terminal is equipped with an accelerometer, in addition to estimating the user's activity state (stationary, walking, running, etc.) from the sensor data, it is also extremely effective to detect sudden changes in the user's state and add them to the contextual information. For example, data detecting impacts exceeding a predetermined threshold or data showing a sudden transition from a running state to a stationary state can be extracted and used in the scoring described later. In this way, if a large change in acceleration and a cessation of movement are recorded along with the utterance "I fell," the reliability of the statement can be judged to be very high.
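
The two accelerometer cues mentioned in [0032] can be sketched as follows. This is a minimal illustration under stated assumptions: acceleration magnitudes are given in units of g, the 2.5 g threshold is a hypothetical value, and the activity states are assumed to have been estimated upstream as the strings shown.

```python
from typing import Sequence


def has_impact(magnitudes_g: Sequence[float], threshold_g: float = 2.5) -> bool:
    """True if any acceleration sample exceeds the impact threshold (assumed value)."""
    return any(m > threshold_g for m in magnitudes_g)


def has_sudden_stop(states: Sequence[str]) -> bool:
    """True if a "running" state is immediately followed by "stationary"."""
    return any(
        prev == "running" and cur == "stationary"
        for prev, cur in zip(states, states[1:])
    )
```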

[0033] <Standard model generation unit 160 (standard model generation means)> The standard model generation unit 160 generates a personalized standard voice model for each user (speaker) and stores it in the user table 801 of the storage 105. This standard model represents the speaker's "normal" voice state and serves as an important baseline for value evaluation.

[0034] Speaker modeling techniques are used to generate the model. For example, the distribution of acoustic features (e.g., MFCCs) extracted from a set of past speech messages collected from a specific speaker (e.g., more than 100 messages from the past month) is modeled using a Gaussian Mixture Model (GMM).

[0035] As a more advanced variation, deep learning-based speaker embedding techniques (such as d-vectors or x-vectors) may be used. In this case, the speaker's voice features are represented as fixed-length, low-dimensional vectors (embedding vectors). The standard model is defined as the mean vector or distribution of these embedding vectors.

[0036] This standard model is updated through regular batch processing (e.g., late Sunday night) or whenever a certain number of new messages accumulate (e.g., 20). This allows it to adaptively track long-term voice changes, such as a child's voice changing.

[0037] Furthermore, the term "standard model" as used herein is not limited to specific advanced statistical models such as GMM or speaker embedding vectors. It can encompass any form of data structure or set of data that indicates the trends in the typical speech characteristics of a particular speaker. For example, even a simple set of statistical values such as the average value and standard deviation of fundamental frequency (pitch) and power calculated from past message sets is considered a "standard model" in this invention, as long as it functions as a baseline for comparison with new speech messages in subsequent processing.
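
The simplest form of standard model mentioned in [0037] can be sketched as follows: a baseline consisting of the mean and standard deviation of the per-message average pitch. This is a minimal illustration; the class and function names are hypothetical, and returning a deviation of 0 when the baseline has no spread is an assumption.

```python
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PitchBaseline:
    mean_hz: float
    std_hz: float


def build_baseline(past_mean_pitches_hz: Sequence[float]) -> PitchBaseline:
    """Build a baseline from the average pitch of each past message."""
    return PitchBaseline(
        mean_hz=statistics.fmean(past_mean_pitches_hz),
        std_hz=statistics.pstdev(past_mean_pitches_hz),
    )


def deviation_in_std(baseline: PitchBaseline, pitch_hz: float) -> float:
    """How far a new message's average pitch lies from the baseline, in standard deviations."""
    if baseline.std_hz == 0:
        return 0.0  # assumption: no spread means no meaningful deviation
    return abs(pitch_hz - baseline.mean_hz) / baseline.std_hz
```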

[0038] Next, the value evaluation and notification processing performed by the information processing device 100 will be described. Figure 5 is a flowchart of the value evaluation and notification processing performed by the information processing device according to this embodiment. This processing starts after the message analysis unit 130 has extracted various features from the voice message.

[0039] When the flowchart is started, the scoring unit 140 first calculates a score indicating the preservation value of the message using the features extracted by the message analysis unit 130 and the standard model generated by the standard model generation unit 160 (Step S501: Score calculation). Next, it is determined whether the calculated score exceeds a predetermined threshold (Step S502: Threshold determination).

[0040] If it is determined in step S502 that the score exceeds the threshold (step S502: YES), the process proceeds to step S503, where the notification unit 150 sends a notification to the display terminal 220 recommending that the message be saved (step S503: Notification execution), and the process then ends. If, on the other hand, it is determined in step S502 that the score does not exceed the threshold (step S502: NO), the process ends without sending a notification.

[0041] The scoring unit 140 and notification unit 150, which perform the main processes in this flowchart, will be described in more detail below.

[0042] <Scoring unit 140 (scoring means)> The scoring unit 140 uses multiple feature quantities extracted by the message analysis unit 130 and a standard model generated by the standard model generation unit 160 to comprehensively calculate a score indicating the preservation value of the message.

[0043] The scoring unit 140 compares the speech features extracted by the message analysis unit 130 with a standard model, for example, and adds a score based on the degree of deviation. For example, multiple rules may be combined, such as "add 15 points if the average pitch in the speech message deviates from the speaker's standard model by more than twice the standard deviation" or "add 30 points if a positive word (e.g., "thank you," "I love you") included in the keyword dictionary is detected," and the sum of these points may be used as the final score. In particular, the scoring means of this embodiment can perform advanced situational judgments that cannot be obtained from a single source of information by combining multiple information sources, and reflect these in the score. For example, an "incident detection logic" may be implemented. In this logic, if the keywords extracted by the message analysis means belong to a negative category such as "painful" or "fell down," and the location information obtained from the GPS terminal as contextual information changes rapidly within a predetermined time (e.g., a sudden deceleration of movement speed), the scoring means considers this not merely as a message but as a "significant event suggesting the possibility of an accident," and assigns an extremely high fixed value to the score (e.g., 100 out of 100), independently of other rules. This dramatically reduces the likelihood of parents missing important messages that warn of potential danger to their children, resulting in an unpredictable and remarkable effect that greatly contributes to improving the safety of this system.
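
The example rules and the incident detection logic in [0043] can be sketched as follows. This is a minimal illustration: the point values (15, 30, and the fixed 100) and the two-standard-deviation condition come from the text, while the function signature, the category labels `"positive"` and `"negative"`, and the cap at 100 are assumptions.

```python
from typing import Set

INCIDENT_SCORE = 100  # fixed value assigned independently of other rules


def rule_based_score(pitch_deviation_std: float,
                     keyword_categories: Set[str],
                     sudden_deceleration: bool) -> int:
    """Combine the example rules into one preservation-value score."""
    # Incident detection: a negative keyword plus a sudden change in movement
    # overrides every other rule.
    if "negative" in keyword_categories and sudden_deceleration:
        return INCIDENT_SCORE
    score = 0
    if pitch_deviation_std > 2.0:          # pitch deviates by more than 2 standard deviations
        score += 15
    if "positive" in keyword_categories:   # e.g. "thank you", "I love you"
        score += 30
    return min(score, 100)                 # assumption: scores are capped at 100
```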

[0044] The calculation method is not limited to a simple weighted linear sum model, Score = Σ(w_i * f_i). More advanced methods can be used, such as machine learning models trained in advance through supervised learning. For example, by training decision-tree-based models (e.g., gradient-boosted trees such as XGBoost), support vector machines (SVMs), or neural networks using messages previously "saved" by users as positive examples and messages not "saved" as negative examples, it is possible to output a score indicating the probability that a new message is "valuable."
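
The weighted linear sum Score = Σ(w_i * f_i) can be written directly as follows. The feature names in the usage note are hypothetical, and treating a feature missing from the input as 0 is an assumption.

```python
from typing import Mapping


def linear_score(weights: Mapping[str, float], features: Mapping[str, float]) -> float:
    """Weighted linear sum over the features that have a weight."""
    # Assumption: a feature absent from `features` contributes 0.
    return sum(w * features.get(name, 0.0) for name, w in weights.items())
```

For example, weights `{"pitch_deviation": 2.0, "positive_keyword": 0.5}` applied to features `{"pitch_deviation": 3.0, "positive_keyword": 4.0}` give 2.0 × 3.0 + 0.5 × 4.0 = 8.0.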

[0045] Furthermore, when using a machine learning model, in the initial stages of service use when sufficient user-specific response history has not yet been accumulated, a general-purpose scoring model trained using training data in which a large number of voice samples have been pre-labeled with value ratings by the developer may be applied. Then, after a predetermined number of user response histories, such as "save," have been accumulated, the system can be configured to dynamically transition from a general-purpose model to a personalized model that reflects the individual user's preferences by further training the system with the user's response history. This makes it possible to recommend high-value voice messages from the very beginning of service use.

[0046] A key feature of the scoring unit 140 is its self-optimizing function, which dynamically changes the weighting.

[0047] (a) Referencing event information: Event information such as "birthday" and "sports day" that users have registered in advance in the app is referenced from event table 803 in Figure 8. Then, around the date of the event (e.g., 3 days before and after), the feature weights are automatically increased if related keywords (e.g., "congratulations" and "you did your best") are detected.
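
The event-based weight adjustment in (a) can be sketched as follows. This is a minimal illustration: the three-day window comes from the text, while the function name, the boost factor of 2.0, and the assumption that keyword relatedness has already been decided upstream are hypothetical.

```python
from datetime import date


def event_weight(base_weight: float, message_date: date, event_date: date,
                 keyword_related: bool, window_days: int = 3,
                 boost: float = 2.0) -> float:
    """Increase a feature weight when a related keyword appears near a registered event."""
    in_window = abs((message_date - event_date).days) <= window_days
    if in_window and keyword_related:
        return base_weight * boost  # assumed boost factor
    return base_weight
```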

[0048] (b) Learning from response history: The system learns from the user's response history to past save recommendation notifications (e.g., "saved," "shared," "ignored"). For example, if a user responds to a notification by "saving" (a positive response), the feature vector of that message is recorded. The system then calculates the similarity (e.g., cosine similarity) between that feature vector and a new message, and awards bonus points to the score for higher similarity. This allows the system to learn the user's implicit preferences and continuously improve the accuracy of personalization.
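
The similarity bonus in (b) can be sketched as follows. This is a minimal illustration: cosine similarity is the measure named in the text, while taking the best match over all previously saved messages, the maximum bonus of 20 points, and ignoring negative similarity are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_bonus(new_vector: Sequence[float],
                     saved_vectors: Sequence[Sequence[float]],
                     max_bonus: float = 20.0) -> float:
    """Bonus points that grow with similarity to messages the user saved before."""
    if not saved_vectors:
        return 0.0
    best = max(cosine_similarity(new_vector, s) for s in saved_vectors)
    return max_bonus * max(best, 0.0)  # assumption: negative similarity earns no bonus
```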

[0049] (Variation: Visualization of Dynamic Weighting) Furthermore, this system may have a function to provide feedback to the user that the scoring weighting has been dynamically changed. For example, if a user has registered "Sports Day" in event table 803 and a message containing a related keyword (e.g., "I did my best") is recommended for saving, the notification and in-app display may include a message suggesting that the event information has been referenced, such as "A special phrase for 'Sports Day'!" Also, if personalization progresses as the user frequently responds to (saves) the save recommendation notification, a message such as "We found these messages to suit your preferences" may be displayed. These UI displays can serve as indicators that allow external parties to objectively confirm that the scoring method has taken event information and response history into consideration.

[0050] <Notification unit 150 (notification means)> The notification unit 150 sends a notification to the display terminal 220 recommending saving the message when the score calculated by the scoring unit 140 exceeds a predetermined threshold. This threshold does not necessarily have to be a fixed value; it is preferable that it be dynamically adjusted for each user based on, for example, the user's response history to past save recommendation notifications (e.g., the percentage of saved messages). For example, if a user's save rate is low, the threshold can be lowered to increase the opportunities for notifications, and if the save rate is high, the threshold can be raised to select notifications more carefully, thereby maintaining an optimal notification frequency for each user.
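
A minimal sketch of the per-user threshold adjustment follows. It uses the direction stated in this paragraph (a low save rate lowers the threshold, a high save rate raises it). The band limits, step size, and clamping range are hypothetical values.

```python
def adjust_threshold(threshold, save_rate, low=0.2, high=0.6,
                     step=2.0, floor=50.0, ceiling=95.0):
    """Nudge the notification threshold based on the user's save rate.

    save_rate is the fraction of past recommendations the user saved.
    """
    if save_rate < low:
        threshold -= step   # more notification opportunities
    elif save_rate > high:
        threshold += step   # select notifications more carefully
    return min(max(threshold, floor), ceiling)


lowered = adjust_threshold(80.0, save_rate=0.1)
raised = adjust_threshold(80.0, save_rate=0.9)
unchanged = adjust_threshold(80.0, save_rate=0.4)
```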

[0051] Figure 10 shows an example of a message list screen 1000 according to this embodiment. The application on the display terminal 220 displays a list of voice messages transferred from the transfer unit 110 in chronological order. At this time, it is preferable to display messages 1010 whose scores calculated by the scoring unit 140 exceed a predetermined threshold separately from other normal messages 1020. In the example in Figure 10, a treasure chest-shaped save recommendation icon 1011 is assigned to messages 1010 that are recommended to be saved. This allows the user to intuitively find "treasure" candidates that the system has determined to be valuable from among a large number of messages.

[0052] Figure 7 shows an example screen of a notification with action buttons according to this embodiment. When the score exceeds a threshold, the notification unit 150 sends a rich push notification, which is different from a normal message reception notification. This notification includes a special message 702 (e.g., "You've found the perfect message!") and an icon to attract the user's attention. Furthermore, action buttons ("Save" 701, "Later" 703) that can be operated directly on the notification are set. Since the user can instruct actions such as saving or sharing the message directly from the notification without opening the app, the user experience is greatly improved.

[0053] (Modification: Display of recommendation reasons) As a further modification of this embodiment, the notification unit 150 may not only send a notification recommending saving, but also present the user with information that formed the basis of the recommendation. For example, the message 702 of the notification 700 with an action button shown in Figure 7 may include keyword information such as "You received the word 'I love you'!", or the message playback screen 600 within the application may have a UI that displays a summary of the analysis results by the message analysis unit 130 (e.g., results of vocal characteristics and sentiment analysis), such as "Your voice sounds more energetic than usual!". This recommendation reason display function helps the user understand why the message is special and deepen their attachment to it. At the same time, the very existence of such a display can serve as strong evidence that the message analysis means and scoring means in this invention are operating based on specific keywords or vocal characteristics.

[0054] <Video generation unit 170 (video generation means)> The video generation unit 170 generates a video file from audio and images in order to convert the audio message into a more emotionally expressive and shareable format. This process is often performed on demand in response to an explicit request from the user, but for messages with extremely high scores, a configuration in which the video is generated in advance and cached is also conceivable.

[0055] Furthermore, since video generation processing can place a heavy load on the server, it is desirable for the information processing device 100 to employ an asynchronous distributed processing architecture. Specifically, video generation requests from the display terminal 220 are first registered in a message queue, and a confirmation response is sent back to the requesting terminal. Then, one or more independent worker servers specializing in video generation sequentially retrieve requests from the message queue and execute the video generation process. Once processing is complete, the worker server saves the completed video file to shared storage and notifies the user of this fact via push notification or the like. With this configuration, even when a large number of requests occur simultaneously, it is possible to maintain stable system responsiveness and ensure high scalability.
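
The queue-and-worker flow described above can be illustrated with an in-process stand-in. This sketch uses Python's standard `queue` and `threading` modules in place of a real message queue and separate worker servers; the request fields and the `generate` callback are hypothetical.

```python
import queue
import threading


def run_workers(requests, generate, worker_count=2):
    """Enqueue requests, acknowledge them at once, and process them on workers."""
    jobs = queue.Queue()
    completed = []
    lock = threading.Lock()

    def worker():
        while True:
            request = jobs.get()
            if request is None:          # sentinel: no more work
                jobs.task_done()
                return
            path = generate(request)     # the heavy video generation step
            with lock:
                # Stand-in for saving to shared storage and pushing a notice.
                completed.append((request["user_id"], path))
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    accepted = []
    for request in requests:
        jobs.put(request)
        accepted.append(request["message_id"])  # immediate confirmation response
    for _ in threads:
        jobs.put(None)
    jobs.join()
    for thread in threads:
        thread.join()
    return accepted, completed


requests = [{"message_id": i, "user_id": "u1"} for i in range(5)]
accepted, completed = run_workers(
    requests, lambda r: f"/videos/{r['message_id']}.mp4")
```

Because the confirmation is recorded before any worker finishes, the requesting terminal is never blocked on video generation, which is the point of the asynchronous configuration.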

[0056] Figure 9 shows the flow of the video generation process. First, when a video generation request is received from the display terminal 220 (step S301), the video generation unit 170 retrieves audio data and a profile image linked to the user table 801 from the storage 105 based on the message ID and user ID (step S302).

[0057] Next, a basic video is generated by combining audio and still images using a server-side multimedia conversion tool (e.g., ffmpeg) (step S303).
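
One possible form of step S303 is sketched below as a function that builds, but does not run, an ffmpeg command line. The specification names only ffmpeg; the codec choices (libx264, AAC) and file paths are assumptions.

```python
def basic_video_command(image_path, audio_path, output_path):
    """Build an ffmpeg argument list combining a still image with audio."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", image_path,   # repeat the still image as video
        "-i", audio_path,                 # the voice message
        "-c:v", "libx264", "-tune", "stillimage",
        "-c:a", "aac",
        "-pix_fmt", "yuv420p",            # widely compatible pixel format
        "-shortest",                      # stop when the audio ends
        output_path,
    ]


command = basic_video_command("profile.png", "message.m4a", "message.mp4")
```

The list can be passed to `subprocess.run(command, check=True)` on a host where ffmpeg is installed.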

[0058] Figure 11 shows an example of the video generation settings screen 1100 according to this embodiment. Through this screen, the user can customize the generated video. For example, in the image source selection unit 1101, the user can select an image to be used as the background of the video from the user's profile picture, a pre-prepared library, or an AI-generated image. In the effect selection unit 1102, the user can check the video effects recommended by the system according to the emotion of the message (e.g., particle effects corresponding to the emotion of joy), or add or change any effect. When the preview button 1103 is pressed, a preview of the video reflecting the settings is played.

[0059] (Variation: Diversification of images used for video generation) The images used by the video generation means 170 are not limited to profile images associated with the user. The technical concept of the present invention encompasses the use of images obtained from various sources in order to generate more expressive videos that are in line with the content and context of the voice message. For example, the following image sources can be considered:
(a) Keyword-linked images: Images selected from a pre-prepared library of illustrations and photographs based on keywords extracted by the message analysis unit 130.
(b) Location information-linked images: Map and landscape images obtained from external map information services, etc., based on location information obtained from the GPS terminal 210.
(c) AI-generated images: Images dynamically generated using an image generation AI based on text obtained by voice recognition and analyzed emotions.
In a more advanced form, the video generation means can be configured to be closely linked with the evaluation logic of the scoring means and to generate context-appropriate videos fully automatically. For example, suppose the scoring means finds that (a) keywords expressing emotion, such as "It's the sea" and "It's beautiful", are detected in the voice message, (b) the GPS location information indicates a "non-stationary location", and (c) the message falls within the event period of a "family trip" pre-registered by the user. If these three conditions are met, the scoring means gives the message the highest rank as a "particularly valuable memory". Triggered by this evaluation result, the video generation means 170 automatically performs the following processes. First, it accesses external map information services and tourist information APIs using the GPS location information as a key, obtains landscape images and landmark images of the corresponding location, and sets them as the background of the video (location-linked images). Next, based on the emotion of "joy" analyzed by the message analysis means, it adds sparkling particle effects and upbeat background music. Finally, it synthesizes the voice recognition result text "Wow, it's the sea!" into the video as a caption with bouncy animation. In this way, the multiple pieces of information that formed the basis of the scoring (keywords, location, events) are directly reflected in each element of video generation (background image, effects, text overlays). This makes it possible to generate highly personalized video content without any user interaction, not merely by combining audio and images but by visually reproducing the emotions and context of the moment itself. By using these diverse image sources individually or in combination, it becomes possible to provide video content that more richly expresses the situation and emotions of the message.

[0060] The most distinctive feature of the video generation unit 170 is its ability to add dynamic video effects that correspond to the emotions and content of the message. The video effects are changed according to the emotions (e.g., "joy," "sadness," "surprise") analyzed by the message analysis unit 130. This emotion analysis is performed using, for example, a rule-based method that combines keywords (e.g., "happy") and auditory features (e.g., rise in average pitch), or a machine learning model such as a convolutional neural network (CNN) that has been pre-trained to classify emotions using an audio spectrogram as input. This allows for highly accurate identification of emotion categories such as "joy" and "sadness."

[0061]
• Emotion-based effects: For "joy," sparkling particle effects and heart-shaped animations are added as overlays. For "sadness," the entire video is processed in sepia or monochrome, or effects such as falling rain are added.
• Keyword-based effects: If the message contains the keyword "flower," a flower illustration stamp is displayed; if it contains the keyword "train," a train illustration is displayed.
• Adding background music: Based on the emotion, the system automatically selects the most suitable song from a pre-prepared BGM library and adds it as background music. For example, a cheerful song in a major key is selected for joyful emotions, and a quiet song in a minor key is selected for sad emotions.
• Adding text overlays: The text from the speech recognition results is embedded as subtitles (text overlays) in the video. At this time, the font type, color, and animation are also changed according to the emotion (e.g., a bouncy font for joy, a bold red font for anger).

[0062] These processes are achieved by dynamically combining various ffmpeg filters (overlay, drawtext, colorchannelmixer, etc.) to generate commands. Finally, the completed video file is sent back to the display terminal 220 via the data transmission unit 120.
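
As a rough illustration of assembling a filter chain from the analyzed emotion, the sketch below combines `colorchannelmixer` and `drawtext`. The emotion-to-filter mapping, colors, and font size are assumptions. The sepia coefficients are the commonly used sepia matrix for `colorchannelmixer`. Overlay effects, which need an additional input stream, are omitted, and `drawtext` without an explicit font file depends on the local ffmpeg build.

```python
# Commonly used sepia matrix for ffmpeg's colorchannelmixer filter.
SEPIA = "colorchannelmixer=.393:.769:.189:0:.349:.686:.168:0:.272:.534:.131"


def build_filter_chain(emotion, caption_file=None):
    """Return a comma-separated ffmpeg video filter chain for an emotion.

    caption_file is a text file holding the speech recognition result;
    reading it via textfile= avoids escaping the caption text itself.
    """
    filters = []
    if emotion == "sadness":
        filters.append(SEPIA)
    if caption_file:
        color = {"joy": "yellow", "anger": "red"}.get(emotion, "white")
        filters.append(
            f"drawtext=textfile={caption_file}:fontcolor={color}"
            ":fontsize=36:x=(w-text_w)/2:y=h-text_h-40"
        )
    return ",".join(filters)


sad_chain = build_filter_chain("sadness")
joy_chain = build_filter_chain("joy", caption_file="caption.txt")
```

A non-empty chain would be passed to ffmpeg with `-vf`. File paths containing colons or commas would need escaping, which this sketch does not handle.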

[0063] In this specification, "analyzing emotions and...changing them according to the analyzed emotions" broadly refers to the process of adaptively determining the added visual effects based on information extracted from the audio. This includes not only cases where abstract emotion categories such as "joy" and "sadness" are identified using models such as CNNs, but also configurations in which the visual effects are switched using specific keywords extracted by the message analysis unit 130, or auditory features such as pitch and volume, as direct triggers. Such configurations also represent one aspect of the technical concept of the present invention, as they change the expression according to cues that suggest the speaker's emotions.

[0064] <External Linkage Unit 180 (External Linkage Means)> The external linkage unit 180 acts as a hub for this system, providing functionality to aggregate important data into other cloud services that users normally use. Users configure account linking with external storage services (e.g., Google Photos, Dropbox, LINE Album, etc.) in advance on the app's settings screen. This linking is preferably performed in accordance with the OAuth 2.0 protocol. When the user moves from a button in the app to the authentication screen of the external service and grants permission to access the data, an access token for using the external service's API is issued. This token is encrypted and securely stored in the storage 105 of the information processing device 100.

[0065] Users can flexibly set "storage policies" for saving data to linked services. Examples of policies include the following:
• "For messages with a score of 90 or higher, generate a video and automatically upload it to Google Photos."
• "Save messages containing the keyword 'I love you' as audio files in a specific folder on Dropbox."
• "Post to the LINE family album only those messages whose notification you answered with 'Save'."
Furthermore, the storage policy in this embodiment is not limited to a single static rule. Notably, a "hybrid policy" can be set that dynamically changes the level of automation in data storage according to a score indicating the value of the message. For example, the user can set a step-by-step rule such as "Automatic mode: If the score is above a first threshold (e.g., 95 points), send the data to external storage without requiring user confirmation" and "Confirmation mode: If the score is below the first threshold but above a second threshold (e.g., 80 points), send a notification with an action button and send the data only after the user explicitly approves it." This hybrid policy lets the system automatically persist, without omission, messages that it deems to be of extremely high value, while respecting the user's final decision for messages whose value is uncertain. This resolves the trade-off between the risk of "unintended data storage" associated with full automation and the "storage effort and opportunity loss" associated with fully manual storage, improving both user convenience and peace of mind.
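
The two-threshold hybrid policy can be sketched as a single decision function. The example thresholds (95 and 80) come from the paragraph above; treating a score exactly equal to a threshold as meeting it, and the returned action names, are assumptions.

```python
def decide_storage_action(score, auto_threshold=95.0, confirm_threshold=80.0):
    """Map a message score to a level of automation."""
    if score >= auto_threshold:
        return "auto_upload"          # automatic mode: no user confirmation
    if score >= confirm_threshold:
        return "notify_and_confirm"   # confirmation mode: wait for the user
    return "no_action"


decisions = [decide_storage_action(s) for s in (97.0, 85.0, 60.0)]
```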

[0066] Figure 12 shows an example of a save policy settings screen 1200 that allows users to intuitively configure this save policy. On this screen, users can create and edit automatic data save rules in the format "IF 'Trigger' THEN 'Action'". In the trigger setting section 1201, the user selects the conditions for activating the policy (e.g., "Score is 90 or higher", "Keyword 'Thank you' is included") from a pull-down menu, etc. In the action setting section 1202, the user selects the action to be executed when the trigger is met (e.g., "Upload video to Google Photos", "Post to LINE album"). By pressing the add policy button 1203, the user can freely combine multiple policies. In addition to simply selecting an action such as "Upload to Google Photos" in the action setting section 1202 of this policy settings screen 1200, the UI may also include the ability to select the level of automation, such as "Execute automatically (automatic mode)" or "Confirm every time (confirmation mode)". This would allow users to intuitively build hybrid policies like the one described above.
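
Rules in the "IF 'Trigger' THEN 'Action'" format could be represented as a list of trigger predicates paired with actions, as in the sketch below. The `Policy` class, the message fields, and the action identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    trigger: Callable[[dict], bool]  # IF: condition evaluated on a message
    action: str                      # THEN: identifier of the action to run


def matching_actions(message, policies):
    """Return the actions of every policy whose trigger fires, in order."""
    return [policy.action for policy in policies if policy.trigger(message)]


policies = [
    Policy(lambda m: m["score"] >= 90, "upload_video_to_photo_service"),
    Policy(lambda m: "thank you" in m["keywords"], "post_to_family_album"),
    Policy(lambda m: "i love you" in m["keywords"], "save_audio_to_folder"),
]
message = {"score": 92, "keywords": ["thank you"]}
actions = matching_actions(message, policies)
```

Each entry added with the add policy button 1203 would correspond to one more `Policy` in the list.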

[0067] Users can intuitively configure these storage policies on the application screen using templates in the format, for example, "IF 'Trigger' THEN 'Action'". Furthermore, if the external linkage unit 180 fails to send data to the external storage service, it automatically retries according to a predetermined algorithm (e.g., exponential backoff). If repeated retries fail, or if the access token required for linkage with the external service has expired, the system notifies the user of the situation via push notification or other means and prompts the user to try again or re-authenticate. These functions ensure the reliability and robustness of automatic storage.
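
A minimal sketch of retrying with exponential backoff is shown below. The attempt limit, base delay, and use of `ConnectionError` as the retryable failure are assumptions; the `sleep` parameter is injectable so the example runs without waiting.

```python
import time


def send_with_backoff(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send(), doubling the wait after each failure (1s, 2s, 4s, ...)."""
    for attempt in range(max_attempts):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up; the caller would notify the user here
            sleep(base_delay * 2 ** attempt)


# Demonstration with a hypothetical upload that fails twice, then succeeds.
delays = []
attempts = {"count": 0}


def flaky_upload():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "uploaded"


result = send_with_backoff(flaky_upload, sleep=delays.append)
```

An expired access token is a different kind of failure: retrying cannot fix it, so it would be routed straight to the re-authentication prompt rather than through this loop.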

[0068] The external linkage unit 180, triggered by the evaluation result of the scoring unit 140 or by the user's response to a notification sent by the notification unit 150, calls the API of the corresponding external service and uploads data (voice, video, text, etc.) according to the configured policy. This allows users to centrally and permanently manage their precious memories across multiple services.

[0069] It should be noted that the present invention is not limited to the embodiments described above, and the components of each embodiment can be appropriately combined without departing from the purpose of the present invention. Furthermore, the present invention can be understood not only as an information processing device 100, but also as a voice message processing method executed by the information processing device 100, or as a program for causing a computer to execute that method.

[0070] This invention contributes to solving technical challenges such as improving computer functionality and user interfaces.

[0071] Computer function improvements: Optimization of processing load: Computationally demanding processes such as voice analysis, scoring, and video generation are performed centrally on the resource-rich information processing device (server) 100, which has the technical effect of minimizing the CPU load and battery consumption on the user terminals 200, such as the GPS terminal 210 and the display terminal 220. Alternatively, as disclosed in this embodiment, a distributed processing configuration can be used in which some or all of these processes are executed on the display terminal 220. In particular, the asynchronous distributed processing architecture disclosed in this embodiment has the technical effect of separating specific high-load processes (video generation) from the main response processing, thereby improving the scalability and stability of the entire system.

[0072] Optimizing the data structure: By adopting a normalized database structure such as the user table 801, message table 802, and event table 803, as shown in Figure 8, data management is made more efficient, enabling faster searching and updating. In particular, managing standard models for each user, scoring weights, and dynamic thresholds improves the internal processing of the computer, allowing for more efficient personalization.

[0073] Reduced data usage: Instead of downloading all voice messages to the device in high quality at all times, only messages deemed to be of high value are notified, and data (voice or video) is sent only upon user request. Furthermore, features that dynamically convert data formats upon user request, and configurations that include the data itself in the push notification payload and send it in advance, have the technical effect of suppressing unnecessary data communication and reducing the load on network bandwidth.

[0074] User interface improvements: Improved Usability: Users are freed from the time and mental burden of listening to every single voice message they receive daily. The system proactively suggests potential "treasure" messages, reducing the risk of missing important messages. This is a concrete UI improvement that reduces the number of actions users need to take to discover information.

[0075] Improved Accessibility: Notifications 700 with action buttons 701 shorten the multi-step process of opening an app to a single tap from the notification. Furthermore, intuitive operation such as saving policy settings via an IFTTT-like UI and specifying areas on a map improves the computer's usability, allowing even users unfamiliar with computer operations to easily set up and execute advanced functions.

[0076] [General tasks] In the field of information processing, the aim is to facilitate the smooth exchange of voice message data between terminals with different characteristics, enabling users to easily save and utilize such data in their desired format, thereby improving user convenience.

[0077] [Problem addressed by Appendix 1] The same as the general task above. [Appendix 1] An information processing device that communicates with a user terminal including a display terminal having a display unit and a GPS terminal, and processes voice messages transmitted and received between the user terminals, comprising: a transfer means for transferring a voice message transmitted from the GPS terminal to the display terminal; and a data transmission means for transmitting the voice message in a data format that can be stored on the display terminal in response to a request from the display terminal. [Effects of Appendix 1] According to this disclosure, voice messages transmitted from GPS terminals can be provided in a data format that can be saved upon request from the user's display terminal. This allows users to view voice messages sent from GPS terminals on display terminals and permanently save them to their own terminals as needed. As a result, voice message data can be smoothly linked between terminals with different characteristics, and users can easily save and utilize the data in their desired format, thereby improving user convenience.

[0078] [Problem addressed by Appendix 2] Potentially valuable messages contained within the large number of voice messages that users receive daily may be buried and go unnoticed by the user, or may be deleted when the storage period expires. Additionally, carefully listening to every message and deciding whether to save it is a significant burden for the user. [Appendix 2] An information processing device according to claim 1, further comprising: a message analysis means for extracting contextual information from a voice message, including voice characteristics, predetermined keywords contained in the text obtained by transcribing the message, and the transmission time of the message; a scoring means for calculating a score indicating the storage value of the message based on the voice characteristics, keywords, and contextual information extracted by the message analysis means; and a notification means for sending a notification to the display terminal recommending the storage of the message when the score exceeds a predetermined threshold. [Effects of Appendix 2] While reducing the burden on users, the system can proactively discover valuable voice messages and notify users of their existence. This prevents important messages from being overlooked and encourages them to be saved.

[0079] [Problem addressed by Appendix 3] When the value of a message is evaluated using uniform criteria, the evaluation fails to reflect the individual user's values, changes over time such as a child's growth, or specific life events, leading to a decrease in evaluation accuracy. [Appendix 3] The information processing device according to claim 2, wherein the information processing device further comprises a standard model generation means for generating and storing a personalized standard model that shows the characteristics of the user's voice based on a group of past voice messages of the user, the message analysis means includes, as the voice characteristics, a degree of deviation calculated by comparing the voice characteristics of a newly received voice message with the standard model, and the scoring means dynamically changes the weights used to calculate the score based on event information obtained from an external source or the user's response history to past notifications. [Effects of Appendix 3] By comparing with personalized standard models and through self-learning based on user circumstances and feedback, it becomes possible to discover messages that are truly valuable to the user with greater accuracy and in a more adaptive manner.

[0080] [Problem addressed by Appendix 4] The reuse of voice messages tends to be limited to simply listening to them again, and more value-added uses are not being realized. [Appendix 4] An information processing device according to claim 2, further comprising a video generation means for generating a video file based on the voice message and the profile image associated with the user. [Effects of Appendix 4] Because voice messages can be easily converted into video content, the value of messages and the ways in which they can be used and enjoyed expand, and sharing with family and friends becomes easier.

[0081] [Problem addressed by Appendix 5] The generated videos are uniform, consisting only of audio and still images, and do not adequately express the original emotions and nuances contained in the message. [Appendix 5] An information processing device according to claim 4, wherein the message analysis means further analyzes the emotion of the voice message, and the video generation means changes the video effects to be added to the video file according to the analyzed emotion. [Effects of Appendix 5] By analyzing the emotions of a message and adding corresponding visual effects, it is possible to visually amplify the emotional impact and provide more expressive and memorable video content.

[0082] [Problem addressed by Appendix 6] The information sources for judging the value of a message are limited to audio and text, making it difficult to capture the specific circumstances in which the message was sent. [Appendix 6] An information processing device according to claim 2, wherein the message analysis means further extracts location information acquired from the GPS terminal as contextual information, and the scoring means weights the score so that it is higher when the location information is within a specific area that has been registered in advance, or when it indicates a non-stationary location that does not appear in past behavioral patterns. [Effects of Appendix 6] By adding location information unique to GPS devices to the basis of value judgments, it becomes possible to more effectively detect unusual events such as those at travel destinations, or important messages related to specific locations.

[0083] [Problem addressed by Appendix 7] Even if a message is judged to be of high value, the data may ultimately be lost if the user forgets to save it or puts off saving it because it is troublesome. [Appendix 7] An information processing device according to claim 2, further comprising external linking means for automatically transmitting data generated from messages whose score exceeds a predetermined threshold, in accordance with a storage policy set in advance by the user, to a linked external storage service. [Effects of Appendix 7] Important data is automatically saved to secure external storage without any user intervention, ensuring data persistence and significantly improving user peace of mind and convenience.

[0084] [Problem addressed by Appendix 8] The automatic data saving function involves a dilemma: complete automation raises concerns that data unintended by the user might be saved, while requiring users to open the app and manually grant permission each time diminishes the convenience of automation. [Appendix 8] An information processing device according to claim 7, wherein the storage policy includes a confirmation mode in which the data is transmitted triggered by a positive response from a user to the notification, and the notification means transmits a notification to the display terminal which includes an action button that can be operated for the positive response. [Effects of Appendix 8] A user interface can be provided that allows users to complete final confirmation and save execution with a single tap on the notification. This balances the convenience of automatic saving with user control.

[0085] [Problem addressed by Appendix 9] In the field of information processing, the aim is to facilitate the smooth exchange of voice message data between terminals with different characteristics, enabling users to easily save and utilize such data in their desired format, thereby improving user convenience. [Appendix 9] A voice message processing method for processing voice messages transmitted and received between user terminals including a display terminal having a display unit and a GPS terminal, wherein a processor transfers a voice message transmitted from the GPS terminal to the display terminal, and the processor transmits the voice message to the display terminal in a data format that can be stored on the display terminal in response to a request from the display terminal. [Effects of Appendix 9] According to the voice message processing method of this embodiment, users can check voice messages sent from a GPS terminal on a display terminal and easily save them on their own terminal in a desired format, thereby improving convenience.

[0086] [Problem addressed by Appendix 10] In the field of information processing, the aim is to facilitate the smooth exchange of voice message data between terminals with different characteristics, enabling users to easily save and utilize such data in their desired format, thereby improving user convenience. [Appendix 10] A program that causes a processor to perform the following actions: transfer a voice message sent from a GPS terminal to a display terminal, and, in response to a request from the display terminal, send the voice message to the display terminal in a data format that can be stored on the display terminal. [Effects of Appendix 10] According to the program of this embodiment, users can view voice messages transmitted from a GPS terminal on a display terminal and easily save them to their own terminal in a desired format, thereby improving convenience.

[0087] [Problem addressed by Appendix 11] In the field of information processing, the aim is to facilitate the smooth exchange of voice message data between terminals with different characteristics, enabling users to easily save and utilize such data in their desired format, thereby improving user convenience. [Appendix 11] A voice message processing system comprising a user terminal including a display terminal having a display unit and a GPS terminal, and an information processing device for processing voice messages transmitted and received between the user terminals, wherein the information processing device comprises a transfer means for transferring a voice message transmitted from the GPS terminal to the display terminal, and a data transmission means for transmitting the voice message in a data format that can be stored on the display terminal in response to a request from the display terminal. [Effects of Appendix 11] According to the voice message processing system of this embodiment, users can view voice messages transmitted from a GPS terminal on a display terminal and easily save them on their own terminal in a desired format, thereby improving convenience.
[Explanation of symbols]

[0088]
1…Voice message processing system
100…Information processing device
101…CPU
102…RAM
103…ROM
104…Communication interface
105…Storage
110…Transfer unit (transfer means)
120…Data transmission unit (data transmission means)
130…Message analysis unit (message analysis means)
140…Scoring unit (scoring means)
150…Notification unit (notification means)
160…Standard model generation unit (standard model generation means)
170…Video generation unit (video generation means)
180…External linkage unit (external linkage means)
200…User terminals
210…GPS terminal
220…Display terminal
221…Display unit
600…Audio playback screen
601…Download button
700…Notification screen
701…Action button (Save)
702…Notification message
703…Action button (Later)
801…User table
802…Message table
803…Event table
1000…Message list screen
1010…Message recommended for saving
1011…Save recommendation icon
1020…Normal message
1100…Video generation settings screen
1101…Image source selection unit
1102…Effect selection unit
1103…Preview button
1200…Save policy settings screen
1201…Trigger setting section
1202…Action setting section
1203…Add policy button
NW…Network

Claims

1. An information processing device that communicates with user terminals including a display terminal having a display unit and a GPS terminal, and that processes voice messages transmitted and received between the user terminals, the information processing device comprising: a transfer means for transferring a voice message transmitted from the GPS terminal to the display terminal; and a data transmission means for transmitting, in response to a request from the display terminal, the voice message in a data format that can be stored on the display terminal.

2. The information processing device according to claim 1, further comprising: a message analysis means for extracting, from the voice message, voice features, predetermined keywords contained in the text of the message, and contextual information including the time the message was sent; a scoring means for calculating a score indicating the preservation value of the message based on the voice features, keywords, and contextual information extracted by the message analysis means; and a notification means for sending, when the score exceeds a predetermined threshold, a notification to the display terminal recommending that the message be saved.
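As an illustration only, the scoring and threshold check recited in claim 2 could be sketched as follows. The weights, keyword list, threshold, weekend-based context rule, and all names (`AnalysisResult`, `score_message`, `should_notify`) are assumptions made for this sketch; the claim does not specify any of them.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class AnalysisResult:
    """Hypothetical output of the message analysis step."""
    voice_feature: float                      # assumed normalized to [0, 1]
    keywords: list = field(default_factory=list)
    sent_at: Optional[datetime] = None


# Assumed values; the claims leave weights, keywords, and threshold open.
WEIGHTS = {"voice": 0.5, "keyword": 0.3, "context": 0.2}
SAVE_KEYWORDS = {"birthday", "thank you", "congratulations"}
THRESHOLD = 0.6


def score_message(result: AnalysisResult) -> float:
    """Combine voice features, keywords, and context into one score in [0, 1]."""
    keyword_hit = 1.0 if any(k in SAVE_KEYWORDS for k in result.keywords) else 0.0
    # Illustrative context rule: a message sent on a weekend counts as notable.
    is_weekend = result.sent_at is not None and result.sent_at.weekday() >= 5
    context = 1.0 if is_weekend else 0.0
    return (WEIGHTS["voice"] * result.voice_feature
            + WEIGHTS["keyword"] * keyword_hit
            + WEIGHTS["context"] * context)


def should_notify(result: AnalysisResult) -> bool:
    """Recommend saving only when the score exceeds the threshold."""
    return score_message(result) > THRESHOLD
```

In this sketch a message with a strong voice feature, a matching keyword, and a weekend timestamp passes the threshold, while a message with only a weak voice feature does not.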

3. The information processing device according to claim 2, further comprising a standard model generation means for generating and storing a personalized standard model that represents the characteristics of the user's voice based on the user's past voice messages, wherein the message analysis means includes, as one of the voice features, a degree of deviation calculated by comparing the voice features of a newly received voice message with the standard model, and wherein the scoring means dynamically changes the weighting used to calculate the score based on event information obtained from an external source or on the user's response history to past notifications.
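One possible reading of the standard model and the degree of deviation in claim 3 is a per-feature mean and standard deviation with a z-score comparison. The sketch below takes that reading as an assumption; the feature names, the z-score measure, and the weight update rule in `adjust_weight` are hypothetical and are not taken from the application.

```python
from statistics import mean, pstdev


def build_standard_model(history):
    """Per-feature (mean, standard deviation) over a user's past messages.

    `history` is a non-empty list of dicts mapping feature name to value.
    """
    features = history[0].keys()
    return {
        name: (mean([m[name] for m in history]), pstdev([m[name] for m in history]))
        for name in features
    }


def deviation(model, message):
    """Mean absolute z-score of a new message against the standard model."""
    scores = []
    for name, (mu, sigma) in model.items():
        if sigma > 0:  # skip features that never varied in the history
            scores.append(abs(message[name] - mu) / sigma)
    return mean(scores) if scores else 0.0


def adjust_weight(weight, responses, step=0.05):
    """Nudge a scoring weight using past notification responses.

    `responses` holds True for each notification the user acted on by saving.
    The weight rises when at least half were saved and falls otherwise,
    clamped to [0, 1].
    """
    if not responses:
        return weight
    accept_rate = sum(responses) / len(responses)
    delta = step if accept_rate >= 0.5 else -step
    return min(1.0, max(0.0, weight + delta))
```

A larger deviation suggests the message sounds unlike the user's usual voice, which the scoring means could treat as a sign of higher preservation value.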

4. The information processing device according to claim 2, further comprising a video generation means for generating a video file based on the voice message and a profile image associated with the user.

5. The information processing device according to claim 4, wherein the message analysis means further analyzes the emotion of the voice message, and the video generation means changes the video effect added to the video file according to the analyzed emotion.

6. The information processing device according to claim 2, wherein the message analysis means further extracts location information obtained from the GPS terminal as the contextual information, and the scoring means weights the score higher when the location information indicates a position within a specific area registered in advance, or a non-routine location that is not part of past behavioral patterns.
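The location-based weighting of claim 6 could, for example, be sketched with circular registered areas and a distance check against past positions. The haversine distance, the circular area shape, the 1 km radius, and the 1.5 multiplier are all assumptions chosen for illustration; the claim only states that the score is weighted higher in these two cases.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def location_weight(pos, registered_areas, past_positions, radius_km=1.0, boost=1.5):
    """Return a score multiplier for the position a message was sent from.

    registered_areas: list of ((lat, lon), radius_km) circles set in advance.
    past_positions:   list of (lat, lon) points from the user's usual behavior.
    """
    in_registered = any(haversine_km(pos, center) <= r for center, r in registered_areas)
    # Non-routine: there is a history, and the position is far from all of it.
    non_routine = bool(past_positions) and all(
        haversine_km(pos, p) > radius_km for p in past_positions
    )
    return boost if in_registered or non_routine else 1.0
```

With an empty history the sketch does not treat a position as non-routine, since there is no behavioral pattern to compare against.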

7. The information processing device according to claim 2, further comprising an external linkage means for automatically sending, in accordance with a storage policy set in advance by the user, data generated from a message whose score exceeds a predetermined threshold to a linked external storage service.

8. The information processing device according to claim 7, wherein the storage policy includes a confirmation mode in which the data is sent when triggered by a positive response from the user to the notification, and the notification means transmits to the display terminal a notification containing an action button operable to give the positive response.
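The interaction between the storage policy of claim 7 and the confirmation mode of claim 8 can be summarized as a small decision function. This is a minimal sketch; the `Mode` enum, the `StoragePolicy` fields, and `should_upload` are hypothetical names, and the actual upload to an external storage service is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    AUTO = "auto"        # send as soon as the score exceeds the threshold
    CONFIRM = "confirm"  # send only after a positive response to the notification


@dataclass
class StoragePolicy:
    threshold: float
    mode: Mode


def should_upload(policy: StoragePolicy, score: float,
                  user_response: Optional[bool] = None) -> bool:
    """Decide whether data for a message is sent to the external storage service.

    user_response is True when the user pressed the Save action button,
    False for any other response, and None while no response has arrived.
    """
    if score <= policy.threshold:
        return False
    if policy.mode is Mode.AUTO:
        return True
    return user_response is True
```

In confirmation mode the function returns False until the positive response arrives, so a caller would evaluate it again when the action button is pressed.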

9. A voice message processing method for processing voice messages transmitted and received between user terminals including a display terminal having a display unit and a GPS terminal, the method comprising: forwarding, by a processor, a voice message transmitted from the GPS terminal to the display terminal; and transmitting, by the processor, in response to a request from the display terminal, the voice message in a data format that can be stored on the display terminal.

10. A program that causes a processor to execute processing comprising: forwarding a voice message transmitted from a GPS terminal to a display terminal; and transmitting, in response to a request from the display terminal, the voice message in a data format that can be stored on the display terminal.

11. A voice message processing system comprising: user terminals including a display terminal having a display unit and a GPS terminal; and an information processing device that processes voice messages transmitted and received between the user terminals, wherein the information processing device comprises: a transfer means for transferring a voice message transmitted from the GPS terminal to the display terminal; and a data transmission means for transmitting, in response to a request from the display terminal, the voice message in a data format that can be stored on the display terminal.