system
A system that collects and converts user-selected information into personalized audio data, optimizing content based on feedback and preferences, addresses the challenge of obtaining relevant information during limited time, enhancing user experience.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-09
- Publication Date
- 2026-06-19
AI Technical Summary
The challenge of efficiently obtaining personalized information during limited time periods, such as commuting, and the lack of effective means to utilize time due to insufficient personalized information delivery systems.
A system that allows users to input preferred information genres and desired playback times, collecting and summarizing relevant information in real-time from the internet and databases, converting it to audio data, and optimizing content based on user feedback and past preferences, with added background sounds for enhanced listening experience.
Enables efficient delivery of personalized audio information tailored to user interests and emotional states, enhancing the listening experience and improving time utilization.
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance as a response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] Today, it is difficult to efficiently pick out personally relevant information from the flood of available information. In addition, because there is no easy way to obtain personalized information during a commute or other limited free time, that time cannot be used effectively in daily life.
Means for Solving the Problems
[0005] This invention proposes a system that, by allowing users to input their preferred information genres and desired playback time in advance, collects necessary information in real time from the internet and databases, analyzes and summarizes that information, and provides it as audio data. Furthermore, this system includes means for optimizing information provision based on user feedback and past preference history, thereby making the information provided more appropriate and personalized. Background sounds and jingles are added to the audio data to enhance the listening experience.
[0006] A "user" is an individual or organization that uses the system and receives information from it.
[0007] A "topic of interest" refers to a specific area of information that a user is interested in and wishes to receive information about.
[0008] "Desired playback time" refers to the time or period during which the user wishes to listen to the audio data.
[0009] A "database" is a collection of information used to store and manage information, and it is the source from which a system retrieves information.
[0010] "Internet resources" are information sources that exist on the web, and are online data sources that systems access to obtain information.
[0011] "Means of information gathering" refers to functions or processes for obtaining necessary information from databases and internet resources.
[0012] "Means of analyzing and summarizing information" refers to methods or techniques for scrutinizing collected information, extracting only the essential elements, and presenting them in a concise form.
[0013] "Means of converting to audio data" refers to speech synthesis technologies and processes for converting summarized information into audio format.
[0014] "Means for transmitting audio data" refers to a communication method or protocol for delivering generated audio data to a user's terminal.
[0015] "Means for playing audio data" refers to playback functions or devices that output audio data sent to the user's terminal in a format that can be listened to.
[0016] A "means of receiving feedback" refers to a process or interface for obtaining opinions and evaluations from users and using them to improve the system.
[0017] "Means of applying machine learning" refers to technologies that use algorithms based on user data and feedback to enhance personalization.
[0018] "Methods for adding background sounds or jingles" refer to methods or tools for incorporating additional sounds into audio data to improve the listening experience.
Brief Description of the Drawings
[0019]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the essential functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] It is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 10] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 11] It is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] It is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] It is a sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] It is a sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.
Modes for Carrying Out the Invention
[0020] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.
[0021] First, the terms used in the following description will be explained.
[0022] In the following embodiments, a processor with a reference number (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of a plurality of arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of a plurality of types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.
[0023] In the following embodiments, RAM (Random Access Memory) with a reference number is memory that temporarily stores information and is used as work memory by the processor.
[0024] In the following embodiments, storage with a reference number is one or more non-volatile storage devices that store various programs and various parameters. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.
[0025] In the following embodiments, a communication interface (I/F) with a reference number is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0026] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."
[0027] [First Embodiment]
[0028] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0029] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0030] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0031] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0032] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0033] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0034] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0035] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0036] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0037] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0038] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with the specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0039] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0040] The system of this invention provides personalized audio information by allowing the user to input a specific information genre and desired playback time. The system's operation is mainly performed by a server and a user terminal, which are described below.
[0041] When a user accesses the application, they select the information genres they are interested in and enter the time at which they want to listen to the audio program each day. These inputs define the requirements for the content to be prepared for that user. Once the user enters and submits this information, the device sends the data to the server.
[0042] The server collects information in real time from the internet and existing databases based on specified information genres. By querying diverse information sources such as news sites and APIs, it gathers the latest and most relevant information. The server then summarizes the collected information using natural language processing techniques, extracting the essence of entertainment and news.
[0043] Based on this summarized information, the server generates audio data using speech synthesis technology. The audio data is customized to match the user's language and voice tone, and background sounds and jingles are added to further enhance the listening experience.
[0044] The generated audio data is sent from the server to the user's device. The user's device automatically plays this audio data at a set time, allowing the user to listen to the information hands-free.
[0045] As a concrete example, suppose a user is interested in international news and entertainment and sets the radio-style program to play at 7 AM every morning. The server collects the latest information from international news sites and entertainment blogs, creates a summary, and then converts it into an audio radio program. The user's device plays the audio at 7 AM, allowing the user to listen to the latest news during their commute.
[0046] Furthermore, based on user feedback and usage history, the server optimizes the selection of information it provides using machine learning technology. This allows the system to provide information that better matches user preferences over time. In this way, information tailored to user needs can be efficiently delivered.
[0047] The following describes the processing flow.
[0048] Step 1:
[0049] The user launches the application and enters their preferred information genres and desired radio playback time. This information is then sent from the user's device to the server.
[0050] Step 2:
[0051] Based on the genre of information it receives, the server accesses relevant internet resources and databases to collect the latest information.
[0052] Step 3:
[0053] The server analyzes the collected information using natural language processing technology, extracts important information, and creates a summary. Here, it evaluates the relevance and news value of the information and tailors it to the user's interests.
[0054] Step 4:
[0055] The server inputs the generated summary into a speech synthesis system and converts it into audio data according to the specified language and voice tone. Furthermore, it enhances the listening experience by adding background sounds and jingles to the audio data.
[0056] Step 5:
[0057] The server sends the generated audio data to the user's device. The transmission is scheduled according to the playback time set by the user.
[0058] Step 6:
[0059] The device plays audio data at a set time. Users can listen to personalized information through the audio.
[0060] Step 7:
[0061] Users can provide feedback after listening. This feedback is sent from the device to the server and used to improve the information provided.
[0062] Step 8:
[0063] The server accumulates user feedback and past usage history, and applies machine learning to optimize the information collection and delivery process. This continuous learning process ensures that subsequent information is better aligned with the user's needs.
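The server-side portion of the flow above (Steps 2 to 5) can be sketched as a small pipeline. This is only an illustrative outline under stated assumptions: the names Request, Program, and build_program, and the four injected callables, are hypothetical and do not appear in the disclosure; each callable stands in for a component that the text describes only functionally.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Request:
    genres: List[str]    # preferred information genres (Step 1)
    playback_time: str   # desired playback time, e.g. "07:00" (Step 1)


@dataclass
class Program:
    playback_time: str   # when the terminal should play the audio (Steps 5-6)
    audio: bytes         # finished audio data


def build_program(
    request: Request,
    collect: Callable[[List[str]], List[str]],  # Step 2: gather raw texts
    summarize: Callable[[List[str]], str],      # Step 3: analyze and summarize
    synthesize: Callable[[str], bytes],         # Step 4: text to speech
    add_background: Callable[[bytes], bytes],   # Step 4: background sound / jingle
) -> Program:
    """Run the collect -> summarize -> synthesize -> decorate chain once."""
    articles = collect(request.genres)
    summary = summarize(articles)
    audio = add_background(synthesize(summary))
    return Program(playback_time=request.playback_time, audio=audio)


if __name__ == "__main__":
    # Stub components, used only to show the data flow.
    program = build_program(
        Request(genres=["international news"], playback_time="07:00"),
        collect=lambda genres: [f"Latest item about {g}." for g in genres],
        summarize=lambda texts: " ".join(texts),
        synthesize=lambda text: text.encode("utf-8"),
        add_background=lambda audio: b"[jingle]" + audio,
    )
    print(program.playback_time, len(program.audio))
```

Passing the components in as callables keeps the sketch independent of any particular news source, summarizer, or speech engine, none of which the first embodiment fixes.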
[0064] (Example 1)
[0065] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0066] In recent years, in our information-saturated society, users need to efficiently and accurately acquire information that matches their interests and lifestyles. However, conventional information delivery systems have struggled to select and personalize information suitable for users from a vast amount of data. Furthermore, audio-based information delivery has lacked the flexibility to be customized to the user's preferences.
[0067] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0068] In this invention, the server includes means for receiving information areas of interest and desired playback time from the user; means for collecting information from a data store and a communication network based on the received information areas; means for using natural language processing technology to process and summarize the collected information; means for converting the summarized information into audio data using speech synthesis technology; and means for adding background sounds and sound effects to the audio data. This enables the user to effectively obtain personalized audio information tailored to their interests and lifestyle at a specified time.
[0069] A "user" is an entity that receives services from an information system that meet their own interests and needs.
[0070] "Areas of interest" refers to categories of specific topics or content that a user is interested in.
[0071] "Desired playback time" refers to the specific time period that the user specifies for listening to the audio information.
[0072] A "data store" refers to a storage device used to store and manage collected information and data.
[0073] A "communication network" refers to the infrastructure that enables the exchange of information and data, and includes the internet.
[0074] "Natural language processing technology" is a general term for technologies that enable computers to understand and process human language.
[0075] "Speech synthesis technology" refers to the technology that converts text information into speech.
[0076] "Background sounds and sound effects" refer to sounds used to complement the main audio and improve the listening experience.
[0077] An "automated learning algorithm" refers to a process that uses machine learning techniques to automatically learn patterns and rules from data and make decisions based on those patterns.
[0078] A "generative AI model" refers to an artificial intelligence model that is trained for a specific purpose and generates text, audio, images, and other data.
[0079] A "prompt statement" refers to an instruction given to a generative AI model to perform a specific task.
[0080] The system of the present invention provides customized audio information by allowing the user to input a specific information field and desired playback time. The system is implemented primarily through the interaction of a server, a terminal, and the user.
[0081] Users use a device with a dedicated application installed to input their areas of interest (e.g., international news, entertainment) and the time they wish to receive the information (e.g., 7 AM every morning). This information is transmitted to the server via a communication network.
[0082] Based on the information received, the server collects information in real time via data stores and the internet. For example, it retrieves the latest information from news site APIs and RSS feeds. Query techniques are used here to efficiently collect data.
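As an illustration of retrieving items from an RSS feed and keeping only those that match the user's area of interest, the following sketch parses an RSS 2.0 document with the Python standard library. It is an assumption-laden example: the function names, the keyword-matching rule, and the sample feed (with example.com links) are hypothetical, and a real server would first download the feed text over the network.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_rss_items(rss_xml: str) -> List[Dict[str, str]]:
    """Extract title, link, and description from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items


def filter_by_keywords(items: List[Dict[str, str]], keywords: List[str]) -> List[Dict[str, str]]:
    """Keep items whose title or description mentions any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [
        it for it in items
        if any(k in (it["title"] + " " + it["description"]).lower() for k in lowered)
    ]


# Hypothetical feed content used in place of a network request.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example feed</title>
<item><title>Summit ends with trade accord</title><link>https://example.com/a</link><description>International talks conclude.</description></item>
<item><title>Local bakery opens</title><link>https://example.com/b</link><description>Fresh bread daily.</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for hit in filter_by_keywords(parse_rss_items(SAMPLE_FEED), ["international"]):
        print(hit["title"], hit["link"])
```

Simple keyword matching is used here only to keep the example short; the disclosure does not specify how queries are matched against the area of interest.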
[0083] Next, the server processes the collected information using natural language processing techniques. This process uses a generative AI model to summarize the text. A concrete example of a prompt is, "Summarize the given news in three lines." This allows for the extraction of only the necessary information.
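A minimal sketch of how the quoted prompt might be combined with an article and passed to a generative AI model is shown below. The generate callable is a hypothetical stand-in for whatever model interface the server uses; the disclosure does not name one. Trimming the response to three non-empty lines is an added safeguard, not something the text requires.

```python
from typing import Callable

# The instruction text quoted in the description.
SUMMARY_INSTRUCTION = "Summarize the given news in three lines."


def build_summary_prompt(article_text: str) -> str:
    """Attach the article to the instruction as a single prompt string."""
    return f"{SUMMARY_INSTRUCTION}\n\nNews:\n{article_text.strip()}"


def summarize(article_text: str, generate: Callable[[str], str], max_lines: int = 3) -> str:
    """Call the (hypothetical) model and keep at most max_lines non-empty lines."""
    raw = generate(build_summary_prompt(article_text))
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    return "\n".join(lines[:max_lines])


if __name__ == "__main__":
    # A fake model that ignores the prompt, used only to demonstrate the call shape.
    fake_model = lambda prompt: "Line one.\nLine two.\nLine three.\nExtra line."
    print(summarize("Some long news article text.", fake_model))
```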
[0084] The summarized information is converted into audio format by the server using speech synthesis technology. This conversion utilizes a speech synthesis engine, and the voice tone is adjusted according to the user's language and preferences. Furthermore, background sounds and sound effects are added to enhance the listening experience.
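Adding a background sound to synthesized speech amounts to mixing two sample streams. The sketch below shows one simple way to do this, assuming both signals are already decoded to mono floating-point samples in the range -1.0 to 1.0 at the same sample rate. The function names and the default gain of 0.2 are illustrative assumptions, not values from the disclosure.

```python
from typing import List


def mix_background(speech: List[float], background: List[float], bg_gain: float = 0.2) -> List[float]:
    """Mix a looped, attenuated background under the speech and clip to [-1.0, 1.0]."""
    if not background:
        return list(speech)
    mixed = []
    for i, sample in enumerate(speech):
        bg = background[i % len(background)]  # loop the background if it is shorter
        value = sample + bg_gain * bg
        mixed.append(max(-1.0, min(1.0, value)))  # hard clip to avoid overflow
    return mixed


def prepend_jingle(jingle: List[float], program: List[float]) -> List[float]:
    """Place a jingle before the program audio."""
    return list(jingle) + list(program)


if __name__ == "__main__":
    speech = [0.5, -0.5, 1.0, 0.0]
    background = [1.0, -1.0]
    print(prepend_jingle([0.1, 0.1], mix_background(speech, background)))
```

Keeping the background gain well below 1.0 is a common way to keep the spoken content intelligible; a production system would more likely use an audio library than per-sample Python loops.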
[0085] The generated audio data is sent from the server to the user's device. The device plays the audio at the specified time, allowing the user to efficiently listen to the information while engaging in their daily activities. For example, it is possible to check the latest international news in audio format while commuting.
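Playing the audio "at the specified time" requires the terminal to work out when the next occurrence of that time is. The following sketch computes it for a daily schedule, assuming the desired playback time is given as an "HH:MM" string; the function name and that string format are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def next_playback(now: datetime, playback_time: str) -> datetime:
    """Return the next occurrence of playback_time ("HH:MM") strictly after now."""
    hour, minute = (int(part) for part in playback_time.split(":"))
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed; use tomorrow's
    return candidate


if __name__ == "__main__":
    now = datetime(2024, 12, 9, 6, 30)
    print(next_playback(now, "07:00"))  # same day, 07:00
    print(next_playback(datetime(2024, 12, 9, 8, 0), "07:00"))  # next day, 07:00
```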
[0086] Furthermore, based on user feedback and past usage history, the server applies machine learning algorithms to optimize the selection of information it provides. This enhances its ability to deliver information that better matches the user's interests and preferences over time.
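The disclosure does not say which machine learning algorithm is applied. As a deliberately simple stand-in, the sketch below keeps a per-genre preference score and updates it from feedback with an exponential moving average, then ranks genres by score. The rating scale (0.0 to 1.0), the neutral starting score of 0.5, and the learning rate are all assumptions.

```python
from typing import Dict, List


def update_preferences(scores: Dict[str, float], genre: str, rating: float, rate: float = 0.3) -> Dict[str, float]:
    """Blend a new rating (0.0 to 1.0) into the genre's score; unseen genres start at 0.5."""
    previous = scores.get(genre, 0.5)
    updated = dict(scores)  # leave the caller's dictionary unchanged
    updated[genre] = (1 - rate) * previous + rate * rating
    return updated


def rank_genres(scores: Dict[str, float]) -> List[str]:
    """Return genres ordered from most to least preferred."""
    return sorted(scores, key=lambda g: scores[g], reverse=True)


if __name__ == "__main__":
    scores: Dict[str, float] = {}
    scores = update_preferences(scores, "international news", 1.0)  # liked
    scores = update_preferences(scores, "entertainment", 0.0)       # skipped
    print(rank_genres(scores))
```

Because older feedback decays geometrically, the ranking follows gradual changes in the user's interests, which matches the "over time" behavior described above without claiming to be the method used.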
[0087] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0088] Step 1:
[0089] Users input their areas of interest and desired playback time through a dedicated application. The user's device structures this input data in JSON format and sends it to the server. The input includes specific information such as "International News" and "Every morning at 7:00".
[0090] Step 2:
[0091] The server parses the received JSON data and collects information from relevant data stores and online sources (e.g., news APIs, RSS feeds) based on areas of interest. The server uses queries to search for information that matches the specified conditions and retrieves the latest data. This process outputs relevant information, such as specific news articles or entertainment news.
[0092] Step 3:
[0093] The server processes the collected information using natural language processing technology. Specifically, it summarizes long texts by prompting a generative AI model with the message, "Summarize the given news in three lines." The output is a summary that focuses on the points important to the user.
[0094] Step 4:
[0095] The server converts the summarized information into audio data using speech synthesis technology. The speech synthesis engine generates speech that matches the language and tone of voice specified by the user, and further enhances the listening experience by adding background music and sound effects. The generated audio file is then output.
[0096] Step 5:
[0097] The server sends audio data to the user's device. This transmission uses data transfer technology over the internet. The user's device prepares to start playing this audio data at a specified playback time (e.g., 7 AM every morning) using an alarm or notification.
[0098] Step 6:
[0099] The user's device plays audio data at the set time. Users can listen to it during their commute or while doing household chores without any special operation. This allows users to efficiently acquire personalized information.
[0100] Step 7:
[0101] The server collects user feedback and past usage history, and applies machine learning algorithms to improve the accuracy and relevance of the information it provides. This enables the provision of information tailored to the user's preferences over time.
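Steps 1 and 2 above describe the terminal structuring the input as JSON and the server parsing it. One possible shape for that exchange is sketched below. The field names interests, playback_time, and repeat are hypothetical, since the disclosure gives only the example values "International News" and "Every morning at 7:00", and the validation rules are likewise assumptions.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class PlaybackRequest:
    interests: List[str]   # e.g. ["International News"]
    playback_time: str     # "HH:MM", e.g. "07:00"
    repeat: str            # e.g. "daily"


def encode_request(req: PlaybackRequest) -> str:
    """Terminal side: serialize the user's input as JSON."""
    return json.dumps({
        "interests": req.interests,
        "playback_time": req.playback_time,
        "repeat": req.repeat,
    })


def decode_request(payload: str) -> PlaybackRequest:
    """Server side: parse and minimally validate the JSON payload."""
    data = json.loads(payload)
    interests = data["interests"]
    if not isinstance(interests, list) or not interests:
        raise ValueError("interests must be a non-empty list")
    hour, minute = (int(part) for part in data["playback_time"].split(":"))
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("playback_time must be HH:MM")
    return PlaybackRequest(interests, f"{hour:02d}:{minute:02d}", data.get("repeat", "daily"))


if __name__ == "__main__":
    payload = encode_request(PlaybackRequest(["International News"], "07:00", "daily"))
    print(payload)
    print(decode_request(payload))
```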
[0102] (Application Example 1)
[0103] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0104] Due to information overload, users face the challenge of not being able to efficiently select the information they need and process it at their preferred time. Furthermore, the lack of sufficient personalized information tailored to users' interests and needs means they often receive a large amount of irrelevant information. Additionally, the lack of a centralized method for receiving information across multiple media formats increases the effort required for users.
[0105] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0106] In this invention, the server includes means for receiving the user's areas of interest and desired playback time, means for collecting information from information sources, and means for analyzing and summarizing the collected information. This allows the user to select information according to their interests and needs and listen to it efficiently at a specified time.
[0107] A "user" is someone who uses a device to receive information, specifying a particular information genre and desired playback time.
[0108] "Information genres of interest" refers to categories of information that users are particularly interested in and wish to collect or share.
[0109] "Desired playback time" refers to the specific time when the user wishes to have the information played back.
[0110] "Information sources" refer to resources such as databases, websites, and APIs that provide information available on the internet.
[0111] "Means of information gathering" refers to the methods and techniques for extracting necessary information from information sources.
[0112] "Methods for analysis and summarization" refer to techniques for processing collected information, extracting its key points, and presenting them concisely.
[0113] "Audio data" refers to data that has been converted into audio format from summarized information intended to be provided to users.
[0114] "Background sound" refers to additional music or sound effects played alongside audio data, designed to enhance the listening experience.
[0115] A "user terminal" is an electronic device used to receive and play audio data.
[0116] "Means of playback" refers to technologies that enable users to listen to audio data on their devices.
[0117] "Feedback" refers to information and opinions provided by users, which are used to improve the services provided.
[0118] A "learning algorithm" is a computational method for optimizing information delivery based on feedback and historical data.
[0119] "Multiple media formats" refers to providing information in various formats, including not only audio, but also text, images, and more.
[0120] To implement this invention, a system is used in which a server and a user terminal cooperate. When a user uses a smartphone or other device to input and submit their preferred information genre and desired playback time, the server accepts this input and begins processing it.
[0121] The server collects necessary information from existing databases and various information sources on the internet. Specifically, it accesses news databases and websites such as blogs via APIs to retrieve content related to a specified genre. Next, it uses natural language processing software (e.g., NLTK or spaCy) running on the server side to analyze and summarize the collected information. This extracts the essence of the information and presents it in a format that is easy for the user to understand.
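To make the "analyze and summarize" step concrete, the sketch below implements a basic frequency-based extractive summarizer using only the Python standard library. It is a substitute technique chosen so the example runs without extra dependencies; it does not use NLTK or spaCy, and the disclosure does not state which summarization method those tools would be used for. The stopword list and scoring rule are assumptions.

```python
import re
from collections import Counter
from typing import List

# A tiny illustrative stopword list; real tools ship much larger ones.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
             "for", "on", "that", "with", "as", "by", "it"}


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text: str) -> List[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text: str, max_sentences: int = 2) -> str:
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence: str) -> float:
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    article = ("Chip makers announced new chips. The weather was mild. "
               "New chips use less power.")
    print(extractive_summary(article))
```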
[0122] Next, the summarized information is converted into audio data using speech synthesis technology, such as Amazon's speech synthesis service. Background sounds may be added to this audio data to enhance listening comfort. The generated audio data is sent from the server to the user's device and automatically played at a set time. This allows users to obtain necessary information hands-free in their daily lives.
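If the speech synthesis service is Amazon Polly accessed through boto3, the conversion could look like the sketch below. The Polly client is passed in rather than created inside the function so that the example has no hard dependency on AWS credentials; the function name and the default voice "Joanna" are illustrative choices, and nothing here reflects how the applicant actually calls the service.

```python
from typing import Any


def synthesize_mp3(polly_client: Any, text: str, voice_id: str = "Joanna") -> bytes:
    """Convert text to MP3 bytes using a Polly-style client.

    polly_client is expected to behave like boto3.client("polly"):
    synthesize_speech returns a dict whose "AudioStream" value has read().
    """
    response = polly_client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    return response["AudioStream"].read()


# Example use with a real client (requires boto3 and AWS credentials):
#
#   import boto3
#   audio = synthesize_mp3(boto3.client("polly"), "Good morning. Here is today's news.")
#   with open("program.mp3", "wb") as f:
#       f.write(audio)
```

Injecting the client also makes it straightforward to substitute a different speech engine or a test double.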
[0123] The system collects user feedback and past usage history, and uses a learning algorithm to optimize the information it provides. This allows for more personalized information delivery over time.
[0124] For example, if a user requests "technical news" and "8 AM every morning," a summary audio of the latest technology-related news will automatically play at that time every morning. An example of a prompt might be, "Generate a summary of technical news and output it in podcast format. The playback time should be 10 minutes." This system allows users to efficiently and continuously obtain the latest information tailored to their interests.
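A prompt of this kind can be generated from the user's settings. The sketch below fills in the genre and target duration and adds an approximate word budget so that the generated script roughly fits the playback time. The speaking rate of 150 words per minute is an assumed figure for illustration only, and the function names are hypothetical.

```python
def word_budget(minutes: float, words_per_minute: int = 150) -> int:
    """Approximate number of words that fit in the given playback time."""
    return int(minutes * words_per_minute)


def build_podcast_prompt(genre: str, minutes: int, words_per_minute: int = 150) -> str:
    """Build a prompt modeled on the example in the description."""
    budget = word_budget(minutes, words_per_minute)
    return (
        f"Generate a summary of {genre} and output it in podcast format. "
        f"The playback time should be {minutes} minutes "
        f"(about {budget} words at {words_per_minute} words per minute)."
    )


if __name__ == "__main__":
    print(build_podcast_prompt("technical news", 10))
```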
[0125] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0126] Step 1:
[0127] Users access the application using their smartphones or other devices and enter their preferred information genres and desired playback times. The application retrieves the entered data and sends it to the server. This input includes information such as "technical news" and "every morning at 8 AM," which serves as the basic data for subsequent processing.
[0128] Step 2:
[0129] The server collects appropriate information from its database of information sources and the internet based on the information genre received from the user. The server uses an API to retrieve information of the specified genre from news sites and specialized blogs. Here, the input is the information genre requested by the user, and the output is a raw data set related to that genre.
[0130] Step 3:
[0131] The server analyzes and summarizes the collected information using natural language processing software. Specifically, it uses natural language processing tools (e.g., spaCy) to analyze text and extract and compress important elements. The input is raw information data, and the output is summarized information provided to the user.
[0132] Step 4:
[0133] Based on the summarized information, the server generates audio data using speech synthesis technology. The summarized data is input into speech synthesis software (e.g., Amazon Polly) and converted into a speech format that closely resembles a human voice. Background sounds are added to this generated audio data as needed. The output is the final audio file.
[0134] Step 5:
[0135] The generated audio data is sent from the server to the user's terminal. The terminal is configured to automatically play the audio data at the specified playback time. The output here is the audio information that the user hears on the terminal.
[0136] Step 6:
[0137] Users provide feedback after listening. The server receives this feedback and applies machine learning algorithms to optimize future information delivery to better suit the user's preferences. The input is the user's feedback and behavioral history, and the output is a refined information delivery method.
[0138] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0139] This invention is a system for providing personalized audio information that takes the user's emotions into consideration. By incorporating an emotion engine, this system enables information provision based on the user's emotional state.
[0140] First, the user accesses the application and enters their preferred information genres and desired playback time. The user's past preference and emotion history can also be stored in a database for further personalization, which allows for more refined information delivery.
[0141] The server receives information sent by the user and gathers information from internet resources and databases related to the specified information genre. The collected information is analyzed and summarized using natural language processing technology. Meanwhile, an emotion engine built into the terminal recognizes the user's emotions in real time through their voice and interactions and sends that data to the server.
[0142] The server uses emotion engine data to adjust the tone of the audio data to match the user's current emotions. For example, if the server determines that the user is stressed, it will adjust the audio to provide relaxing information and select a more calming tone. Background sounds and jingles are added to this audio data to further enhance the listening experience.
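One way to picture the tone adjustment described above is as a lookup from the estimated emotion to speech parameters, which are then expressed in SSML (the W3C markup accepted by many speech synthesis services). The following is a minimal sketch under stated assumptions: the emotion labels, the `ToneProfile` fields, the background-sound identifiers, and the mapping itself are all hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class ToneProfile:
    rate: str        # SSML prosody rate keyword (e.g. "slow", "medium", "fast")
    pitch: str       # SSML prosody pitch keyword (e.g. "low", "medium", "high")
    background: str  # hypothetical identifier of a background sound


# Hypothetical mapping from emotion-engine labels to tone settings.
TONE_BY_EMOTION = {
    "stressed": ToneProfile("slow", "low", "soft_piano"),
    "relaxed": ToneProfile("medium", "medium", "ambient"),
    "energetic": ToneProfile("fast", "high", "upbeat"),
}
DEFAULT_TONE = ToneProfile("medium", "medium", "none")


def to_ssml(summary: str, emotion: str) -> tuple[str, str]:
    """Wrap the summary in an SSML prosody element chosen from the emotion.

    Returns the SSML string and the background-sound identifier to mix in later.
    """
    tone = TONE_BY_EMOTION.get(emotion, DEFAULT_TONE)
    ssml = (
        f'<speak><prosody rate="{tone.rate}" pitch="{tone.pitch}">'
        f"{escape(summary)}</prosody></speak>"
    )
    return ssml, tone.background
```

Whether a given synthesis engine honors every prosody attribute depends on the engine and voice, so the attribute values should be treated as requests rather than guarantees.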
[0143] As a concrete example, suppose a user enjoys sports news and typically wants to refresh themselves during their morning commute. The server collects sports news, the emotion engine analyzes the user's mood for the day, and the server generates audio in a relaxed tone. The generated audio is sent to the user's device and automatically plays at the set time. The user can then listen to the radio-style program while receiving information optimized for their mood that day.
[0144] In this way, the system can adapt to the user's emotions and preferences, providing more personalized voice information. This enables users to make better use of their time in daily life and improve the quality of information they receive.
[0145] The following describes the processing flow.
[0146] Step 1:
[0147] The user launches the application and enters their preferred information genres, desired playback time, and current emotional state. Past preference history is updated as needed.
[0148] Step 2:
[0149] The emotion engine built into the device recognizes the user's emotional state in real time through their voice and facial expressions, and sends that data to the server.
[0150] Step 3:
[0151] Based on the received information genre and emotion data, the server queries relevant internet resources and databases to collect the latest information.
[0152] Step 4:
[0153] The server analyzes the collected information using natural language processing technology, extracting and summarizing important and relevant information. In doing so, it takes into account the user's emotional state.
[0154] Step 5:
[0155] The server adjusts the tone and content of the audio data to match the analyzed emotional data. For example, if the user wants to relax, it will choose a calm tone and an interesting topic.
[0156] Step 6:
[0157] The server uses speech synthesis technology to convert the summarized information into audio data, and adds background sounds and jingles to enhance the listening experience.
[0158] Step 7:
[0159] The server sends the generated audio data to the user's device. The audio data becomes available for playback according to the user's set playback time.
[0160] Step 8:
[0161] The device automatically plays audio data at a specified time. The user listens to information optimized for their emotional state and interests.
[0162] Step 9:
[0163] Users provide feedback via their device after listening. This feedback is sent to the server and used to optimize future information delivery.
[0164] Step 10:
[0165] The server accumulates user feedback and past usage history, and applies machine learning to further optimize its information delivery strategy. This update is continuous and becomes increasingly personalized.
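The feedback loop of Steps 9 and 10 above can be illustrated with a very simple update rule. The source only states that machine learning is applied; the exponential moving average below is a stand-in chosen for brevity, and the function names, the 1 to 5 rating scale, the neutral starting weight of 0.5, and the learning rate are all assumptions.

```python
def update_preferences(
    weights: dict[str, float],
    genre: str,
    rating: int,
    learning_rate: float = 0.2,
) -> dict[str, float]:
    """Move the weight of `genre` toward the user's rating (1 = poor, 5 = excellent)."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    target = (rating - 1) / 4          # map the rating onto the range 0.0 .. 1.0
    updated = dict(weights)            # leave the caller's dictionary untouched
    current = updated.get(genre, 0.5)  # unseen genres start from a neutral weight
    updated[genre] = current + learning_rate * (target - current)
    return updated


def rank_genres(weights: dict[str, float]) -> list[str]:
    """Genres ordered from most to least preferred."""
    return sorted(weights, key=weights.get, reverse=True)
```

Each new rating nudges the stored weight rather than replacing it, so a single unusual day does not overturn a long-standing preference.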
[0166] (Example 2)
[0167] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0168] When providing information to users, simply broadcasting standardized information via audio presents a challenge in adequately satisfying the emotional needs and preferences of individual users. Furthermore, improving the quality of information provided and the user experience requires considering the user's individual emotional state and past preference history, but there is a lack of systems that can appropriately reflect this.
[0169] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0170] In this invention, the server includes means for receiving information genres and desired playback times from the user, means for collecting information from information resources, and means for analyzing and summarizing the collected information using natural language processing technology. This makes it possible to provide personalized audio information tailored to the user's emotional state.
[0171] "Information genre" refers to the specific types or categories of information that users are interested in.
[0172] "Desired playback time" refers to the time period or duration during which the user wishes to receive the information.
[0173] "Information resources" refers to a collection of internet resources and databases that are accessed to obtain information.
[0174] "Natural language processing technology" refers to methods and techniques that enable computers to understand and process the language that humans use on a daily basis.
[0175] A "generative AI model" refers to algorithms and technologies that use artificial intelligence to generate text, speech, and other data.
[0176] An "emotion engine" refers to a technology or system that recognizes a user's emotional state and performs processing according to that state.
[0177] "Machine learning" refers to the technology of learning patterns from data to automate or improve specific tasks.
[0178] "Tone adjustment" refers to modifying the sound quality and expression of audio data to match the user's emotions and preferences.
[0179] This system aims to provide personalized audio information based on the user's emotional state and preferences. First, the user enters their preferred information genres and desired playback time through the application. This information is received by the server. The server utilizes databases and the internet to collect content from information resources related to the specified genres. The collected information is analyzed and summarized using natural language processing techniques. Specifically, text summarization models and topic modeling are used to concisely summarize vast amounts of information.
[0180] The device incorporates an emotion engine that recognizes emotions in real time through the user's voice and interactions. The recognized emotion data is sent to a server and used to generate audio data. The server uses this emotion data and collected information to create audio data using a generative AI model. The tone and pace of the voice are adjusted to suit the user's emotional state. Furthermore, background sounds and jingles are added to the audio data to enhance the listening experience.
[0181] The final audio data is sent to the user's device and automatically played at the specified time. For example, if a user requests to "listen to the morning news for 10 minutes," the server will collect the latest news, and if the emotion engine determines that the user is in relaxation mode, it will generate audio in a calm tone. Prompts may include phrases such as, "Generate an appropriate audio tone when the user is seeking relaxation," or "Create optimal audio information based on the user's past preference history and current mood."
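As a hedged illustration of how such a prompt could be assembled from the request, the emotion engine output, and the preference history, consider the sketch below. The function name, the prompt wording, and the speaking rate of 150 words per minute used to turn a playback duration into a word budget are assumptions, not values given in the source.

```python
def build_audio_prompt(
    genre: str,
    minutes: int,
    emotion: str,
    liked_topics: list[str],
    words_per_minute: int = 150,  # assumed average speaking rate
) -> str:
    """Compose a prompt for a generative AI model from the user's request and state."""
    word_budget = minutes * words_per_minute
    history = ", ".join(liked_topics) if liked_topics else "none recorded"
    return (
        f"Create a {genre} briefing of about {word_budget} words "
        f"({minutes} minutes of speech).\n"
        f"Current user mood: {emotion}.\n"
        f"Past preferences: {history}.\n"
        "Choose a tone and pace appropriate to the user's current mood."
    )
```

For the request to listen to the morning news for 10 minutes, `build_audio_prompt("morning news", 10, "relaxed", ["economy"])` yields a prompt that asks for roughly 1500 words in a mood-appropriate tone.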
[0182] This system utilizes conventional computers and smart devices as hardware, and leverages speech recognition software and generative AI models as software. As a result, users can acquire information on a daily basis that is more emotionally rich and tailored to their needs, thereby improving their quality of life.
[0183] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0184] Step 1:
[0185] The user accesses the application and enters their preferred information genre and desired playback time. This information is received by the server. The input here consists of "information genre" and "desired playback time," and the server prepares to collect information based on this. For example, when the user enters "economic news" as the information genre and "7 AM" as the desired playback time, the server begins collecting information for playback at that time.
[0186] Step 2:
[0187] The server collects relevant information from internet resources and databases based on the received information genre. It generates queries for information collection and performs database access and web crawling. The input is the "information genre," and the output is a "set of related information." For example, it might collect the latest articles related to "economic news."
[0188] Step 3:
[0189] The server analyzes and summarizes the collected relevant information using natural language processing techniques. It uses an information processing engine to summarize the input "set of relevant information" and output simplified "summary information." Specifically, it uses a text summarization algorithm to extract only the most important news points.
[0190] Step 4:
[0191] The device uses an emotion engine to recognize the user's real-time emotional state. It analyzes voice input and interactions to determine the emotional state. The input is "user voice data" and "interaction data," and the output is "the user's emotional state." Specifically, it uses speech recognition and machine learning models to determine the user's emotions from the tone of their voice and interactions.
[0192] Step 5:
[0193] The server generates audio data using a generative AI model based on the analyzed user's emotional state and summary information. The input is "summary information" and "user's emotional state," and the output is "audio data." Specifically, it generates audio using speech synthesis technology with a tone and pace that corresponds to the emotion.
[0194] Step 6:
[0195] The server performs final adjustments to the generated audio data by adding background sounds and jingles. The input here is the "audio data," and the output is the "adjusted audio data." Specifically, it selects background sounds from a pre-prepared audio library and overlays them onto the audio data.
[0196] Step 7:
[0197] The server sends the adjusted audio data to the user's terminal. The input is the "adjusted audio data," and the output is the "transfer of audio data to the user's terminal." The specific operation is the procedure of sending data over a network.
[0198] Step 8:
[0199] The user receives information when the device plays the audio data at the set time. The input is the "audio data received by the user's device," and the output is "audio playback." In terms of operation, the audio plays automatically at the specified time.
[0200] Through this series of steps, the system can provide personalized information tailored to the user's needs and emotions.
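The overlay of a background sound onto the synthesized speech in Step 6 can be sketched at the sample level. The following assumes both signals are already decoded to signed 16-bit PCM samples at the same sample rate; the function name and the default gain are hypothetical.

```python
def mix_background(speech: list[int], background: list[int], gain: float = 0.2) -> list[int]:
    """Overlay a looped, attenuated background track onto 16-bit PCM speech samples."""
    if not background:
        return list(speech)
    mixed = []
    for i, sample in enumerate(speech):
        # Loop the background if it is shorter than the speech.
        value = sample + int(gain * background[i % len(background)])
        # Clip to the signed 16-bit range to avoid wrap-around distortion.
        mixed.append(max(-32768, min(32767, value)))
    return mixed
```

Keeping the gain well below 1.0 leaves the spoken content dominant, which matches the role of the background sound as an enhancement rather than the main signal.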
[0201] (Application Example 2)
[0202] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0203] Conventional voice information systems have struggled to provide information tailored to the user's emotional state, and the information they deliver is often not optimized for the individual. As a result, users sometimes feel that the information they receive is not sufficiently personalized. In particular, current technology lacks a way to provide voice information in a tone appropriate to the user's emotions.
[0204] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0205] In this invention, the server includes means for receiving the type of information of interest and the desired playback time from the user, means for collecting information from data storage devices and network resources based on the received information type, and means for analyzing and summarizing the collected information. This makes it possible to provide personalized audio information that corresponds to the user's emotional state.
[0206] A "user" refers to an individual or organization that uses an information system or device to receive services or data.
[0207] "Type of information of interest" refers to a classification of the categories and fields of information that a user is interested in.
[0208] "Desired playback time" refers to a specific time slot or date when a user wishes to play information or content.
[0209] A "data storage device" refers to a device or medium that stores large amounts of data and allows access and retrieval of it as needed.
[0210] "Network resources" refer to data, information, and services available on the internet and other networks.
[0211] "Methods for analyzing and summarizing information" refers to techniques for analyzing large amounts of information data, extracting key points and important elements, and making them concise.
[0212] "Audio format" refers to a format in which digital or analog data is converted into an audio signal and made playable.
[0213] "User equipment" refers to electronic devices such as computers, smartphones, and tablets that are used directly by the user.
[0214] An "emotion analysis device" refers to a device or software that analyzes a user's emotions and psychological state and makes that data available for use.
[0215] "Adjusting the tone of an audio format" refers to changing the pitch, speed, intonation, etc., of audio data to express specific emotions or moods.
[0216] The system for implementing this invention consists of a server and a user terminal working together. The server receives the type of information of interest and the desired playback time from the user. Based on this, it collects relevant information from data storage devices and network resources. Specifically, it analyzes and summarizes the information using natural language processing technologies such as Google Cloud and IBM Watson. When this summarized information is converted into audio format and sent to the user terminal, the system analyzes the user's emotions using an emotion analysis device such as Microsoft Azure's Emotion API. The tone of the audio format is then adjusted to enable the provision of appropriate information according to the emotion.
[0217] Users receive audio information via their smartphones or tablets, timed to their preferred schedule. This information is personalized according to the user's emotional state, providing a more intuitive and satisfying listening experience. For example, a feature is implemented that plays uplifting audio during the morning commute. The system applies automated learning technology based on user feedback to optimize information delivery. In the future, further personalization using more sophisticated generative AI models is being considered.
[0218] For example, if a user is determined to be in a relaxed mood on a Monday morning, they can receive audio content that includes calming background music along with the day's news and weather information. This allows users to access information optimized for their emotions, making daily information acquisition more comfortable. By applying the prompt, "Use a generative AI model to suggest audio content best suited to the user's emotions," AI-driven data processing becomes possible.
[0219] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0220] Step 1:
[0221] The user enters the type of information of interest and the desired playback time via a terminal. The terminal sends this input to the server. The server uses the type of information of interest and the desired playback time received from the user to prepare to access data storage devices and network resources to retrieve the necessary information.
[0222] Step 2:
[0223] The server collects data related to the type of information of interest specified by the user from data storage devices and network resources. The collected data is analyzed using natural language processing techniques to generate a summary. The input is the user-specified type of information, and the output is the summarized information data.
[0224] Step 3:
[0225] The server converts summarized information into audio format. The input is summarized text information, and the output is audio data. At this time, an emotion analysis device is used to analyze the user's emotional state, and the tone of the audio data is adjusted based on the results. The results of the emotion analysis are used as parameters for adjusting the audio tone.
[0226] Step 4:
[0227] The server sends the adjusted audio data to the user terminal. The input is tone-adjusted audio data, and the output is the transmission of data to the user terminal.
[0228] Step 5:
[0229] The user's device plays the received audio data at the specified playback time. The audio data being played may include background sounds and jingles to enhance the user's listening experience.
[0230] Step 6:
[0231] After listening, users send feedback to the server regarding their emotional state and satisfaction with the information. The server receives this feedback and uses machine learning algorithms to optimize information delivery. The feedback is used as data to improve future information delivery.
[0232] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0233] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0234] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0235] [Second Embodiment]
[0236] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0237] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0238] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0239] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0240] The microphone 238 accepts instructions and other input from the user 20 by receiving the user's voice. The microphone 238 captures the voice of the user 20, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0241] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0242] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0243] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0244] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0245] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0246] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0247] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0248] The system of this invention provides personalized audio information by allowing the user to input a specific information genre and desired playback time. The system's operation is mainly performed by a server and a user terminal, which are described below.
[0249] When a user accesses the application, they select the information genres they are interested in and enter the time they want to listen to the radio each day. These entries define the user's delivery requirements. Once the user enters and submits this information, the device sends the data to the server.
[0250] The server collects information in real time from the internet and existing databases based on specified information genres. By querying diverse information sources such as news sites and APIs, it gathers the latest and most relevant information. The server then summarizes the collected information using natural language processing techniques, extracting the essence of entertainment and news.
[0251] Based on this summarized information, the server generates audio data using speech synthesis technology. The audio data is customized to match the user's language and voice tone, and background sounds and jingles are added to further enhance the listening experience.
[0252] The generated audio data is sent from the server to the user's device. The user's device automatically plays this audio data at a set time, allowing the user to listen to the information hands-free.
[0253] As a concrete example, suppose a user is interested in international news and entertainment and sets their radio to tune in at 7 AM every morning. The server collects the latest information from international news sites and entertainment blogs, creates a summary, and then converts it into an audio radio program. The user's device plays the audio at 7 AM, allowing the user to listen to the latest news during their commute.
[0254] Furthermore, based on user feedback and usage history, the server optimizes the selection of information it provides using machine learning technology. This allows the system to provide information that better matches user preferences over time. In this way, information tailored to user needs can be efficiently delivered.
[0255] The following describes the processing flow.
[0256] Step 1:
[0257] The user launches the application and enters their preferred information genres and desired radio playback time. This information is then sent from the user's device to the server.
[0258] Step 2:
[0259] Based on the genre of information it receives, the server accesses relevant internet resources and databases to collect the latest information.
[0260] Step 3:
[0261] The server analyzes the collected information using natural language processing technology, extracts important information, and creates a summary. Here, it evaluates the relevance and news value of the information and tailors it to the user's interests.
[0262] Step 4:
[0263] The server inputs the generated summary into a speech synthesis system and converts it into audio data according to the specified language and voice tone. Furthermore, it enhances the listening experience by adding background sounds and jingles to the audio data.
[0264] Step 5:
[0265] The server sends the generated audio data to the user's device. The transmission is scheduled according to the playback time set by the user.
[0266] Step 6:
[0267] The device plays audio data at a set time. Users can listen to personalized information through the audio.
[0268] Step 7:
[0269] Users can provide feedback after listening. This feedback is sent from the device to the server and used to improve the information provided.
[0270] Step 8:
[0271] The server accumulates user feedback and past usage history, and applies machine learning to optimize the information collection and delivery process. This continuous learning process ensures that subsequent information is better aligned with the user's needs.
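The scheduling implied by Steps 5 and 6 above (transmission timed to the playback time, then playback at the set time) can be sketched with standard date arithmetic. The function names and the five-minute lead time before playback are assumptions made for illustration.

```python
from datetime import datetime, time, timedelta


def next_playback(now: datetime, playback_time: time) -> datetime:
    """Return the next occurrence of the daily playback time strictly after `now`."""
    candidate = datetime.combine(now.date(), playback_time, tzinfo=now.tzinfo)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed; use tomorrow's
    return candidate


def should_send(now: datetime, playback_time: time, lead: timedelta = timedelta(minutes=5)) -> bool:
    """True when the next playback is close enough that the audio should be sent."""
    return next_playback(now, playback_time) - now <= lead
```

With a playback time of 7 AM, a server polling at 6:56 would transmit the audio, while one polling at 6:30 would wait.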
[0272] (Example 1)
[0273] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0274] In recent years, in our information-saturated society, users are required to efficiently and accurately acquire information that matches their interests and lifestyles. However, conventional information delivery systems have struggled to select and personalize information suitable for users from a vast amount of data. Furthermore, audio-based information lacks the flexibility to be customized to the user's preferences.
[0275] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0276] In this invention, the server includes means for receiving information areas of interest and desired playback time from the user; means for collecting information from a data store and a communication network based on the received information areas; means for using natural language processing technology to process and summarize the collected information; means for converting the summarized information into audio data using speech synthesis technology; and means for adding background sounds and sound effects to the audio data. This enables the user to effectively obtain personalized audio information tailored to their interests and lifestyle at a specified time.
[0277] A "user" is an entity that receives services from an information system that meet their own interests and needs.
[0278] The "information areas of interest" refer to categories of specific topics or content that the user is interested in or concerned about.
[0279] The "desired playback time" indicates the specific time that the user designates for listening to the voice information.
[0280] The "data store" refers to a storage device for storing and managing the collected information and data.
[0281] The "communication network" refers to the infrastructure that enables the exchange of information and data, including the Internet and the like.
[0282] The "natural language processing technology" is a general term for technologies that enable a computer to understand and process human language.
[0283] The "speech synthesis technology" (text-to-speech) refers to the technology that converts text information into voice.
[0284] The "background sounds and sound effects" refer to the sounds used to complement the main voice and improve the listening experience.
[0285] The "automatic learning algorithm" refers to an algorithm that uses machine learning technology to automatically learn patterns and rules from data and make decisions.
[0286] The "generative AI model" refers to an artificial intelligence model trained for a specific purpose and capable of generating text, voice, images, and the like.
[0287] The "prompt text" refers to the instruction text input to the generative AI model to execute a specific task.
[0288] The system of the present invention provides customized voice information based on a specific information field and a desired playback time input by the user. Embodiments of the system are mainly implemented through a server, a terminal, and user operations.
[0289] Users use a device with a dedicated application installed to input their areas of interest (e.g., international news, entertainment) and the time they wish to receive the information (e.g., 7 AM every morning). This information is transmitted to the server via a communication network.
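One possible encoding of this request, assuming it is serialized as JSON before transmission, is sketched below. The field names `areas_of_interest` and `playback_time`, the 24-hour "HH:MM" time format, and the validation rules are hypothetical and are not specified in the source.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybackRequest:
    areas_of_interest: list[str]
    playback_time: str  # assumed "HH:MM" on a 24-hour clock


def encode_request(request: PlaybackRequest) -> str:
    """Terminal side: serialize the user's input as a JSON document."""
    return json.dumps(
        {"areas_of_interest": request.areas_of_interest, "playback_time": request.playback_time}
    )


def parse_request(payload: str) -> PlaybackRequest:
    """Server side: parse and validate the JSON document sent by the terminal."""
    data = json.loads(payload)
    areas = data["areas_of_interest"]
    playback_time = data["playback_time"]
    if not isinstance(areas, list) or not areas or not all(isinstance(a, str) for a in areas):
        raise ValueError("areas_of_interest must be a non-empty list of strings")
    hours, minutes = playback_time.split(":")
    if not (0 <= int(hours) <= 23 and 0 <= int(minutes) <= 59):
        raise ValueError("playback_time must be HH:MM")
    return PlaybackRequest(areas, playback_time)
```

Validating on the server side keeps malformed requests from reaching the collection and synthesis stages.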
[0290] Based on the information received, the server collects information in real time via data stores and the internet. For example, it retrieves the latest information from news site APIs and RSS feeds. Query techniques are used here to efficiently collect data.
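As an illustration of collecting items from an RSS feed and keeping only those that match an area of interest, the following sketch parses an RSS 2.0 document that has already been downloaded. The network fetch is omitted, the feed content and the example.com links are placeholders, and the function name and keyword-matching rule are assumptions.

```python
import xml.etree.ElementTree as ET


def collect_items(rss_xml: str, keyword: str) -> list[dict[str, str]]:
    """Return title and link of RSS items whose title or description mentions `keyword`."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Case-insensitive substring match; a real system would likely use richer queries.
        if keyword.lower() in f"{title} {description}".lower():
            items.append({"title": title, "link": item.findtext("link", default="")})
    return items


# Placeholder feed used only to exercise the function.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Markets rally</title><link>https://example.com/a</link><description>Economy news</description></item>
<item><title>Cup final</title><link>https://example.com/b</link><description>Sports</description></item>
</channel></rss>"""
```

Calling `collect_items(SAMPLE_FEED, "economy")` keeps only the first item, which corresponds to the raw data set handed to the summarization stage.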
[0291] Next, the server processes the collected information using natural language processing techniques. This process uses a generative AI model to summarize the text. A concrete example of a prompt is, "Summarize the given news in three lines." This allows for the extraction of only the necessary information.
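The use of this prompt can be sketched independently of any particular model by passing the model in as a callable. In the sketch below, the `generate` parameter stands for whatever generative AI model is used, and `fake_model` is a stub included only so that the example runs; neither is an API from the source. Because a model may not follow the three-line instruction exactly, the response is trimmed defensively.

```python
from typing import Callable

PROMPT = "Summarize the given news in three lines."


def summarize_in_three_lines(article: str, generate: Callable[[str], str]) -> list[str]:
    """Send the prompt plus the article to a model and keep at most three non-empty lines."""
    response = generate(f"{PROMPT}\n\n{article}")
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[:3]


def fake_model(prompt: str) -> str:
    """Stub standing in for a generative AI model (hypothetical output)."""
    return "Line one.\n\nLine two.\nLine three.\nLine four."
```

Injecting the model as a parameter also makes the summarization step straightforward to test without network access.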
[0292] The summarized information is converted into audio format by the server using speech synthesis technology. This conversion utilizes a speech synthesis engine, and the voice tone is adjusted according to the user's language and preferences. Furthermore, background sounds and sound effects are added to enhance the listening experience.
[0293] The generated audio data is sent from the server to the user's device. The device plays the audio at the specified time, allowing the user to efficiently listen to the information while engaging in their daily activities. For example, it is possible to check the latest international news in audio format while commuting.
[0294] Furthermore, based on user feedback and past usage history, the server applies machine learning algorithms to optimize the selection of information it provides. This enhances its ability to deliver information that better matches the user's interests and preferences over time.
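As one possible illustration of feedback-based optimization, the following sketch keeps a running preference score per genre and blends each new rating into it with an exponential moving average. This is a deliberately simple stand-in chosen for illustration; the disclosure does not specify which machine learning algorithm is used, and the names `update_scores` and `rank_genres`, the rating scale, and the learning rate are assumptions.

```python
def update_scores(scores: dict[str, float], genre: str, rating: float,
                  rate: float = 0.3) -> dict[str, float]:
    """Blend a new rating (0.0 to 1.0) into the genre's running score."""
    if not 0.0 <= rating <= 1.0:
        raise ValueError("rating must be between 0.0 and 1.0")
    previous = scores.get(genre, 0.5)  # unseen genres start at a neutral score
    updated = dict(scores)             # leave the caller's dictionary untouched
    updated[genre] = (1 - rate) * previous + rate * rating
    return updated


def rank_genres(scores: dict[str, float]) -> list[str]:
    """Order genres from most to least preferred."""
    return sorted(scores, key=lambda g: scores[g], reverse=True)
```

With each piece of feedback, genres the user rates highly move up the ranking, so later briefings can favor them.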
[0295] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0296] Step 1:
[0297] Users input their areas of interest and desired playback time through a dedicated application. The user's device structures this input data in JSON format and sends it to the server. The input includes specific information such as "International News" and "Every morning at 7:00".
[0298] Step 2:
[0299] The server parses the received JSON data and collects information from relevant data stores and online sources (e.g., news APIs, RSS feeds) based on areas of interest. The server uses queries to search for information that matches the specified conditions and retrieves the latest data. This process outputs relevant information, such as specific news articles or entertainment news.
[0300] Step 3:
[0301] The server processes the collected information using natural language processing technology. Specifically, it summarizes long texts by prompting a generative AI model with the message, "Summarize the given news in three lines." The output is a summary that focuses on the points important to the user.
[0302] Step 4:
[0303] The server converts the summarized information into audio data using speech synthesis technology. The speech synthesis engine generates speech that matches the language and tone of voice specified by the user, and further enhances the listening experience by adding background music and sound effects. The generated audio file is then output.
[0304] Step 5:
[0305] The server transmits voice data to the user's terminal. For this transmission, data transfer technology via the Internet is used. The user's terminal prepares to start playing this voice data at a specified playback time (e.g., 7:00 every morning) using an alarm or notification.
[0306] Step 6:
[0307] The user's terminal plays the voice data at the set time. The user can listen during commuting or in between housework without performing a special operation. This enables the user to efficiently obtain personalized information.
[0308] Step 7:
[0309] The server collects the user's feedback and past usage history, and applies machine learning algorithms to improve the accuracy and relevance of the information provided. This realizes the provision of information that matches the user's preferences over time.
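The JSON exchange in Steps 1 and 2 can be pictured with the following minimal sketch, in which the terminal encodes the request and the server decodes and validates it. The field names `interests`, `playback_time`, and `repeat`, and the "HH:MM" time format, are hypothetical; the disclosure states only that the input is structured as JSON.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BriefingRequest:
    interests: list[str]   # e.g. ["International News"]
    playback_time: str     # "HH:MM" on a 24-hour clock (assumed format)
    repeat: str = "daily"


def encode_request(request: BriefingRequest) -> str:
    """Terminal side: serialize the request as JSON."""
    return json.dumps(asdict(request), ensure_ascii=False)


def decode_request(payload: str) -> BriefingRequest:
    """Server side: parse the JSON and reject malformed requests."""
    data = json.loads(payload)
    hours, minutes = data["playback_time"].split(":")
    if not (0 <= int(hours) <= 23 and 0 <= int(minutes) <= 59):
        raise ValueError("playback_time must be HH:MM")
    if not data["interests"]:
        raise ValueError("at least one interest is required")
    return BriefingRequest(data["interests"], data["playback_time"],
                           data.get("repeat", "daily"))
```

A request for "International News" at "07:00" survives the round trip unchanged, while a time such as "25:00" is rejected before any collection work begins.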
[0310] (Application Example 1)
[0311] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".
[0312] Because of the overabundance of information, it is difficult for users to efficiently select the information they need and to digest it within the time available. In addition, because information is not sufficiently personalized to the user's interests and needs, users receive a great deal of information that does not interest them. Furthermore, because there is no way to receive information in multiple media formats in a unified manner, the operational effort required of the user increases.
[0313] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0314] In this invention, the server includes means for receiving the user's areas of interest and desired playback time, means for collecting information from information sources, and means for analyzing and summarizing the collected information. This allows the user to select information according to their interests and needs and listen to it efficiently at a specified time.
[0315] A "user" is someone who uses a device to receive information, specifying a particular information genre and desired playback time.
[0316] "Information genres of interest" refers to categories of information that users are particularly interested in and wish to collect or share.
[0317] "Desired playback time" refers to the specific time when the user wishes to have the information played back.
[0318] "Information sources" refer to resources such as databases, websites, and APIs that provide information available on the internet.
[0319] "Means of information gathering" refer to the methods and techniques for extracting necessary information from information sources.
[0320] "Methods for analysis and summarization" refer to techniques for processing collected information, extracting its key points, and presenting them concisely.
[0321] "Audio data" refers to data that has been converted into audio format from summarized information intended to be provided to users.
[0322] "Background sound" refers to additional music or sound effects played alongside audio data, designed to enhance the listening experience.
[0323] A "user terminal" is an electronic device used to receive and play audio data.
[0324] "Means of playback" refers to technologies that enable users to listen to audio data on their devices.
[0325] "Feedback" refers to information and opinions provided by users, which are used to improve the services provided.
[0326] A "learning algorithm" is a computational method for optimizing information delivery based on feedback and historical data.
[0327] "Multiple media formats" refers to providing information in various formats, including not only audio, but also text, images, and more.
[0328] To implement this invention, a system is used in which a server and a user terminal cooperate. When a user uses a smartphone or other device to input and submit their preferred information genre and desired playback time, the server accepts this input and begins processing it.
[0329] The server collects necessary information from existing databases and various information sources on the internet. Specifically, it accesses news databases and websites such as blogs via APIs to retrieve content related to a specified genre. Next, it uses natural language processing software (e.g., NLTK or spaCy) running on the server side to analyze and summarize the collected information. This extracts the essence of the information and presents it in a format that is easy for the user to understand.
[0330] Next, the summarized information is converted into audio data using speech synthesis technology, such as Amazon's speech synthesis service. Background sounds may be added to this audio data to enhance listening comfort. The generated audio data is sent from the server to the user's device and automatically played at a set time. This allows users to obtain necessary information hands-free in their daily lives.
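Automatic playback at a set time requires the terminal to work out when the next playback is due. The following sketch shows one way to compute this for a daily schedule. It assumes naive local datetimes with no time zone handling, and the function name `next_playback` is hypothetical.

```python
from datetime import datetime, time, timedelta


def next_playback(now: datetime, playback_time: time) -> datetime:
    """Return the next occurrence of playback_time strictly after now."""
    candidate = datetime.combine(now.date(), playback_time)
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

For instance, at 6:30 on a given day a playback time of 8:00 resolves to the same day, whereas at 8:00 or later it resolves to the following day. The terminal could pass the result to whatever alarm or notification facility its platform provides.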
[0331] The system collects user feedback and past usage history, and uses a learning algorithm to optimize the information it provides. This allows for more personalized information delivery over time.
[0332] For example, if a user requests "technical news" and "8 AM every morning," a summary audio of the latest technology-related news will automatically play at that time every morning. An example of a prompt might be, "Generate a summary of technical news and output it in podcast format. The playback time should be 10 minutes." This system allows users to efficiently and continuously obtain the latest information tailored to their interests.
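Because the example prompt specifies a playback time, it may help to translate that time into an approximate word count that the model can follow more easily. The sketch below does this under the assumption of a speaking rate of 150 words per minute; that rate, and the function names, are illustrative assumptions and do not come from this disclosure.

```python
def word_budget(minutes: float, words_per_minute: int = 150) -> int:
    """Estimate how many words fit into the requested playback time."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return int(minutes * words_per_minute)


def build_podcast_prompt(genre: str, minutes: float,
                         words_per_minute: int = 150) -> str:
    """Compose a podcast-style prompt with a duration and a word budget."""
    budget = word_budget(minutes, words_per_minute)
    return (f"Generate a summary of {genre} and output it in podcast format. "
            f"The playback time should be {minutes:g} minutes "
            f"(about {budget} words).")
```

With these assumptions, a 10-minute request yields a budget of 1500 words. The actual speaking rate depends on the language and on the speech synthesis voice, so the rate would need to be calibrated in practice.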
[0333] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0334] Step 1:
[0335] Users access the application using their smartphones or other devices and enter their preferred information genres and desired playback times. The application retrieves the entered data and sends it to the server. This input includes information such as "technical news" and "every morning at 8 AM," which serves as the basic data for subsequent processing.
[0336] Step 2:
[0337] The server collects appropriate information from its database of information sources and the internet based on the information genre received from the user. The server uses an API to retrieve information of the specified genre from news sites and specialized blogs. Here, the input is the information genre requested by the user, and the output is a raw data set related to that genre.
[0338] Step 3:
[0339] The server analyzes and summarizes the collected information using natural language processing software. Specifically, it uses natural language processing tools (e.g., spaCy) to analyze text and extract and compress important elements. The input is raw information data, and the output is summarized information provided to the user.
[0340] Step 4:
[0341] Based on the summarized information, the server generates audio data using speech synthesis technology. The summarized data is input into speech synthesis software (e.g., Amazon Polly) and converted into a speech format that closely resembles a human voice. Background sounds are added to this generated audio data as needed. The output is the final audio file.
[0342] Step 5:
[0343] The generated audio data is sent from the server to the user's terminal. The terminal is configured to automatically play the audio data at the specified playback time. The output here is the audio information that the user hears on the terminal.
[0344] Step 6:
[0345] Users provide feedback after listening. The server receives this feedback and applies machine learning algorithms to optimize future information delivery to better suit the user's preferences. The input is the user's feedback and behavioral history, and the output is a refined information delivery method.
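The inputs and outputs named in Steps 2 to 4 suggest a pipeline of three stages. The following sketch wires such stages together using caller-supplied functions. The stage names and the stand-in implementations are hypothetical; real stages would call a news API, a summarization tool, and a speech synthesis service respectively.

```python
from typing import Callable, NamedTuple


class Stages(NamedTuple):
    collect: Callable[[str], list[str]]     # genre -> raw articles
    summarize: Callable[[list[str]], str]   # raw articles -> summary text
    synthesize: Callable[[str], bytes]      # summary text -> audio bytes


def run_pipeline(genre: str, stages: Stages) -> bytes:
    """Run collection, summarization, and synthesis in order."""
    raw = stages.collect(genre)
    if not raw:
        raise LookupError(f"no articles found for genre {genre!r}")
    summary = stages.summarize(raw)
    return stages.synthesize(summary)


# Stand-in stages so the sketch runs without external services.
demo = Stages(
    collect=lambda genre: [f"{genre}: article one.", f"{genre}: article two."],
    summarize=lambda articles: " ".join(a.split(":")[1].strip() for a in articles),
    synthesize=lambda text: text.encode("utf-8"),  # placeholder for real audio
)
```

Keeping each stage behind a plain function signature means that one stage, for example the speech synthesis service, can be replaced without changing the others.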
[0346] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0347] This invention is a system for providing personalized voice information that takes user emotions into consideration. By incorporating an emotion engine, this system enables information provision based on the user's emotional state.
[0348] First, the user accesses the application and enters their preferred information genres and desired playback time. The user can also store their past preference and sentiment history in a database to further personalize the system. This allows for more refined information delivery.
[0349] The server receives information sent by the user and gathers information from internet resources and databases related to the specified information genre. The collected information is analyzed and summarized using natural language processing technology. Meanwhile, an emotion engine built into the terminal recognizes the user's emotions in real time through their voice and interactions and sends that data to the server.
[0350] The server uses emotion engine data to adjust the tone of the audio data to match the user's current emotions. For example, if the server determines that the user is stressed, it will adjust the audio to provide relaxing information and select a more calming tone. Background sounds and jingles are added to this audio data to further enhance the listening experience.
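One simple way to picture this adjustment is a lookup from the estimated emotion to a set of voice parameters. In the sketch below, the emotion labels, the parameter values, and the background track labels are all hypothetical presets; the disclosure does not give concrete values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceStyle:
    rate: float      # 1.0 = normal speaking speed
    pitch: float     # semitone offset from the default voice
    background: str  # label of a background track


# Hypothetical presets chosen for illustration only.
STYLES = {
    "stressed": VoiceStyle(rate=0.9, pitch=-1.0, background="calm"),
    "relaxed": VoiceStyle(rate=1.0, pitch=0.0, background="soft"),
    "energetic": VoiceStyle(rate=1.1, pitch=1.0, background="upbeat"),
}
NEUTRAL = VoiceStyle(rate=1.0, pitch=0.0, background="none")


def style_for(emotion: str) -> VoiceStyle:
    """Pick the preset for an emotion label, falling back to a neutral style."""
    return STYLES.get(emotion.strip().lower(), NEUTRAL)
```

A user estimated to be stressed would thus receive slightly slower, lower-pitched speech over a calm background, while an unrecognized label falls back to the neutral style instead of failing.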
[0351] As a concrete example, suppose a user enjoys sports news and typically wants to refresh themselves during their morning commute. The server collects sports news, and an emotion engine analyzes the user's mood for the day, generating audio in a relaxed tone. The generated audio is sent to the user's device and automatically plays at a set time. The user can then listen as they would to a radio program while receiving information optimized for their mood that day.
[0352] In this way, the system can adapt to the user's emotions and preferences, providing more personalized voice information. This enables users to make better use of their time in daily life and improve the quality of information they receive.
[0353] The following describes the processing flow.
[0354] Step 1:
[0355] The user launches the application and enters their preferred information genres, desired playback time, and current emotional state. Past preference history is updated as needed.
[0356] Step 2:
[0357] The emotion engine built into the device recognizes the user's emotional state in real time through their voice and facial expressions, and sends that data to the server.
[0358] Step 3:
[0359] Based on the received information genre and sentiment data, the server queries relevant internet resources and databases to collect the latest information.
[0360] Step 4:
[0361] The server analyzes the collected information using natural language processing technology, extracting and summarizing important and relevant information. In doing so, it takes into account the user's emotional state.
[0362] Step 5:
[0363] The server adjusts the tone and content of the audio data to match the analyzed emotional data. For example, if the user wants to relax, it will choose a calm tone and an interesting topic.
[0364] Step 6:
[0365] The server uses speech synthesis technology to convert the summarized information into audio data, and adds background sounds and jingles to enhance the listening experience.
[0366] Step 7:
[0367] The server sends the generated audio data to the user's device. The audio data becomes available for playback according to the user's set playback time.
[0368] Step 8:
[0369] The device automatically plays audio data at a specified time. The user listens to information optimized for their emotional state and interests.
[0370] Step 9:
[0371] Users provide feedback via their device after listening. This feedback is sent to the server and used to optimize future information delivery.
[0372] Step 10:
[0373] The server accumulates user feedback and past usage history, and applies machine learning to further optimize its information delivery strategy. These updates are made continuously, so that delivery becomes increasingly personalized.
[0374] (Example 2)
[0375] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0376] When providing information to users, simply broadcasting standardized information via audio presents a challenge in adequately satisfying the emotional needs and preferences of individual users. Furthermore, improving the quality of information provided and the user experience requires considering the user's individual emotional state and past preference history, but there is a lack of systems that can appropriately reflect this.
[0377] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0378] In this invention, the server includes means for receiving information genres and desired playback times from the user, means for collecting information from information resources, and means for analyzing and summarizing the collected information using natural language processing technology. This makes it possible to provide personalized audio information tailored to the user's emotional state.
[0379] "Information genre" refers to the specific types or categories of information that users are interested in.
[0380] "Desired playback time" refers to the time period or duration during which the user wishes to receive the information.
[0381] "Information resources" refers to a collection of internet resources and databases that are accessed to obtain information.
[0382] "Natural language processing technology" refers to methods and techniques that enable computers to understand and process the language that humans use on a daily basis.
[0383] A "generative AI model" refers to algorithms and technologies that use artificial intelligence to generate text, speech, and other data.
[0384] An "emotion engine" refers to a technology or system that recognizes a user's emotional state and performs processing according to that state.
[0385] "Machine learning" refers to the technology of learning patterns from data to automate or improve specific tasks.
[0386] "Tone adjustment" refers to modifying the sound quality and expression of audio data to match the user's emotions and preferences.
[0387] This system aims to provide personalized audio information based on the user's emotional state and preferences. First, the user enters their preferred information genres and desired playback time through the application. This information is received by the server. The server utilizes databases and the internet to collect content from information resources related to the specified genres. The collected information is analyzed and summarized using natural language processing techniques. Specifically, text summarization models and topic modeling are used to concisely summarize vast amounts of information.
[0388] The device incorporates an emotion engine that recognizes emotions in real time through the user's voice and interactions. The recognized emotion data is sent to a server and used to generate audio data. The server uses this emotion data and collected information to create audio data using a generative AI model. The tone and pace of the voice are adjusted to suit the user's emotional state. Furthermore, background sounds and jingles are added to the audio data to enhance the listening experience.
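Where the speech synthesis engine accepts SSML (Speech Synthesis Markup Language), adjusting the tone and pace can be expressed by wrapping the text in a `prosody` element. The sketch below generates such markup using the keyword values defined by SSML for `rate` and `pitch`. Whether the engine used in this system accepts SSML, and which attributes it honors, is an assumption; the function name `to_ssml` is hypothetical.

```python
from xml.sax.saxutils import escape

# Keyword values defined by SSML; engine support for each attribute may vary.
RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}
PITCHES = {"x-low", "low", "medium", "high", "x-high"}


def to_ssml(text: str, rate: str = "medium", pitch: str = "medium") -> str:
    """Wrap text in an SSML prosody element with the given rate and pitch."""
    if rate not in RATES:
        raise ValueError(f"unsupported rate: {rate}")
    if pitch not in PITCHES:
        raise ValueError(f"unsupported pitch: {pitch}")
    # Escape &, <, and > so that the summary text cannot break the markup.
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f"{escape(text)}</prosody></speak>")
```

For a user in a relaxed mode, the server might call `to_ssml(summary, rate="slow", pitch="low")` and pass the result to the speech synthesis engine.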
[0389] The final audio data is sent to the user's device and automatically played at the specified time. For example, if a user requests to "listen to the morning news for 10 minutes," the server will collect the latest news, and if the emotion engine determines that the user is in relaxation mode, it will generate audio in a calm tone. Prompts may include phrases such as, "Generate an appropriate audio tone when the user is seeking relaxation," or "Create optimal audio information based on the user's past preference history and current mood."
[0390] This system utilizes conventional computers and smart devices as hardware, and leverages speech recognition software and generative AI models as software. As a result, users can acquire information on a daily basis that is more emotionally rich and tailored to their needs, thereby improving their quality of life.
[0391] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0392] Step 1:
[0393] The user accesses the application and enters their preferred information genre and desired playback time. This information is received by the server. The input here consists of "information genre" and "desired playback time," and the server prepares to collect information based on this. Specifically, by entering "economic news" as the information genre and "7 AM" as the desired playback time, information collection for that time period will begin.
[0394] Step 2:
[0395] The server collects relevant information from internet resources and databases based on the received information genre. It generates queries for information collection and performs database access and web crawling. The input is the "information genre," and the output is a "set of related information." For example, it might collect the latest articles related to "economic news."
[0396] Step 3:
[0397] The server analyzes and summarizes the collected relevant information using natural language processing techniques. It uses an information processing engine to summarize the input "set of relevant information" and output simplified "summary information." Specifically, it uses a text summarization algorithm to extract only the most important news points.
[0398] Step 4:
[0399] The device uses an emotion engine to recognize the user's real-time emotional state. It analyzes voice input and interactions to determine the emotional state. The input is "user voice data" and "interaction data," and the output is "the user's emotional state." Specifically, it uses speech recognition and machine learning models to determine the user's emotions from the tone of their voice and interactions.
[0400] Step 5:
[0401] The server generates audio data using a generative AI model based on the analyzed user's emotional state and summary information. The input is "summary information" and "user's emotional state," and the output is "audio data." Specifically, it generates audio using speech synthesis technology with a tone and pace that corresponds to the emotion.
[0402] Step 6:
[0403] The server performs final adjustments to the generated audio data by adding background sounds and jingles. The input here is the "audio data," and the output is the "adjusted audio data." Specifically, it selects background sounds from a pre-prepared audio library and overlays them onto the audio data.
[0404] Step 7:
[0405] The server sends the adjusted audio data to the user's terminal. The input is the "adjusted audio data," and the output is the "transfer of audio data to the user's terminal." The specific operation is the procedure of sending data over a network.
[0406] Step 8:
[0407] The user receives information by playing the audio data at the time set on their device. The input is the "audio data transferred to the user's device," and the output is "audio playback." In terms of operation, the audio plays automatically at the specified time.
[0408] Through this series of steps, the system can provide personalized information tailored to the user's needs and emotions.
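The overlay of background sound described in Step 6 can be illustrated with the following sketch, which mixes an attenuated, looped background track into the voice samples. It assumes that both signals are lists of floating-point samples in the range -1.0 to 1.0 at the same sample rate; the function name `mix_background` and the default gain are hypothetical.

```python
def mix_background(voice: list[float], background: list[float],
                   gain: float = 0.2) -> list[float]:
    """Overlay a looped, attenuated background track onto voice samples.

    The result has the same length as the voice track and is clipped
    to the range -1.0 to 1.0.
    """
    if not background:
        return list(voice)
    mixed = []
    for i, sample in enumerate(voice):
        # Loop the background so that it covers the whole voice track.
        value = sample + gain * background[i % len(background)]
        mixed.append(max(-1.0, min(1.0, value)))
    return mixed
```

A low gain keeps the narration intelligible. Hard clipping is the simplest safeguard against overflow; a production mixer would more likely normalize or compress the signal.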
[0409] (Application Example 2)
[0410] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0411] Conventional voice information systems have struggled to provide information tailored to the user's emotional state, so the information delivered is often not optimized for the individual user. As a result, users sometimes feel that the information they receive is not sufficiently personalized. In particular, current technology lacks a way to deliver voice information in a tone appropriate to the user's emotions.
[0412] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0413] In this invention, the server includes means for receiving the type of information of interest and the desired playback time from the user, means for collecting information from data storage devices and network resources based on the received information type, and means for analyzing and summarizing the collected information. This makes it possible to provide personalized audio information that corresponds to the user's emotional state.
[0414] A "user" refers to an individual or organization that uses an information system or device to receive services or data.
[0415] "Types of Interested Information" refers to a classification of the categories and fields of information that a user is interested in.
[0416] "Desired playback time" refers to a specific time slot or date when a user wishes to play information or content.
[0417] A "data storage device" refers to a device or medium that stores large amounts of data and allows access and retrieval of it as needed.
[0418] "Network resources" refer to data, information, and services available on the internet and other networks.
[0419] "Methods for analyzing and summarizing information" refers to techniques for analyzing large amounts of information data, extracting key points and important elements, and making them concise.
[0420] "Audio format" refers to a format in which digital or analog data is converted into an audio signal and made playable.
[0421] "User equipment" refers to electronic devices such as computers, smartphones, and tablets that are used directly by the user.
[0422] An "emotion analysis device" refers to a device or software that analyzes a user's emotions and psychological state and makes that data available for use.
[0423] "Adjusting the tone of an audio format" refers to changing the pitch, speed, intonation, etc., of audio data to express specific emotions or moods.
[0424] The system for implementing this invention consists of a server and a user terminal working together. The server receives the type of information of interest and the desired playback time from the user. Based on this, it collects relevant information from data storage devices and network resources. Specifically, it analyzes and summarizes the information using natural language processing services such as those of Google Cloud and IBM Watson. When this summarized information is converted into audio format for transmission to the user terminal, the system analyzes the user's emotions using an emotion analysis device such as Microsoft Azure's Emotion API. The tone of the audio is then adjusted so that the information is delivered in a manner appropriate to the emotion.
[0425] Users receive audio information via their smartphones or tablets, timed to their preferred schedule. This information is personalized according to the user's emotional state, providing a more intuitive and satisfying listening experience. For example, a feature is implemented that plays uplifting audio during the morning commute. The system applies automated learning technology based on user feedback to optimize information delivery. In the future, further personalization using more sophisticated generative AI models is being considered.
[0426] For example, if a user is determined to be in a relaxed mood on a Monday morning, they can receive audio content that includes calming background music along with the day's news and weather information. This allows users to access information optimized for their emotions, making daily information acquisition more comfortable. By applying the prompt, "Use a generative AI model to suggest audio content best suited to the user's emotions," AI-driven data processing becomes possible.
[0427] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0428] Step 1:
[0429] The user enters the type of information of interest and the desired playback time via a terminal. The terminal sends this input to the server. Using the type of information of interest and the desired playback time received from the user, the server prepares to access data storage devices and network resources to retrieve the necessary information.
[0430] Step 2:
[0431] The server collects data related to the type of information of interest specified by the user from data storage devices and network resources. The collected data is analyzed using natural language processing techniques to generate a summary. The input is the user-specified type of information, and the output is the summarized information data.
[0432] Step 3:
[0433] The server converts summarized information into audio format. The input is summarized text information, and the output is audio data. At this time, an emotion analysis device is used to analyze the user's emotional state, and the tone of the audio data is adjusted based on the results. The results of the emotion analysis are used as parameters for adjusting the audio tone.
[0434] Step 4:
[0435] The server sends the adjusted audio data to the user terminal. The input is tone-adjusted audio data, and the output is the transmission of data to the user terminal.
[0436] Step 5:
[0437] The user's device plays the received audio data at the specified playback time. The audio data being played may include background sounds and jingles to enhance the user's listening experience.
[0438] Step 6:
[0439] After listening, users send feedback to the server regarding their emotional state and satisfaction with the information. The server receives this feedback and uses machine learning algorithms to optimize information delivery. The feedback is used as data to improve future information delivery.
[0440] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0441] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0442] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0443] [Third Embodiment]
[0444] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0445] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0446] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0447] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0448] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0449] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0450] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0451] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0452] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0453] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0454] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0455] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0456] The system of this invention provides personalized audio information by allowing the user to input a specific information genre and desired playback time. The system's operation is mainly performed by a server and a user terminal, which are described below.
[0457] When a user accesses the application, they select the information genres they are interested in and enter the time at which they want to listen to the radio program each day. The requirements are thereby fixed according to the user's specifications. Once the user submits this information, the terminal sends the data to the server.
[0458] The server collects information in real time from the internet and existing databases based on specified information genres. By querying diverse information sources such as news sites and APIs, it gathers the latest and most relevant information. The server then summarizes the collected information using natural language processing techniques, extracting the essence of entertainment and news.
[0459] Based on this summarized information, the server generates audio data using speech synthesis technology. The audio data is customized to match the user's language and voice tone, and background sounds and jingles are added to further enhance the listening experience.
[0460] The generated audio data is sent from the server to the user's device. The user's device automatically plays this audio data at a set time, allowing the user to listen to the information hands-free.
[0461] As a concrete example, suppose a user is interested in international news and entertainment and sets the radio program to play at 7 AM every morning. The server collects the latest information from international news sites and entertainment blogs, creates a summary, and then converts it into an audio radio program. The user's device plays the audio at 7 AM, allowing the user to listen to the latest news during their commute.
[0462] Furthermore, based on user feedback and usage history, the server optimizes the selection of information it provides using machine learning technology. This allows the system to provide information that better matches user preferences over time. In this way, information tailored to user needs can be efficiently delivered.
[0463] The following describes the processing flow.
[0464] Step 1:
[0465] The user launches the application and enters their preferred information genres and desired radio playback time. This information is then sent from the user's device to the server.
[0466] Step 2:
[0467] Based on the genre of information it receives, the server accesses relevant internet resources and databases to collect the latest information.
[0468] Step 3:
[0469] The server analyzes the collected information using natural language processing technology, extracts important information, and creates a summary. Here, it evaluates the relevance and news value of the information and tailors it to the user's interests.
[0470] Step 4:
[0471] The server inputs the generated summary into a speech synthesis system and converts it into audio data according to the specified language and voice tone. Furthermore, it enhances the listening experience by adding background sounds and jingles to the audio data.
[0472] Step 5:
[0473] The server sends the generated audio data to the user's device. The transmission is scheduled according to the playback time set by the user.
[0474] Step 6:
[0475] The device plays audio data at a set time. Users can listen to personalized information through the audio.
[0476] Step 7:
[0477] Users can provide feedback after listening. This feedback is sent from the device to the server and used to improve the information provided.
[0478] Step 8:
[0479] The server accumulates user feedback and past usage history, and applies machine learning to optimize the information collection and delivery process. This continuous learning process ensures that subsequent information is better aligned with the user's needs.
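The server-side portion of the steps above (Steps 2 to 5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `RadioRequest`, `next_playback`, and `build_program` are hypothetical, and the collection, summarization, and speech synthesis stages are passed in as placeholder callables because the disclosure does not fix a particular engine for any of them.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Callable, List


@dataclass
class RadioRequest:
    """Hypothetical container for the Step 1 user input."""
    genres: List[str]
    playback_time: time  # daily playback time chosen by the user


def next_playback(now: datetime, playback_time: time) -> datetime:
    """Return the next occurrence of the daily playback time (Step 5 scheduling)."""
    candidate = datetime.combine(now.date(), playback_time)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate


def build_program(
    request: RadioRequest,
    collect: Callable[[str], List[str]],
    summarize: Callable[[List[str]], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run Steps 2 to 4: collect per genre, summarize, convert to audio."""
    articles: List[str] = []
    for genre in request.genres:
        articles.extend(collect(genre))
    summary = summarize(articles)
    return synthesize(summary)
```

Passing the three stages as callables keeps the sketch independent of any specific news source, language model, or speech engine; each can be replaced by a stub for testing.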
[0480] (Example 1)
[0481] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0482] In recent years, in our information-saturated society, users are required to efficiently and accurately acquire information that matches their interests and lifestyles. However, conventional information delivery systems have struggled to select and personalize information suitable for users from a vast amount of data. Furthermore, audio-based information lacks the flexibility to be customized to the user's preferences.
[0483] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0484] In this invention, the server includes means for receiving information areas of interest and desired playback time from the user; means for collecting information from a data store and a communication network based on the received information areas; means for using natural language processing technology to process and summarize the collected information; means for converting the summarized information into audio data using speech synthesis technology; and means for adding background sounds and sound effects to the audio data. This enables the user to effectively obtain personalized audio information tailored to their interests and lifestyle at a specified time.
[0485] A "user" is an entity that receives services from an information system that meet their own interests and needs.
[0486] "Areas of interest" refers to categories of specific topics or content that a user is interested in.
[0487] "Desired playback time" refers to the specific time period that the user specifies for listening to the audio information.
[0488] A "data store" refers to a storage device used to store and manage collected information and data.
[0489] A "communication network" refers to the infrastructure that enables the exchange of information and data, and includes the internet.
[0490] "Natural language processing technology" is a general term for technologies that enable computers to understand and process human language.
[0491] "Speech synthesis technology" refers to the technology that converts text information into speech.
[0492] "Background sounds and sound effects" refer to sounds used to complement the main audio and improve the listening experience.
[0493] An "automated learning algorithm" refers to a process that uses machine learning techniques to automatically learn patterns and rules from data and make decisions based on those patterns.
[0494] A "generative AI model" refers to an artificial intelligence model that is trained for a specific purpose and generates text, audio, images, and other data.
[0495] A "prompt statement" refers to an instruction given to a generative AI model to perform a specific task.
[0496] The system of the present invention provides customized audio information by allowing the user to input a specific information field and desired playback time. The system's implementation is primarily carried out through the operation of a server, a terminal, and the user.
[0497] Users use a device with a dedicated application installed to input their areas of interest (e.g., international news, entertainment) and the time they wish to receive the information (e.g., 7 AM every morning). This information is transmitted to the server via a communication network.
[0498] Based on the information received, the server collects information in real time via data stores and the internet. For example, it retrieves the latest information from news site APIs and RSS feeds. Query techniques are used here to efficiently collect data.
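As an illustration of retrieving items from an RSS feed, the following sketch parses an RSS 2.0 document with the Python standard library. The function name `parse_rss_items`, the sample feed, and the `example.com` URLs are assumptions made for this example; fetching the feed over the network is omitted, and the disclosure does not specify a feed format or parser.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_rss_items(rss_xml: str, limit: int = 5) -> List[Dict[str, str]]:
    """Extract title, link, and description from RSS 2.0 <item> elements."""
    root = ET.fromstring(rss_xml)
    items: List[Dict[str, str]] = []
    for item in root.iter("item"):
        if len(items) >= limit:
            break
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items


# Hypothetical feed content used only to exercise the parser.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example feed</title>
<item><title>Headline A</title><link>https://example.com/a</link><description>Body A</description></item>
<item><title>Headline B</title><link>https://example.com/b</link><description>Body B</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_FEED):
        print(entry["title"], entry["link"])
```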
[0499] Next, the server processes the collected information using natural language processing techniques. This process uses a generative AI model to summarize the text. A concrete example of a prompt is, "Summarize the given news in three lines." This allows for the extraction of only the necessary information.
[0500] The summarized information is converted into audio format by the server using speech synthesis technology. This conversion utilizes a speech synthesis engine, and the voice tone is adjusted according to the user's language and preferences. Furthermore, background sounds and sound effects are added to enhance the listening experience.
[0501] The generated audio data is sent from the server to the user's device. The device plays the audio at the specified time, allowing the user to efficiently listen to the information while engaging in their daily activities. For example, it is possible to check the latest international news in audio format while commuting.
[0502] Furthermore, based on user feedback and past usage history, the server applies machine learning algorithms to optimize the selection of information it provides. This enhances its ability to deliver information that better matches the user's interests and preferences over time.
[0503] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0504] Step 1:
[0505] Users input their areas of interest and desired playback time through a dedicated application. The user's device structures this input data in JSON format and sends it to the server. The input includes specific information such as "International News" and "Every morning at 7:00".
[0506] Step 2:
[0507] The server parses the received JSON data and collects information from relevant data stores and online sources (e.g., news APIs, RSS feeds) based on areas of interest. The server uses queries to search for information that matches the specified conditions and retrieves the latest data. This process outputs relevant information, such as specific news articles or entertainment news.
[0508] Step 3:
[0509] The server processes the collected information using natural language processing technology. Specifically, it summarizes long texts by prompting a generative AI model with the message, "Summarize the given news in three lines." The output is a summary that focuses on the points important to the user.
[0510] Step 4:
[0511] The server converts the summarized information into audio data using speech synthesis technology. The speech synthesis engine generates speech that matches the language and tone of voice specified by the user, and further enhances the listening experience by adding background music and sound effects. The generated audio file is then output.
[0512] Step 5:
[0513] The server sends audio data to the user's device. This transmission uses data transfer technology over the internet. The user's device prepares to start playing this audio data at a specified playback time (e.g., 7 AM every morning) using an alarm or notification.
[0514] Step 6:
[0515] The user's device plays audio data at the set time. Users can listen to it during their commute or while doing household chores without any special operation. This allows users to efficiently acquire personalized information.
[0516] Step 7:
[0517] The server collects user feedback and past usage history, and applies machine learning algorithms to improve the accuracy and relevance of the information it provides. This enables the provision of information tailored to the user's preferences over time.
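Steps 1 to 3 above mention a JSON-structured request and a summarization prompt. The following sketch shows one possible shape for both. The field names `interests` and `playback_time`, the `HH:MM` time string, and the function names are hypothetical; only the prompt sentence is taken from the text.

```python
import json
from typing import Any, Dict, Iterable

# Prompt sentence quoted in the description of Step 3.
SUMMARY_INSTRUCTION = "Summarize the given news in three lines."


def build_request_json(interests: Iterable[str], playback_time: str) -> str:
    """Terminal side (Step 1): serialize the user's input as JSON."""
    payload = {"interests": list(interests), "playback_time": playback_time}
    return json.dumps(payload, ensure_ascii=False)


def parse_request_json(payload: str) -> Dict[str, Any]:
    """Server side (Step 2): parse and minimally validate the JSON."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("request must be a JSON object")
    interests = data.get("interests")
    if not isinstance(interests, list) or not interests:
        raise ValueError("'interests' must be a non-empty list")
    if not isinstance(data.get("playback_time"), str):
        raise ValueError("'playback_time' must be a string")
    return data


def build_summary_prompt(article_text: str) -> str:
    """Step 3: combine the quoted instruction with the article text."""
    return f"{SUMMARY_INSTRUCTION}\n\n{article_text.strip()}"
```

The resulting prompt string would be passed to whichever generative AI model the server uses; that call is left out because the disclosure does not name a model or API.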
[0518] (Application Example 1)
[0519] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0520] Due to information overload, users face the challenge of not being able to efficiently select the information they need and process it at their preferred time. Furthermore, the lack of sufficient personalized information tailored to users' interests and needs means they often receive a large amount of irrelevant information. Additionally, the lack of a centralized method for receiving information across multiple media formats increases the effort required for users.
[0521] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0522] In this invention, the server includes means for receiving the user's areas of interest and desired playback time, means for collecting information from information sources, and means for analyzing and summarizing the collected information. This allows the user to select information according to their interests and needs and listen to it efficiently at a specified time.
[0523] A "user" is someone who uses a device to receive information, specifying a particular information genre and desired playback time.
[0524] "Information genres of interest" refer to categories of information that users are particularly interested in and wish to collect or share.
[0525] "Desired playback time" refers to the specific time when the user wishes to have the information played back.
[0526] "Information sources" refer to resources such as databases, websites, and APIs that provide information available on the internet.
[0527] "Means of information gathering" refer to the methods and techniques for extracting necessary information from information sources.
[0528] "Methods for analysis and summarization" refer to techniques for processing collected information, extracting its key points, and presenting them concisely.
[0529] "Audio data" refers to data that has been converted into audio format from summarized information intended to be provided to users.
[0530] "Background sound" refers to additional music or sound effects played alongside audio data, designed to enhance the listening experience.
[0531] A "user terminal" is an electronic device used to receive and play audio data.
[0532] "Means of playback" refers to technologies that enable users to listen to audio data on their devices.
[0533] "Feedback" refers to information and opinions provided by users, which are used to improve the services provided.
[0534] A "learning algorithm" is a computational method for optimizing information delivery based on feedback and historical data.
[0535] "Multiple media formats" refers to providing information in various formats, including not only audio, but also text, images, and more.
[0536] To implement this invention, a system is used in which a server and a user terminal cooperate. When a user uses a smartphone or other device to input and submit their preferred information genre and desired playback time, the server accepts this input and begins processing it.
[0537] The server collects necessary information from existing databases and various information sources on the internet. Specifically, it accesses news databases and websites such as blogs via APIs to retrieve content related to a specified genre. Next, it uses natural language processing software (e.g., NLTK or spaCy) running on the server side to analyze and summarize the collected information. This extracts the essence of the information and presents it in a format that is easy for the user to understand.
[0538] Next, the summarized information is converted into audio data using speech synthesis technology, such as Amazon's speech synthesis service. Background sounds may be added to this audio data to enhance listening comfort. The generated audio data is sent from the server to the user's device and automatically played at a set time. This allows users to obtain necessary information hands-free in their daily lives.
[0539] The system collects user feedback and past usage history, and uses a learning algorithm to optimize the information it provides. This allows for more personalized information delivery over time.
[0540] For example, if a user requests "technical news" and "8 AM every morning," a summary audio of the latest technology-related news will automatically play at that time every morning. An example of a prompt might be, "Generate a summary of technical news and output it in podcast format. The playback time should be 10 minutes." This system allows users to efficiently and continuously obtain the latest information tailored to their interests.
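A prompt of the kind quoted above can be assembled from the genre and the requested playback duration. In the sketch below, the function names are hypothetical, and the speaking rate of 150 words per minute is an assumed round figure used only to turn minutes into an approximate word budget; an actual speech engine and language would need its own value.

```python
def word_budget(playback_minutes: float, words_per_minute: int = 150) -> int:
    """Approximate number of words that fit in the requested playback time."""
    if playback_minutes <= 0 or words_per_minute <= 0:
        raise ValueError("playback_minutes and words_per_minute must be positive")
    return int(playback_minutes * words_per_minute)


def build_podcast_prompt(genre: str, playback_minutes: float) -> str:
    """Build a prompt modeled on the example sentence in the text."""
    budget = word_budget(playback_minutes)
    return (
        f"Generate a summary of {genre} and output it in podcast format. "
        f"The playback time should be {playback_minutes:g} minutes "
        f"(roughly {budget} words)."
    )


if __name__ == "__main__":
    print(build_podcast_prompt("technical news", 10))
```

Stating a word budget alongside the duration is a design choice of this sketch: a text-generating model has no direct notion of playback time, so a length hint gives it something it can act on.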
[0541] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0542] Step 1:
[0543] Users access the application using their smartphones or other devices and enter their preferred information genres and desired playback times. The application retrieves the entered data and sends it to the server. This input includes information such as "technical news" and "every morning at 8 AM," which serves as the basic data for subsequent processing.
[0544] Step 2:
[0545] The server collects appropriate information from its database of information sources and the internet based on the information genre received from the user. The server uses an API to retrieve information of the specified genre from news sites and specialized blogs. Here, the input is the information genre requested by the user, and the output is a raw data set related to that genre.
[0546] Step 3:
[0547] The server analyzes and summarizes the collected information using natural language processing software. Specifically, it uses natural language processing tools (e.g., spaCy) to analyze text and extract and compress important elements. The input is raw information data, and the output is summarized information provided to the user.
[0548] Step 4:
[0549] Based on the summarized information, the server generates audio data using speech synthesis technology. The summarized data is input into speech synthesis software (e.g., Amazon Polly) and converted into a speech format that closely resembles a human voice. Background sounds are added to this generated audio data as needed. The output is the final audio file.
[0550] Step 5:
[0551] The generated audio data is sent from the server to the user's terminal. The terminal is configured to automatically play the audio data at the specified playback time. The output here is the audio information that the user hears on the terminal.
[0552] Step 6:
[0553] Users provide feedback after listening. The server receives this feedback and applies machine learning algorithms to optimize future information delivery to better suit the user's preferences. The input is the user's feedback and behavioral history, and the output is a refined information delivery method.
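The feedback loop in Step 6 can be illustrated with a deliberately simple stand-in. The sketch below keeps one weight per genre and blends each new rating into it with an exponential moving average. This is a substitute technique chosen for brevity, not the learning algorithm of the disclosure, which is not specified; the function names, the rating scale of 0 to 1, and the default weight of 0.5 are all assumptions.

```python
from typing import Dict, List


def update_genre_weights(
    weights: Dict[str, float],
    genre: str,
    rating: float,
    learning_rate: float = 0.2,
) -> Dict[str, float]:
    """Blend a rating in [0, 1] into the stored weight for a genre."""
    if not 0.0 <= rating <= 1.0:
        raise ValueError("rating must be between 0 and 1")
    if not 0.0 < learning_rate <= 1.0:
        raise ValueError("learning_rate must be in (0, 1]")
    updated = dict(weights)  # leave the caller's dictionary untouched
    previous = updated.get(genre, 0.5)  # assumed neutral starting weight
    updated[genre] = (1 - learning_rate) * previous + learning_rate * rating
    return updated


def rank_genres(weights: Dict[str, float]) -> List[str]:
    """Order genres from most to least preferred."""
    return sorted(weights, key=lambda g: weights[g], reverse=True)
```

A server could use the ranked list to decide how much of the playback time to allocate to each genre in the next program.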
[0554] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0555] This invention is a system for providing personalized voice information that takes user emotions into consideration. By incorporating an emotion engine, this system enables information provision based on the user's emotional state.
[0556] First, the user accesses the application and enters their preferred information genres and desired playback time. The user's past preference and emotion history can also be stored in a database to further personalize the system. This allows for more refined information delivery.
[0557] The server receives information sent by the user and gathers information from internet resources and databases related to the specified information genre. The collected information is analyzed and summarized using natural language processing technology. Meanwhile, an emotion engine built into the terminal recognizes the user's emotions in real time through their voice and interactions and sends that data to the server.
[0558] The server uses emotion engine data to adjust the tone of the audio data to match the user's current emotions. For example, if the server determines that the user is stressed, it will adjust the audio to provide relaxing information and select a more calming tone. Background sounds and jingles are added to this audio data to further enhance the listening experience.
[0559] As a concrete example, suppose a user enjoys sports news and typically wants to refresh themselves during their morning commute. The server collects sports news, the emotion engine analyzes the user's mood for the day, and the server generates audio in a relaxed tone. The generated audio is sent to the user's device and automatically plays at the set time. The user can then listen to a radio program optimized for their mood that day.
[0560] In this way, the system can adapt to the user's emotions and preferences, providing more personalized voice information. This enables users to make better use of their time in daily life and improve the quality of information they receive.
[0561] The following describes the processing flow.
[0562] Step 1:
[0563] The user launches the application and enters their preferred information genres, desired playback time, and current emotional state. Past preference history is updated as needed.
[0564] Step 2:
[0565] The emotion engine built into the device recognizes the user's emotional state in real time through their voice and facial expressions, and sends that data to the server.
[0566] Step 3:
[0567] Based on the received information genre and emotion data, the server queries relevant internet resources and databases to collect the latest information.
[0568] Step 4:
[0569] The server analyzes the collected information using natural language processing technology, extracting and summarizing important and relevant information. In doing so, it takes into account the user's emotional state.
[0570] Step 5:
[0571] The server adjusts the tone and content of the audio data to match the analyzed emotional data. For example, if the user wants to relax, it will choose a calm tone and an interesting topic.
[0572] Step 6:
[0573] The server uses speech synthesis technology to convert the summarized information into audio data, and adds background sounds and jingles to enhance the listening experience.
[0574] Step 7:
[0575] The server sends the generated audio data to the user's device. The audio data becomes available for playback according to the user's set playback time.
[0576] Step 8:
[0577] The device automatically plays audio data at a specified time. The user listens to information optimized for their emotional state and interests.
[0578] Step 9:
[0579] Users provide feedback via their device after listening. This feedback is sent to the server and used to optimize future information delivery.
[0580] Step 10:
[0581] The server accumulates user feedback and past usage history, and applies machine learning to further optimize its information delivery strategy. This updating is continuous, so the information delivered becomes increasingly personalized.
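The tone adjustment in Step 5 of the flow above can be sketched as a mapping from an estimated emotion label to speech parameters. The example below expresses the result as SSML using the standard `prosody` element with its `rate` and `pitch` attributes. The emotion labels, the mapping values, and the function name `to_ssml` are hypothetical, and whether a given speech synthesis engine honors each attribute depends on the engine and voice.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from emotion label to SSML prosody settings.
TONE_BY_EMOTION = {
    "stressed": {"rate": "slow", "pitch": "low"},
    "tired": {"rate": "slow", "pitch": "medium"},
    "cheerful": {"rate": "medium", "pitch": "high"},
}
DEFAULT_TONE = {"rate": "medium", "pitch": "medium"}


def to_ssml(summary: str, emotion: str) -> str:
    """Wrap summary text in SSML whose prosody reflects the emotion label."""
    tone = TONE_BY_EMOTION.get(emotion, DEFAULT_TONE)
    return (
        "<speak>"
        f'<prosody rate="{tone["rate"]}" pitch="{tone["pitch"]}">'
        f"{escape(summary)}"  # escape &, <, > so the SSML stays well-formed
        "</prosody></speak>"
    )


if __name__ == "__main__":
    print(to_ssml("Today's sports headlines.", "stressed"))
```

Unknown labels fall back to a neutral tone, so an emotion engine that returns an unexpected value does not break synthesis.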
[0582] (Example 2)
[0583] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0584] When providing information to users, simply broadcasting standardized information via audio presents a challenge in adequately satisfying the emotional needs and preferences of individual users. Furthermore, improving the quality of information provided and the user experience requires considering the user's individual emotional state and past preference history, but there is a lack of systems that can appropriately reflect this.
[0585] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0586] In this invention, the server includes means for receiving information genres and desired playback times from the user, means for collecting information from information resources, and means for analyzing and summarizing the collected information using natural language processing technology. This makes it possible to provide personalized audio information tailored to the user's emotional state.
[0587] "Information genre" refers to the specific types or categories of information that users are interested in.
[0588] "Desired playback time" refers to the time period or duration during which the user wishes to receive the information.
[0589] "Information resources" refers to a collection of internet resources and databases that are accessed to obtain information.
[0590] "Natural language processing technology" refers to methods and techniques that enable computers to understand and process the language that humans use on a daily basis.
[0591] A "generative AI model" refers to algorithms and technologies that use artificial intelligence to generate text, speech, and other data.
[0592] An "emotion engine" refers to a technology or system that recognizes a user's emotional state and performs processing according to that state.
[0593] "Machine learning" refers to the technology of learning patterns from data to automate or improve specific tasks.
[0594] "Tone adjustment" refers to modifying the sound quality and expression of audio data to match the user's emotions and preferences.
[0595] This system aims to provide personalized audio information based on the user's emotional state and preferences. First, the user enters their preferred information genres and desired playback time through the application. This information is received by the server. The server utilizes databases and the internet to collect content from information resources related to the specified genres. The collected information is analyzed and summarized using natural language processing techniques. Specifically, text summarization models and topic modeling are used to concisely summarize vast amounts of information.
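As a rough illustration of text summarization, the following sketch performs frequency-based extractive summarization: it scores each sentence by the average frequency of its non-stopword terms and keeps the highest-scoring sentences in their original order. This is a simple stand-in, not the summarization model of the disclosure; the stopword list, the function names, and the English-only tokenization are assumptions.

```python
import re
from collections import Counter
from typing import List

# Small, assumed stopword list for English text.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "for", "on", "that", "it"}


def _tokens(text: str) -> List[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def extractive_summary(text: str, max_sentences: int = 3) -> str:
    """Keep the top-scoring sentences, preserving their original order."""
    sentences = split_sentences(text)
    freq = Counter(_tokens(text))

    def score(sentence: str) -> float:
        tokens = _tokens(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```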
[0596] The device incorporates an emotion engine that recognizes emotions in real time through the user's voice and interactions. The recognized emotion data is sent to a server and used to generate audio data. The server uses this emotion data and collected information to create audio data using a generative AI model. The tone and pace of the voice are adjusted to suit the user's emotional state. Furthermore, background sounds and jingles are added to the audio data to enhance the listening experience.
[0597] The final audio data is sent to the user's device and automatically played at the specified time. For example, if a user requests to "listen to the morning news for 10 minutes," the server will collect the latest news, and if the emotion engine determines that the user is in relaxation mode, it will generate audio in a calm tone. Prompts may include phrases such as, "Generate an appropriate audio tone when the user is seeking relaxation," or "Create optimal audio information based on the user's past preference history and current mood."
[0598] This system utilizes conventional computers and smart devices as hardware, and leverages speech recognition software and generative AI models as software. As a result, users can acquire information on a daily basis that is more emotionally rich and tailored to their needs, thereby improving their quality of life.
[0599] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0600] Step 1:
[0601] The user accesses the application and enters their preferred information genre and desired playback time. This information is received by the server. The input here consists of "information genre" and "desired playback time," and the server prepares to collect information based on this. For example, when the user enters "economic news" as the information genre and "7 AM" as the desired playback time, the server begins collecting information for playback at that time.
[0602] Step 2:
[0603] The server collects relevant information from internet resources and databases based on the received information genre. It generates queries for information collection and performs database access and web crawling. The input is the "information genre," and the output is a "set of related information." For example, it might collect the latest articles related to "economic news."
[0604] Step 3:
[0605] The server analyzes and summarizes the collected relevant information using natural language processing techniques. It uses an information processing engine to summarize the input "set of relevant information" and output simplified "summary information." Specifically, it uses a text summarization algorithm to extract only the most important news points.
[0606] Step 4:
[0607] The device uses an emotion engine to recognize the user's real-time emotional state. It analyzes voice input and interactions to determine the emotional state. The input is "user voice data" and "interaction data," and the output is "the user's emotional state." Specifically, it uses speech recognition and machine learning models to determine the user's emotions from the tone of their voice and interactions.
[0608] Step 5:
[0609] The server generates audio data using a generative AI model based on the analyzed user's emotional state and summary information. The input is "summary information" and "user's emotional state," and the output is "audio data." Specifically, it generates audio using speech synthesis technology with a tone and pace that corresponds to the emotion.
[0610] Step 6:
[0611] The server performs final adjustments to the generated audio data by adding background sounds and jingles. The input here is the "audio data," and the output is the "adjusted audio data." Specifically, it selects background sounds from a pre-prepared audio library and overlays them onto the audio data.
[0612] Step 7:
[0613] The server sends the adjusted audio data to the user's terminal. The input is the "adjusted audio data," and the output is the audio data as delivered to the user's terminal. Concretely, the data is transmitted over a network.
[0614] Step 8:
[0615] The user receives the information when the terminal plays the audio data at the set time. The input is the "audio data received by the user's terminal," and the output is "audio playback." In operation, the audio plays automatically at the specified time.
[0616] Through this series of steps, the system can provide personalized information tailored to the user's needs and emotions.
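The overlay in Step 6 can be illustrated with a minimal sketch. It assumes that both tracks are already decoded into lists of floating-point samples in the range -1.0 to 1.0 at the same sample rate. The function name `overlay_background` and the default gain are hypothetical and do not appear in the disclosure.

```python
from typing import List


def overlay_background(speech: List[float], background: List[float],
                       background_gain: float = 0.2) -> List[float]:
    """Mix a background track under a speech track, sample by sample.

    Assumption: both inputs hold samples in [-1.0, 1.0] at the same rate.
    """
    if not background:
        return list(speech)
    mixed = []
    for i, sample in enumerate(speech):
        # Loop the background track when it is shorter than the speech.
        bg = background[i % len(background)] * background_gain
        # Clip the sum so the mixed sample stays in the valid range.
        mixed.append(max(-1.0, min(1.0, sample + bg)))
    return mixed
```

Keeping the background gain well below 1.0 is a design choice intended to keep the spoken summary intelligible over the background sound.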
[0617] (Application Example 2)
[0618] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0619] Conventional voice information systems have struggled to provide information tailored to the user's emotional state, so the information is often not optimized for the individual user. As a result, users sometimes feel that the information they receive is not sufficiently personalized. In particular, current technology lacks a way to deliver voice information in a tone appropriate to the user's emotions.
[0620] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0621] In this invention, the server includes means for receiving the type of information of interest and the desired playback time from the user, means for collecting information from data storage devices and network resources based on the received information type, and means for analyzing and summarizing the collected information. This makes it possible to provide personalized audio information that corresponds to the user's emotional state.
[0622] A "user" refers to an individual or organization that uses an information system or device to receive services or data.
[0623] "Type of information of interest" refers to a classification of the categories and fields of information that a user is interested in.
[0624] "Desired playback time" refers to a specific time slot or date when a user wishes to play information or content.
[0625] A "data storage device" refers to a device or medium that stores large amounts of data and allows access and retrieval of it as needed.
[0626] "Network resources" refer to data, information, and services available on the internet and other networks.
[0627] "Methods for analyzing and summarizing information" refers to techniques for analyzing large amounts of information data, extracting key points and important elements, and making them concise.
[0628] "Audio format" refers to a format in which digital or analog data is converted into an audio signal and made playable.
[0629] "User equipment" refers to electronic devices such as computers, smartphones, and tablets that are used directly by the user.
[0630] An "emotion analysis device" refers to a device or software that analyzes a user's emotions and psychological state and makes that data available for use.
[0631] "Adjusting the tone of an audio format" refers to changing the pitch, speed, intonation, etc., of audio data to express specific emotions or moods.
[0632] The system for implementing this invention consists of a server and a user terminal working together. The server receives the type of information of interest and the desired playback time from the user and, based on these, collects relevant information from data storage devices and network resources. It then analyzes and summarizes the information using natural language processing services such as those offered by Google Cloud and IBM Watson. When the summarized information is converted into audio format for transmission to the user terminal, the system analyzes the user's emotions using an emotion analysis device such as Microsoft Azure's Emotion API, and adjusts the tone of the audio so that the information is delivered in a manner appropriate to the emotion.
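One possible way to pass the tone adjustment described above to a speech synthesis engine is SSML, whose `prosody` element carries `rate` and `pitch` attributes. The following is a sketch under stated assumptions: the emotion labels, the mapping table, and the function name `to_ssml` are hypothetical and are not taken from the disclosure.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from an estimated emotion label to SSML prosody values.
PROSODY_BY_EMOTION = {
    "stressed": {"rate": "slow", "pitch": "low"},
    "relaxed": {"rate": "medium", "pitch": "medium"},
    "tired": {"rate": "medium", "pitch": "high"},
}
DEFAULT_PROSODY = {"rate": "medium", "pitch": "medium"}


def to_ssml(summary: str, emotion: str) -> str:
    """Wrap summary text in an SSML prosody element chosen by emotion."""
    prosody = PROSODY_BY_EMOTION.get(emotion, DEFAULT_PROSODY)
    return (
        '<speak><prosody rate="{rate}" pitch="{pitch}">{text}</prosody></speak>'
    ).format(
        rate=prosody["rate"],
        pitch=prosody["pitch"],
        text=escape(summary),  # escape &, < and > so the markup stays valid
    )
```

Unknown emotion labels fall back to neutral settings, so a failed emotion analysis does not block audio generation.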
[0633] Users receive audio information via their smartphones or tablets, timed to their preferred schedule. This information is personalized according to the user's emotional state, providing a more intuitive and satisfying listening experience. For example, a feature is implemented that plays uplifting audio during the morning commute. The system applies automated learning technology based on user feedback to optimize information delivery. In the future, further personalization using more sophisticated generative AI models is being considered.
[0634] For example, if a user is determined to be in a relaxed mood on a Monday morning, they can receive audio content that includes calming background music along with the day's news and weather information. This allows users to access information optimized for their emotions, making daily information acquisition more comfortable. By applying the prompt, "Use a generative AI model to suggest audio content best suited to the user's emotions," AI-driven data processing becomes possible.
[0635] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0636] Step 1:
[0637] The user enters the type of information of interest and the desired playback time via the terminal. The terminal sends this input to the server. Using the received values, the server prepares to access data storage devices and network resources to retrieve the necessary information.
[0638] Step 2:
[0639] The server collects data related to the user-specified type of information of interest from data storage devices and network resources. The collected data is analyzed using natural language processing techniques to generate a summary. The input is the user-specified type of information, and the output is the summarized information data.
[0640] Step 3:
[0641] The server converts summarized information into audio format. The input is summarized text information, and the output is audio data. At this time, an emotion analysis device is used to analyze the user's emotional state, and the tone of the audio data is adjusted based on the results. The results of the emotion analysis are used as parameters for adjusting the audio tone.
[0642] Step 4:
[0643] The server sends the adjusted audio data to the user terminal. The input is tone-adjusted audio data, and the output is the transmission of data to the user terminal.
[0644] Step 5:
[0645] The user's device plays the received audio data at the specified playback time. The audio data being played may include background sounds and jingles to enhance the user's listening experience.
[0646] Step 6:
[0647] After listening, users send feedback to the server regarding their emotional state and satisfaction with the information. The server receives this feedback and uses machine learning algorithms to optimize information delivery. The feedback is used as data to improve future information delivery.
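The disclosure does not specify the machine learning algorithm used in Step 6. As a minimal stand-in, the sketch below updates a per-genre preference weight with an exponential moving average of the reported satisfaction. The function name `update_preferences`, the 0.0 to 1.0 satisfaction scale, and the learning rate are all assumptions made for illustration.

```python
from typing import Dict


def update_preferences(weights: Dict[str, float], genre: str,
                       satisfaction: float,
                       learning_rate: float = 0.3) -> Dict[str, float]:
    """Move the weight of `genre` toward the reported satisfaction.

    Assumption: satisfaction is a score between 0.0 and 1.0.
    """
    if not 0.0 <= satisfaction <= 1.0:
        raise ValueError("satisfaction must be between 0.0 and 1.0")
    updated = dict(weights)  # leave the caller's dictionary untouched
    current = updated.get(genre, 0.5)  # neutral prior for unseen genres
    updated[genre] = current + learning_rate * (satisfaction - current)
    return updated
```

A server could then rank candidate items by these weights when selecting content for the next delivery.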
[0648] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0649] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0650] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0651] [Fourth Embodiment]
[0652] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0653] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0654] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0655] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0656] The microphone 238 receives instructions from the user 20 by picking up the user's voice. It captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0657] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0658] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0659] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0660] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0661] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0662] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0663] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0664] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0665] The system of this invention provides personalized audio information by allowing the user to input a specific information genre and desired playback time. The system's operation is mainly performed by a server and a user terminal, which are described below.
[0666] When a user accesses the application, they select the information genres they are interested in and enter the time at which they want to listen to the radio program each day. These entries define the requirements for the content to be delivered. Once the user enters and submits this information, the device sends the data to the server.
[0667] The server collects information in real time from the internet and existing databases based on specified information genres. By querying diverse information sources such as news sites and APIs, it gathers the latest and most relevant information. The server then summarizes the collected information using natural language processing techniques, extracting the essence of entertainment and news.
[0668] Based on this summarized information, the server generates audio data using speech synthesis technology. The audio data is customized to match the user's language and voice tone, and background sounds and jingles are added to further enhance the listening experience.
[0669] The generated audio data is sent from the server to the user's device. The user's device automatically plays this audio data at a set time, allowing the user to listen to the information hands-free.
[0670] As a concrete example, suppose a user is interested in international news and entertainment and sets the radio playback time to 7 AM every morning. The server collects the latest information from international news sites and entertainment blogs, creates a summary, and then converts it into an audio radio program. The user's device plays the audio at 7 AM, allowing the user to listen to the latest news during their commute.
[0671] Furthermore, based on user feedback and usage history, the server optimizes the selection of information it provides using machine learning technology. This allows the system to provide information that better matches user preferences over time. In this way, information tailored to user needs can be efficiently delivered.
[0672] The following describes the processing flow.
[0673] Step 1:
[0674] The user launches the application and enters their preferred information genres and desired radio playback time. This information is then sent from the user's device to the server.
[0675] Step 2:
[0676] Based on the genre of information it receives, the server accesses relevant internet resources and databases to collect the latest information.
[0677] Step 3:
[0678] The server analyzes the collected information using natural language processing technology, extracts important information, and creates a summary. Here, it evaluates the relevance and news value of the information and tailors it to the user's interests.
[0679] Step 4:
[0680] The server inputs the generated summary into a speech synthesis system and converts it into audio data according to the specified language and voice tone. Furthermore, it enhances the listening experience by adding background sounds and jingles to the audio data.
[0681] Step 5:
[0682] The server sends the generated audio data to the user's device. The transmission is scheduled according to the playback time set by the user.
[0683] Step 6:
[0684] The device plays audio data at a set time. Users can listen to personalized information through the audio.
[0685] Step 7:
[0686] Users can provide feedback after listening. This feedback is sent from the device to the server and used to improve the information provided.
[0687] Step 8:
[0688] The server accumulates user feedback and past usage history, and applies machine learning to optimize the information collection and delivery process. This continuous learning process ensures that subsequent information is better aligned with the user's needs.
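The scheduling in Steps 5 and 6 depends on finding the next occurrence of the user's daily playback time. The following sketch shows one way to compute it. It assumes naive local datetimes without time zone handling, and the function name `next_playback` is hypothetical.

```python
from datetime import datetime, time, timedelta


def next_playback(now: datetime, playback_time: time) -> datetime:
    """Return the next occurrence of the daily playback time after `now`.

    Assumption: `now` is a naive local datetime (no time zone handling).
    """
    candidate = datetime.combine(now.date(), playback_time)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now): use tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

For a user who chooses 7 AM every morning, a request made at 6:30 AM would be scheduled for the same day, and one made at 7:00 AM or later for the following day.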
[0689] (Example 1)
[0690] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0691] In recent years, in our information-saturated society, users are required to efficiently and accurately acquire information that matches their interests and lifestyles. However, conventional information delivery systems have struggled to select and personalize information suitable for users from a vast amount of data. Furthermore, audio-based information lacks the flexibility to be customized to the user's preferences.
[0692] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0693] In this invention, the server includes means for receiving information areas of interest and desired playback time from the user; means for collecting information from a data store and a communication network based on the received information areas; means for using natural language processing technology to process and summarize the collected information; means for converting the summarized information into audio data using speech synthesis technology; and means for adding background sounds and sound effects to the audio data. This enables the user to effectively obtain personalized audio information tailored to their interests and lifestyle at a specified time.
[0694] A "user" is an entity that receives services from an information system that meet their own interests and needs.
[0695] "Areas of interest" refers to categories of specific topics or content that a user is interested in.
[0696] "Desired playback time" refers to the specific time period that the user specifies for listening to the audio information.
[0697] A "data store" refers to a storage device used to store and manage collected information and data.
[0698] A "communication network" refers to the infrastructure that enables the exchange of information and data, and includes the internet.
[0699] "Natural language processing technology" is a general term for technologies that enable computers to understand and process human language.
[0700] "Speech synthesis technology" refers to the technology that converts text information into speech.
[0701] "Background sounds and sound effects" refer to sounds used to complement the main audio and improve the listening experience.
[0702] An "automated learning algorithm" refers to a process that uses machine learning techniques to automatically learn patterns and rules from data and make decisions based on those patterns.
[0703] A "generative AI model" refers to an artificial intelligence model that is trained for a specific purpose and generates text, audio, images, and other data.
[0704] A "prompt statement" refers to an instruction given to a generative AI model to perform a specific task.
[0705] The system of the present invention provides customized audio information by allowing the user to input a specific information field and desired playback time. The system's implementation is primarily carried out through the operation of a server, a terminal, and the user.
[0706] Users use a device with a dedicated application installed to input their areas of interest (e.g., international news, entertainment) and the time they wish to receive the information (e.g., 7 AM every morning). This information is transmitted to the server via a communication network.
[0707] Based on the information received, the server collects information in real time via data stores and the internet. For example, it retrieves the latest information from news site APIs and RSS feeds. Query techniques are used here to efficiently collect data.
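Retrieving items from an RSS feed, as mentioned above, can be sketched with the Python standard library. The example parses an inline sample document instead of fetching a live feed. The sample feed content, the `example.com` links, and the function name `parse_rss_items` are placeholders introduced for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Placeholder RSS 2.0 document standing in for a real news feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example International News</title>
    <item><title>Summit opens</title><link>https://example.com/a</link></item>
    <item><title>Markets steady</title><link>https://example.com/b</link></item>
  </channel>
</rss>"""


def parse_rss_items(rss_text: str) -> List[Dict[str, str]]:
    """Extract the title and link of every item element in an RSS document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items
```

In a deployed system the feed text would come from an HTTP request to the news source, and the returned items would be filtered by the user's areas of interest.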
[0708] Next, the server processes the collected information using natural language processing techniques. This process uses a generative AI model to summarize the text. A concrete example of a prompt is, "Summarize the given news in three lines." This allows for the extraction of only the necessary information.
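The prompt-based summarization above can be sketched as follows. The generative AI model is represented by a caller-supplied function `generate`, because the disclosure does not fix a particular model or API. The names `build_summary_prompt` and `summarize` and the prompt layout are assumptions.

```python
from typing import Callable

# Prompt text quoted from the description.
SUMMARY_INSTRUCTION = "Summarize the given news in three lines."


def build_summary_prompt(article_text: str) -> str:
    """Combine the instruction with the article text (hypothetical layout)."""
    return f"{SUMMARY_INSTRUCTION}\n\nNews:\n{article_text.strip()}"


def summarize(article_text: str, generate: Callable[[str], str]) -> str:
    """Send the prompt to `generate` and keep at most three non-empty lines.

    `generate` is a placeholder for whatever generative AI model is used.
    """
    raw = generate(build_summary_prompt(article_text))
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    return "\n".join(lines[:3])
```

Truncating the model output to three lines is a defensive choice, since a model may not always follow the length instruction exactly.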
[0709] The summarized information is converted into audio format by the server using speech synthesis technology. This conversion utilizes a speech synthesis engine, and the voice tone is adjusted according to the user's language and preferences. Furthermore, background sounds and sound effects are added to enhance the listening experience.
[0710] The generated audio data is sent from the server to the user's device. The device plays the audio at the specified time, allowing the user to efficiently listen to the information while engaging in their daily activities. For example, it is possible to check the latest international news in audio format while commuting.
[0711] Furthermore, based on user feedback and past usage history, the server applies machine learning algorithms to optimize the selection of information it provides. This enhances its ability to deliver information that better matches the user's interests and preferences over time.
[0712] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0713] Step 1:
[0714] Users input their areas of interest and desired playback time through a dedicated application. The user's device structures this input data in JSON format and sends it to the server. The input includes specific information such as "International News" and "Every morning at 7:00".
[0715] Step 2:
[0716] The server parses the received JSON data and collects information from relevant data stores and online sources (e.g., news APIs, RSS feeds) based on areas of interest. The server uses queries to search for information that matches the specified conditions and retrieves the latest data. This process outputs relevant information, such as specific news articles or entertainment news.
[0717] Step 3:
[0718] The server processes the collected information using natural language processing technology. Specifically, it summarizes long texts by prompting a generative AI model with the message, "Summarize the given news in three lines." The output is a summary that focuses on the points important to the user.
[0719] Step 4:
[0720] The server converts the summarized information into audio data using speech synthesis technology. The speech synthesis engine generates speech that matches the language and tone of voice specified by the user, and further enhances the listening experience by adding background music and sound effects. The generated audio file is then output.
[0721] Step 5:
[0722] The server sends audio data to the user's device. This transmission uses data transfer technology over the internet. The user's device prepares to start playing this audio data at a specified playback time (e.g., 7 AM every morning) using an alarm or notification.
[0723] Step 6:
[0724] The user's device plays audio data at the set time. Users can listen to it during their commute or while doing household chores without any special operation. This allows users to efficiently acquire personalized information.
[0725] Step 7:
[0726] The server collects user feedback and past usage history, and applies machine learning algorithms to improve the accuracy and relevance of the information it provides. This enables the provision of information tailored to the user's preferences over time.
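The JSON exchange in Steps 1 and 2 can be illustrated with a short sketch. The disclosure states only that the input is structured as JSON. The field names `interests` and `playback_time`, the `HH:MM` time format, and the class and function names are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class PlaybackRequest:
    interests: List[str]
    playback_time: str  # assumed format: "HH:MM" on a 24-hour clock


def parse_request(payload: str) -> PlaybackRequest:
    """Parse and validate the JSON payload sent by the user's device."""
    data = json.loads(payload)
    interests = data["interests"]
    playback_time = data["playback_time"]
    if not isinstance(interests, list) or not interests:
        raise ValueError("interests must be a non-empty list")
    hours, minutes = playback_time.split(":")
    if not (0 <= int(hours) <= 23 and 0 <= int(minutes) <= 59):
        raise ValueError("playback_time must be HH:MM")
    return PlaybackRequest(
        interests=[str(item) for item in interests],
        playback_time=playback_time,
    )
```

For example, a device could send `{"interests": ["International News"], "playback_time": "07:00"}` to represent "International News" and "Every morning at 7:00".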
[0727] (Application Example 1)
[0728] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0729] Due to information overload, users face the challenge of not being able to efficiently select the information they need and process it at their preferred time. Furthermore, the lack of sufficient personalized information tailored to users' interests and needs means they often receive a large amount of irrelevant information. Additionally, the lack of a centralized method for receiving information across multiple media formats increases the effort required for users.
[0730] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0731] In this invention, the server includes means for receiving the user's areas of interest and desired playback time, means for collecting information from information sources, and means for analyzing and summarizing the collected information. This allows the user to select information according to their interests and needs and listen to it efficiently at a specified time.
[0732] A "user" is someone who uses a device to receive information, specifying a particular information genre and desired playback time.
[0733] "Information genres of interest" refer to categories of information that users are particularly interested in and wish to collect or share.
[0734] "Desired playback time" refers to the specific time when the user wishes to have the information played back.
[0735] "Information sources" refer to resources such as databases, websites, and APIs that provide information available on the internet.
[0736] "Means of information gathering" refer to the methods and techniques for extracting necessary information from information sources.
[0737] "Methods for analysis and summarization" refer to techniques for processing collected information, extracting its key points, and presenting them concisely.
[0738] "Audio data" refers to data that has been converted into audio format from summarized information intended to be provided to users.
[0739] "Background sound" refers to additional music or sound effects played alongside audio data, designed to enhance the listening experience.
[0740] A "user terminal" is an electronic device used to receive and play audio data.
[0741] "Means of playback" refers to technologies that enable users to listen to audio data on their devices.
[0742] "Feedback" refers to information and opinions provided by users, which are used to improve the services provided.
[0743] A "learning algorithm" is a computational method for optimizing information delivery based on feedback and historical data.
[0744] "Multiple media formats" refers to providing information in various formats, including not only audio, but also text, images, and more.
[0745] To implement this invention, a system is used in which a server and a user terminal cooperate. When a user uses a smartphone or other device to input and submit their preferred information genre and desired playback time, the server accepts this input and begins processing it.
[0746] The server collects necessary information from existing databases and various information sources on the internet. Specifically, it accesses news databases and websites such as blogs via APIs to retrieve content related to a specified genre. Next, it uses natural language processing software (e.g., NLTK or spaCy) running on the server side to analyze and summarize the collected information. This extracts the essence of the information and presents it in a format that is easy for the user to understand.
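The analysis and summarization step above is not specified in detail. As an illustration only, the sketch below performs simple frequency-based extractive summarization with the Python standard library, rather than NLTK or spaCy. The stopword list, the scoring rule, and the function names are assumptions.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "on", "for", "it"}


def _tokens(text: str) -> List[str]:
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text: str, max_sentences: int = 2) -> List[str]:
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = split_sentences(text)
    freq = Counter(_tokens(text))

    def score(sentence: str) -> float:
        words = _tokens(sentence)
        # Average word frequency, so long sentences are not favored unduly.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return [sentences[i] for i in keep]
```

A production system would more likely rely on an NLP library or a generative AI model, but the input and output shape would be the same: raw text in, a short list of key sentences out.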
[0747] Next, the summarized information is converted into audio data using speech synthesis technology, such as Amazon's speech synthesis service. Background sounds may be added to this audio data to enhance listening comfort. The generated audio data is sent from the server to the user's device and automatically played at a set time. This allows users to obtain necessary information hands-free in their daily lives.
[0748] The system collects user feedback and past usage history, and uses a learning algorithm to optimize the information it provides. This allows for more personalized information delivery over time.
[0749] For example, if a user requests "technical news" and "8 AM every morning," a summary audio of the latest technology-related news will automatically play at that time every morning. An example of a prompt might be, "Generate a summary of technical news and output it in podcast format. The playback time should be 10 minutes." This system allows users to efficiently and continuously obtain the latest information tailored to their interests.
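The 10-minute playback target in the example prompt can be turned into a word budget for the summaries. The sketch below assumes a speaking rate of 150 words per minute, an illustrative figure not taken from the disclosure. The function names are hypothetical.

```python
from typing import List


def word_budget(playback_minutes: float, words_per_minute: int = 150) -> int:
    """Estimate how many words fit in the requested playback time.

    Assumption: the synthesized voice speaks at `words_per_minute`.
    """
    if playback_minutes <= 0 or words_per_minute <= 0:
        raise ValueError("playback_minutes and words_per_minute must be positive")
    return int(playback_minutes * words_per_minute)


def fit_to_budget(summaries: List[str], budget: int) -> List[str]:
    """Keep whole summaries, in order, until the word budget would be exceeded."""
    selected: List[str] = []
    used = 0
    for summary in summaries:
        count = len(summary.split())
        if used + count > budget:
            break
        selected.append(summary)
        used += count
    return selected
```

Under this assumption, a 10-minute program corresponds to roughly 1,500 words of summarized text.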
[0750] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0751] Step 1:
[0752] Users access the application using their smartphones or other devices and enter their preferred information genres and desired playback times. The application retrieves the entered data and sends it to the server. This input includes information such as "technical news" and "every morning at 8 AM," which serves as the basic data for subsequent processing.
[0753] Step 2:
[0754] The server collects appropriate information from its database of information sources and the internet based on the information genre received from the user. The server uses an API to retrieve information of the specified genre from news sites and specialized blogs. Here, the input is the information genre requested by the user, and the output is a raw data set related to that genre.
[0755] Step 3:
[0756] The server analyzes and summarizes the collected information using natural language processing software. Specifically, it uses natural language processing tools (e.g., spaCy) to analyze text and extract and compress important elements. The input is raw information data, and the output is summarized information provided to the user.
[0757] Step 4:
[0758] Based on the summarized information, the server generates audio data using speech synthesis technology. The summarized data is input into speech synthesis software (e.g., Amazon Polly) and converted into a speech format that closely resembles a human voice. Background sounds are added to this generated audio data as needed. The output is the final audio file.
[0759] Step 5:
[0760] The generated audio data is sent from the server to the user's terminal. The terminal is configured to automatically play the audio data at the specified playback time. The output here is the audio information that the user hears on the terminal.
[0761] Step 6:
[0762] Users provide feedback after listening. The server receives this feedback and applies machine learning algorithms to optimize future information delivery to better suit the user's preferences. The input is the user's feedback and behavioral history, and the output is a refined information delivery method.
[0763] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0764] This invention is a system for providing personalized voice information that takes user emotions into consideration. By incorporating an emotion engine, this system enables information provision based on the user's emotional state.
[0765] First, the user accesses the application and enters their preferred information genres and desired playback time. The user can also store their past preference and emotion history in a database to further personalize the system. This allows for more refined information delivery.
[0766] The server receives information sent by the user and gathers information from internet resources and databases related to the specified information genre. The collected information is analyzed and summarized using natural language processing technology. Meanwhile, an emotion engine built into the terminal recognizes the user's emotions in real time through their voice and interactions and sends that data to the server.
[0767] The server uses emotion engine data to adjust the tone of the audio data to match the user's current emotions. For example, if the server determines that the user is stressed, it will adjust the audio to provide relaxing information and select a more calming tone. Background sounds and jingles are added to this audio data to further enhance the listening experience.
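The tone adjustment described above can be pictured as a lookup from an estimated emotion label to a set of audio parameters. The sketch below is only one possible arrangement: the emotion labels, the `ToneSettings` fields, the numeric values, and the background sound identifiers are hypothetical placeholders, not values taken from this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToneSettings:
    speaking_rate: float  # 1.0 = normal speed
    pitch_shift: float    # semitones relative to the default voice
    background: str       # identifier of a background sound (hypothetical)


# Hypothetical lookup table; labels and values are placeholders.
TONE_BY_EMOTION = {
    "stressed": ToneSettings(0.9, -1.0, "soft_ambient"),
    "relaxed": ToneSettings(1.0, 0.0, "light_acoustic"),
    "energetic": ToneSettings(1.1, 1.0, "upbeat_jingle"),
}
DEFAULT_TONE = ToneSettings(1.0, 0.0, "none")


def select_tone(emotion: str) -> ToneSettings:
    """Return tone settings for an emotion label, falling back to a neutral tone."""
    return TONE_BY_EMOTION.get(emotion.lower(), DEFAULT_TONE)
```

For example, `select_tone("stressed")` yields a slightly slower, lower-pitched voice with a soft background, which corresponds to the calming tone mentioned above.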
[0768] As a concrete example, suppose a user enjoys sports news and typically wants to refresh themselves during their morning commute. The server collects sports news, the emotion engine analyzes the user's mood for the day, and the server generates audio in a relaxed tone. The generated audio is sent to the user's device and automatically plays at the set time. The user can then listen to it like a radio program while receiving information optimized for their mood that day.
[0769] In this way, the system can adapt to the user's emotions and preferences, providing more personalized voice information. This enables users to make better use of their time in daily life and improve the quality of information they receive.
[0770] The following describes the processing flow.
[0771] Step 1:
[0772] The user launches the application and enters their preferred information genres, desired playback time, and current emotional state. Past preference history is updated as needed.
[0773] Step 2:
[0774] The emotion engine built into the device recognizes the user's emotional state in real time through their voice and facial expressions, and sends that data to the server.
[0775] Step 3:
[0776] Based on the received information genre and sentiment data, the server queries relevant internet resources and databases to collect the latest information.
[0777] Step 4:
[0778] The server analyzes the collected information using natural language processing technology, extracting and summarizing important and relevant information. In doing so, it takes into account the user's emotional state.
[0779] Step 5:
[0780] The server adjusts the tone and content of the audio data to match the analyzed emotional data. For example, if the user wants to relax, it will choose a calm tone and an interesting topic.
[0781] Step 6:
[0782] The server uses speech synthesis technology to convert the summarized information into audio data, and adds background sounds and jingles to enhance the listening experience.
[0783] Step 7:
[0784] The server sends the generated audio data to the user's device. The audio data becomes available for playback according to the user's set playback time.
[0785] Step 8:
[0786] The device automatically plays audio data at a specified time. The user listens to information optimized for their emotional state and interests.
[0787] Step 9:
[0788] Users provide feedback via their device after listening. This feedback is sent to the server and used to optimize future information delivery.
[0789] Step 10:
[0790] The server accumulates user feedback and past usage history, and applies machine learning to further optimize its information delivery strategy. This update is continuous and becomes increasingly personalized.
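Steps 7 and 8 above rely on the device knowing when the next playback should start. A minimal sketch of that calculation is shown below, assuming naive local times and a daily playback time; the function names are illustrative, and handling of time zones or missed playbacks is outside what the disclosure specifies.

```python
from datetime import datetime, time, timedelta


def next_playback(now: datetime, playback_time: time) -> datetime:
    """Return the next occurrence of playback_time strictly after now."""
    candidate = datetime.combine(now.date(), playback_time)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow
    return candidate


def seconds_until_playback(now: datetime, playback_time: time) -> float:
    """Seconds the device would wait before starting automatic playback."""
    return (next_playback(now, playback_time) - now).total_seconds()
```

A device could pass the result of `seconds_until_playback(datetime.now(), time(7, 0))` to a timer so that the received audio data starts at 7 AM.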
[0791] (Example 2)
[0792] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0793] When providing information to users, simply broadcasting standardized information via audio presents a challenge in adequately satisfying the emotional needs and preferences of individual users. Furthermore, improving the quality of information provided and the user experience requires considering the user's individual emotional state and past preference history, but there is a lack of systems that can appropriately reflect this.
[0794] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0795] In this invention, the server includes means for receiving information genres and desired playback times from the user, means for collecting information from information resources, and means for analyzing and summarizing the collected information using natural language processing technology. This makes it possible to provide personalized audio information tailored to the user's emotional state.
[0796] "Information genre" refers to the specific types or categories of information that users are interested in.
[0797] "Desired playback time" refers to the time period or duration during which the user wishes to receive the information.
[0798] "Information resources" refers to a collection of internet resources and databases that are accessed to obtain information.
[0799] "Natural language processing technology" refers to methods and techniques that enable computers to understand and process the language that humans use on a daily basis.
[0800] A "generative AI model" refers to algorithms and technologies that use artificial intelligence to generate text, speech, and other data.
[0801] An "emotion engine" refers to a technology or system that recognizes a user's emotional state and performs processing according to that state.
[0802] "Machine learning" refers to the technology of learning patterns from data to automate or improve specific tasks.
[0803] "Tone adjustment" refers to modifying the sound quality and expression of audio data to match the user's emotions and preferences.
[0804] This system aims to provide personalized audio information based on the user's emotional state and preferences. First, the user enters their preferred information genres and desired playback time through the application. This information is received by the server. The server uses databases and the internet to collect content from information resources related to the specified genres. The collected information is analyzed and summarized using natural language processing techniques. Specifically, text summarization models and topic modeling are used to condense large amounts of information.
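To make the summarization step concrete, the following sketch uses a simple frequency-based extractive method as a stand-in for the text summarization models mentioned above. It is not the model used by the system; the stopword list, the scoring rule, and the function name `summarize` are assumptions chosen to keep the example self-contained.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a complete list).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "on", "for"}


def _tokens(text: str) -> list[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text: str, max_sentences: int = 2) -> str:
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_tokens(text))

    def score(sentence: str) -> float:
        tokens = _tokens(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)
```

The selected sentences are returned in their original order so that the resulting text still reads naturally when converted to speech.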
[0805] The device incorporates an emotion engine that recognizes emotions in real time through the user's voice and interactions. The recognized emotion data is sent to a server and used to generate audio data. The server uses this emotion data and collected information to create audio data using a generative AI model. The tone and pace of the voice are adjusted to suit the user's emotional state. Furthermore, background sounds and jingles are added to the audio data to enhance the listening experience.
[0806] The final audio data is sent to the user's device and automatically played at the specified time. For example, if a user requests to "listen to the morning news for 10 minutes," the server will collect the latest news, and if the emotion engine determines that the user is in relaxation mode, it will generate audio in a calm tone. Prompts may include phrases such as, "Generate an appropriate audio tone when the user is seeking relaxation," or "Create optimal audio information based on the user's past preference history and current mood."
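The prompts quoted above could be assembled from the user's request roughly as follows. This is a sketch under stated assumptions: the speaking rate of 150 words per minute, the field layout of the prompt, and the function names are illustrative, and only the opening sentence of the prompt is taken from the example phrasing in this paragraph.

```python
WORDS_PER_MINUTE = 150  # assumed average narration speed


def word_budget(minutes: float, words_per_minute: int = WORDS_PER_MINUTE) -> int:
    """Approximate number of words that fit in the requested listening time."""
    return int(minutes * words_per_minute)


def build_prompt(genre: str, minutes: float, emotion: str, history: list[str]) -> str:
    """Compose a prompt for a generative AI model from the user's request."""
    past = ", ".join(history) if history else "none recorded"
    return (
        "Create optimal audio information based on the user's past preference "
        "history and current mood.\n"
        f"Genre: {genre}\n"
        f"Current mood: {emotion}\n"
        f"Past preferences: {past}\n"
        f"Target length: about {word_budget(minutes)} words "
        f"({minutes:g} minutes of speech)."
    )
```

For the request "listen to the morning news for 10 minutes" with a relaxed mood, `build_prompt("morning news", 10, "relaxed", [])` asks for about 1500 words under the assumed speaking rate.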
[0807] This system utilizes conventional computers and smart devices as hardware, and leverages speech recognition software and generative AI models as software. As a result, users can acquire information on a daily basis that is more emotionally rich and tailored to their needs, thereby improving their quality of life.
[0808] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0809] Step 1:
[0810] The user accesses the application and enters their preferred information genre and desired playback time. This information is received by the server. The input here consists of "information genre" and "desired playback time," and the server prepares to collect information based on this. For example, when the user enters "economic news" as the information genre and "7 AM" as the desired playback time, the server begins collecting information for playback at that time.
[0811] Step 2:
[0812] The server collects relevant information from internet resources and databases based on the received information genre. It generates queries for information collection and performs database access and web crawling. The input is the "information genre," and the output is a "set of related information." For example, it might collect the latest articles related to "economic news."
[0813] Step 3:
[0814] The server analyzes and summarizes the collected relevant information using natural language processing techniques. It uses an information processing engine to summarize the input "set of relevant information" and output simplified "summary information." Specifically, it uses a text summarization algorithm to extract only the most important news points.
[0815] Step 4:
[0816] The device uses an emotion engine to recognize the user's real-time emotional state. It analyzes voice input and interactions to determine the emotional state. The input is "user voice data" and "interaction data," and the output is "the user's emotional state." Specifically, it uses speech recognition and machine learning models to determine the user's emotions from the tone of their voice and interactions.
[0817] Step 5:
[0818] The server generates audio data using a generative AI model based on the analyzed user's emotional state and summary information. The input is "summary information" and "user's emotional state," and the output is "audio data." Specifically, it generates audio using speech synthesis technology with a tone and pace that corresponds to the emotion.
[0819] Step 6:
[0820] The server performs final adjustments to the generated audio data by adding background sounds and jingles. The input here is the "audio data," and the output is the "adjusted audio data." Specifically, it selects background sounds from a pre-prepared audio library and overlays them onto the audio data.
[0821] Step 7:
[0822] The server sends the adjusted audio data to the user's terminal. The input is the "adjusted audio data," and the output is the "transfer of audio data to the user's terminal." The specific operation is the procedure of sending data over a network.
[0823] Step 8:
[0824] The user receives information when the audio data is played at the time set on their device. The input is the "audio data received by the user's device," and the output is "audio playback." In terms of operation, the audio plays automatically at the specified time.
[0825] Through this series of steps, the system can provide personalized information tailored to the user's needs and emotions.
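The overlay of background sound in Step 6 can be sketched as a sample-wise mix. The example below assumes that both tracks are lists of floating-point samples in the range -1.0 to 1.0 at the same sampling rate, that the background is looped to cover the speech, and that a fixed gain of 0.2 keeps it quieter than the voice. These choices and the function name are illustrative, not details from the disclosure.

```python
def mix_background(speech: list[float], background: list[float], gain: float = 0.2) -> list[float]:
    """Overlay a looped, attenuated background track onto speech samples."""
    if not background:
        return list(speech)
    mixed = []
    for i, sample in enumerate(speech):
        value = sample + gain * background[i % len(background)]
        mixed.append(max(-1.0, min(1.0, value)))  # clip to the valid sample range
    return mixed
```

Clipping is the simplest way to keep the result in range; a production system would more likely normalize or apply a limiter to avoid audible distortion.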
[0826] (Application Example 2)
[0827] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0828] Conventional voice information systems have struggled to provide information tailored to the user's emotional state, so the information delivered is often not optimized for the individual. As a result, users sometimes feel that the information they receive is not sufficiently personalized. In particular, current technology lacks a way to deliver voice information in a tone appropriate to the user's emotions.
[0829] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0830] In this invention, the server includes means for receiving the type of information of interest and the desired playback time from the user, means for collecting information from data storage devices and network resources based on the received information type, and means for analyzing and summarizing the collected information. This makes it possible to provide personalized audio information that corresponds to the user's emotional state.
[0831] A "user" refers to an individual or organization that uses an information system or device to receive services or data.
[0832] "Type of information of interest" refers to a classification of the categories and fields of information that a user is interested in.
[0833] "Desired playback time" refers to a specific time slot or date when a user wishes to play information or content.
[0834] A "data storage device" refers to a device or medium that stores large amounts of data and allows access and retrieval of it as needed.
[0835] "Network resources" refer to data, information, and services available on the internet and other networks.
[0836] "Methods for analyzing and summarizing information" refers to techniques for analyzing large amounts of information data, extracting key points and important elements, and making them concise.
[0837] "Audio format" refers to a format in which digital or analog data is converted into an audio signal and made playable.
[0838] "User equipment" refers to electronic devices such as computers, smartphones, and tablets that are used directly by the user.
[0839] An "emotion analysis device" refers to a device or software that analyzes a user's emotions and psychological state and makes that data available for use.
[0840] "Adjusting the tone of an audio format" refers to changing the pitch, speed, intonation, etc., of audio data to express specific emotions or moods.
[0841] The system for implementing this invention consists of a server and a user terminal working together. The server receives the type of information of interest and the desired playback time from the user and, based on these, collects relevant information from data storage devices and network resources. It then analyzes and summarizes the information using natural language processing services such as those offered by Google Cloud and IBM Watson. When this summarized information is converted into audio format and sent to the user terminal, the system analyzes the user's emotions using an emotion analysis device such as Microsoft Azure's Emotion API. The tone of the audio format is then adjusted so that information is delivered in a manner appropriate to the emotion.
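One common way to pass pitch and speed adjustments to a speech synthesizer is SSML (Speech Synthesis Markup Language), whose `<prosody>` element carries `rate` and `pitch` attributes. The sketch below builds such markup as an illustration; the disclosure does not state that SSML is used, the function name `to_ssml` is hypothetical, and whether a given synthesis engine accepts these exact percentage values depends on that engine.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str, rate_percent: int = 100, pitch_percent: int = 0) -> str:
    """Wrap text in an SSML <prosody> element with rate and pitch settings."""
    return (
        "<speak>"
        f'<prosody rate="{rate_percent}%" pitch="{pitch_percent:+d}%">'
        f"{escape(text)}"  # escape &, <, > so the markup stays well-formed
        "</prosody>"
        "</speak>"
    )
```

For a user judged to need a calmer delivery, a call such as `to_ssml(summary, rate_percent=90, pitch_percent=-5)` would request slightly slower and lower speech.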
[0842] Users receive audio information via their smartphones or tablets, timed to their preferred schedule. This information is personalized according to the user's emotional state, providing a more intuitive and satisfying listening experience. For example, a feature is implemented that plays uplifting audio during the morning commute. The system applies automated learning technology based on user feedback to optimize information delivery. In the future, further personalization using more sophisticated generative AI models is being considered.
[0843] For example, if a user is determined to be in a relaxed mood on a Monday morning, they can receive audio content that includes calming background music along with the day's news and weather information. This allows users to access information optimized for their emotions, making daily information acquisition more comfortable. By applying the prompt, "Use a generative AI model to suggest audio content best suited to the user's emotions," AI-driven data processing becomes possible.
[0844] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0845] Step 1:
[0846] The user enters their type of information of interest and desired playback time via a terminal. The terminal sends this input to the server. The server uses the type of information of interest and desired playback time received from the user to prepare to access data storage devices and network resources to retrieve the necessary information.
[0847] Step 2:
[0848] The server collects data related to the user-specified type of information of interest from data storage devices and network resources. The collected data is analyzed using natural language processing techniques to generate a summary. The input is the user-specified type of information of interest, and the output is the summarized information data.
[0849] Step 3:
[0850] The server converts summarized information into audio format. The input is summarized text information, and the output is audio data. At this time, an emotion analysis device is used to analyze the user's emotional state, and the tone of the audio data is adjusted based on the results. The results of the emotion analysis are used as parameters for adjusting the audio tone.
[0851] Step 4:
[0852] The server sends the adjusted audio data to the user terminal. The input is tone-adjusted audio data, and the output is the transmission of data to the user terminal.
[0853] Step 5:
[0854] The user's device plays the received audio data at the specified playback time. The audio data being played may include background sounds and jingles to enhance the user's listening experience.
[0855] Step 6:
[0856] After listening, users send feedback to the server regarding their emotional state and satisfaction with the information. The server receives this feedback and uses machine learning algorithms to optimize information delivery. The feedback is used as data to improve future information delivery.
[0857] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0858] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
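The input and output relationship described for the data generation model 58 can be expressed as a small interface. The following sketch is an assumption made for illustration: the names `InferenceRequest`, `DataGenerationModel`, and `EchoModel` do not appear in this disclosure, and `EchoModel` is a stand-in that performs no real inference.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class InferenceRequest:
    prompt: str                    # instruction for the model
    text: Optional[str] = None     # text data, if any
    audio: Optional[bytes] = None  # audio data, if any
    image: Optional[bytes] = None  # image data, if any


class DataGenerationModel(Protocol):
    def infer(self, request: InferenceRequest) -> str:
        """Return the inference result as text."""
        ...


class EchoModel:
    """Stand-in used only to exercise the interface; it performs no real inference."""

    def infer(self, request: InferenceRequest) -> str:
        return f"[{request.prompt}] {request.text or ''}".strip()


def run(model: DataGenerationModel, request: InferenceRequest) -> str:
    """Call any object that satisfies the DataGenerationModel protocol."""
    return model.infer(request)
```

Keeping the model behind such an interface would let the same processing run whether the model resides in the data processing device 12 or in an external device.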
[0859] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0860] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0861] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0862] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0863] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion lies, the more visible (expressed in actions) it becomes.
[0864] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
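The idea that pleasure and discomfort follow from how far balances such as posture and battery level deviate from their ideal values can be sketched numerically. The scoring rule below (a linear score from 1 at the ideal to -1 at twice an assumed tolerance), the quantity names, and the function name `valence` are illustrative assumptions only; the disclosure does not define a formula.

```python
def valence(current: dict[str, float], ideal: dict[str, float], tolerance: dict[str, float]) -> float:
    """Return a score in [-1, 1]: positive near the ideal balance, negative far from it."""
    scores = []
    for name, target in ideal.items():
        deviation = abs(current[name] - target) / tolerance[name]
        scores.append(1.0 - min(deviation, 2.0))  # 1 at ideal, -1 at 2x tolerance or more
    return sum(scores) / len(scores)
```

Under this sketch, a robot with a full battery and an upright posture would score close to 1 (pleasure), while a depleted battery would pull the score toward discomfort.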
[0865] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0866] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
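Once the neural network has produced an emotion value for each emotion on the map, determining the user's emotion can be as simple as taking the largest value. The sketch below assumes the values arrive as a dictionary keyed by emotion name; the numeric values in the usage note, the threshold, and the function names are hypothetical.

```python
def dominant_emotion(emotion_values: dict[str, float]) -> tuple[str, float]:
    """Return the emotion with the highest emotion value and that value."""
    if not emotion_values:
        raise ValueError("emotion_values must not be empty")
    name = max(emotion_values, key=emotion_values.get)
    return name, emotion_values[name]


def similar_emotions(emotion_values: dict[str, float], threshold: float = 0.1) -> list[str]:
    """Emotions whose values lie within threshold of the dominant emotion's value."""
    _, top_value = dominant_emotion(emotion_values)
    return sorted(n for n, v in emotion_values.items() if top_value - v <= threshold)
```

For values such as `{"reassured": 0.82, "calm": 0.80, "confident": 0.78, "anxious": 0.10}`, the dominant emotion is "reassured", and "calm" and "confident" fall within a small threshold of it, which mirrors the situation in which neighboring emotions have similar values.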
[0867] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0868] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0869] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0870] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0871] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0872] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0873] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0874] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0875] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0876] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0877] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0878] The following is further disclosed regarding the embodiments described above.
[0879] (Claim 1)
[0880] A means of receiving information genres of interest and desired playback time from users,
[0881] A means of collecting information from databases and internet resources based on the received information genre,
[0882] A means of analyzing and summarizing the collected information,
[0883] A means of converting summarized information into audio data,
[0884] A means of sending the created audio data to the user's terminal at a set time,
[0885] A means for playing back transmitted audio data,
[0886] A means of applying machine learning to receive user feedback and optimize information provision,
[0887] A system comprising the above means.
[0888] (Claim 2)
[0889] The system according to claim 1, further comprising means for automatically adjusting the selection of information to be collected using the user's past preference history.
[0890] (Claim 3)
[0891] The system according to claim 1, comprising means for adding background sounds and jingles to audio data.
[0892] "Example 1"
[0893] (Claim 1)
[0894] A means of receiving information areas of interest and desired playback time from the user,
[0895] A means for collecting information from data stores and communication networks based on the received information field,
[0896] A means of using natural language processing techniques to process and summarize the collected information,
[0897] A means of converting summarized information into audio data using speech synthesis technology,
[0898] A means of adding background sounds and sound effects to audio data,
[0899] A means of sending the created audio data to the user's terminal at a set time,
[0900] A means for playing back transmitted audio data,
[0901] A means of applying an automated learning algorithm to receive user feedback and optimize information provision,
[0902] A system comprising the above means.
[0903] (Claim 2)
[0904] The system according to claim 1, further comprising means for automatically adjusting the selection of information to be collected using the user's past preference history.
[0905] (Claim 3)
[0906] The system according to claim 1, comprising means for applying specific prompt sentences for summarizing information using a generative AI model.
[0907] "Application Example 1"
[0908] (Claim 1)
[0909] A means of receiving information genres of interest and desired playback time from users,
[0910] A means of collecting information from information sources based on the genre of information received,
[0911] A means of analyzing and summarizing the collected information,
[0912] A means of converting summarized information into audio data,
[0913] A means of adding background sound to the created audio data and sending it to the user's terminal at a set time,
[0914] A means for playing back transmitted audio data,
[0915] A means of receiving user feedback and applying a learning algorithm to optimize information provision,
[0916] A means of delivering information in multiple media formats based on input of interest genres and desired playback time,
[0917] A system that includes this.
[0918] (Claim 2)
[0919] The system according to claim 1, further comprising means for adjusting the selection of information to be collected using the user's past preference history.
[0920] (Claim 3)
[0921] The system according to claim 1, comprising means for adding background sounds and sound effects to audio data.
[0922] "Example 2 of combining an emotion engine"
[0923] (Claim 1)
[0924] A means of receiving information genres and desired playback times from users,
[0925] A means of collecting information from information resources based on the genre of information received,
[0926] A means of analyzing and summarizing collected information using natural language processing technology,
[0927] A means of generating audio data using an AI model based on summarized information and the user's emotional state,
[0928] A means of adjusting the generated audio data and adding background sounds and jingles,
[0929] A means for transmitting the created audio data to a communication device at a set time,
[0930] A means for playing back transmitted audio data,
[0931] A means of using an emotion engine to recognize the user's emotional state and adjust the audio data,
[0932] A means of applying machine learning to receive user feedback and optimize information provision,
[0933] A system that includes this.
[0934] (Claim 2)
[0935] The system according to claim 1, comprising means for automatically adjusting the selection of information to be collected using the user's past preference history and sentiment history.
[0936] (Claim 3)
[0937] The system according to claim 1, further comprising means for adjusting the tone of the generated audio data to match the user's emotional state.
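The claim set above adjusts the tone of the generated audio data to match the user's emotional state. One way this could be sketched, assuming the emotion engine outputs a simple text label, is a lookup from that label to voice parameters. The emotion labels, the `VoiceTone` fields, and the numeric values below are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceTone:
    rate: float         # speaking-rate multiplier (1.0 = normal speed)
    pitch_shift: float  # pitch offset in semitones
    background: str     # name of a background-sound style


# Hypothetical mapping from recognized emotion label to voice settings.
TONE_BY_EMOTION = {
    "tired": VoiceTone(rate=0.9, pitch_shift=-1.0, background="calm"),
    "stressed": VoiceTone(rate=0.95, pitch_shift=-0.5, background="ambient"),
    "cheerful": VoiceTone(rate=1.05, pitch_shift=1.0, background="upbeat"),
}

# Used when the emotion label is not recognized.
DEFAULT_TONE = VoiceTone(rate=1.0, pitch_shift=0.0, background="neutral")


def select_tone(emotion: str) -> VoiceTone:
    """Return voice settings for an emotion label, falling back to a neutral tone."""
    return TONE_BY_EMOTION.get(emotion.strip().lower(), DEFAULT_TONE)
```

The selected values would then be passed to whatever speech synthesis component is used; that step is omitted because the disclosure does not name one.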
[0938] "Application Example 2 when combined with an emotion engine"
[0939] (Claim 1)
[0940] A means of receiving the type of information of interest from the user and the desired playback time,
[0941] A means for collecting information from data storage devices and network resources based on the type of information received,
[0942] A means of analyzing and summarizing the collected information,
[0943] A means of converting summarized information into audio format,
[0944] A means for transmitting the created audio format to the user's device at a set time,
[0945] A means of playing back the transmitted audio format,
[0946] A means, including an emotion analysis device, of recognizing the user's emotional state and adjusting the tone of the audio format,
[0947] A means of applying automated learning to receive user responses and optimize information provision,
[0948] A system that includes this.
[0949] (Claim 2)
[0950] The system according to claim 1, further comprising means for automatically adjusting the selection of information to be collected using the user's past preference history and emotional history.
[0951] (Claim 3)
[0952] The system according to claim 1, comprising means for adding background sounds and jingles to the audio format.
[Explanation of symbols]
[0953]
10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type Terminal
414 Robots
Claims
1. A means of receiving information genres of interest and desired playback time from users,
A means of collecting information from databases and internet resources based on the received information genre,
A means of analyzing and summarizing the collected information,
A means of converting summarized information into audio data,
A means of sending the created audio data to the user's terminal at a set time,
A means for playing back transmitted audio data,
A means of applying machine learning to receive user feedback and optimize information provision,
A system that includes this.
2. The system according to claim 1, further comprising means for automatically adjusting the selection of information to be collected using the user's past preference history.
3. The system according to claim 1, comprising means for adding background sounds and jingles to audio data.
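The claims recite receiving user feedback and using the user's past preference history to adjust which information is collected, without specifying a learning method. As a minimal sketch only, the following Python example uses a simple online update of per-genre weights as a stand-in for the machine learning step. The feedback scale of -1 to 1, the default weight of 0.5, the learning rate, and the function name are assumptions.

```python
def update_genre_weights(
    weights: dict[str, float],
    genre: str,
    feedback: float,
    learning_rate: float = 0.2,
) -> dict[str, float]:
    """Move a genre's weight toward the user's feedback and return a new mapping.

    feedback is assumed to lie in [-1, 1]: -1 for disliked, 1 for liked.
    Weights lie in [0, 1]; unseen genres start at an assumed default of 0.5.
    """
    if not -1.0 <= feedback <= 1.0:
        raise ValueError("feedback must be between -1 and 1")
    updated = dict(weights)  # copy so the stored preference history is not mutated
    current = updated.get(genre, 0.5)
    target = (feedback + 1.0) / 2.0  # rescale feedback to the [0, 1] weight range
    updated[genre] = current + learning_rate * (target - current)
    return updated
```

Genres with higher weights could then be given priority when information is collected for the next delivery.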