system

A system using natural language processing and AI generates personalized music and albums based on user emotions and images, addressing the need for easy music recording and preservation of personal experiences.

JP2026100737A · Pending · Publication Date: 2026-06-19 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-09
Publication Date: 2026-06-19

Application Information

IPC: G06F3/01; G10L13/02; G10L21/028; G10L19/02; G10L19/00; G10L13/04; G10L25/48; G10L13/08; G10L13/00; G10L13/06; G10L25/00; G10L19/16

AI Tagging

Application Domain

Input/output for user-computer interaction; Graph reading

AI Technical Summary

Technical Problem

There is a lack of systems that allow ordinary users to easily record their daily feelings and special moments in the form of personalized music without requiring specialized music production knowledge.

Method used

A system that uses natural language processing and generative AI to identify emotion categories from user input, generate lyrics and music, and integrate them with image information to create personalized digital albums.

Benefits of technology

Enables users to easily record and preserve important moments through music, providing a means to emotionally and visually capture their experiences.

✦ Generated by Eureka AI based on patent content.

Abstract

[Problem] To provide a system. [Solution] A system including: means for receiving emotional information; means for receiving image information; means for analyzing the emotional information and identifying an emotion category; means for generating lyrics based on the emotion category; means for generating music based on the emotion category; means for integrating the generated lyrics and music to create an album; and means for incorporating the image information into the album.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance as a response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Ordinary users lack a means to easily record their daily feelings and special moments and to save them comprehensively in the form of music. Conventionally, no system has been able to easily generate personalized music for individuals without requiring specialized music production knowledge and skills. There is therefore a demand for a system that enables users to easily record their feelings as music.

Means for Solving the Problems

[0005] This invention provides a system that identifies emotion categories using natural language processing technology based on emotion and image information, and automatically generates lyrics and music based on those categories. Specifically, it includes means for analyzing emotion information input by the user to identify emotion categories and generating lyrics and music suitable for those emotion categories using generative AI technology. Furthermore, by integrating the generated lyrics and music and creating an album that includes image information provided by the user, it enables easy recording and playback of a personalized music history. In this way, it makes it easy to record and save individual emotions and special moments together with music.

[0006] "Emotional information" refers to data entered by the user that indicates their emotions and mood at any given time.

[0007] "Image information" refers to photographic data provided by the user to the system for visually recording specific moments or situations.

[0008] An "emotion category" is a category that represents a specific emotional state, classified based on the analyzed emotional information.

[0009] "Means for generating lyrics" refers to an algorithm or process that automatically generates lyrics matching the music by selecting appropriate words based on the emotion category.

[0010] "Means for generating music" refers to a method for automatically generating melodies and rhythms based on the emotion category and its associated musical genre.

[0011] "Means for creating an album" refers to the process of integrating the generated lyrics and music and bringing them together, along with the provided image information, as a single package.

Brief Description of the Drawings

[0012]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Modes for Carrying Out the Invention

[0013] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.

[0014] First, the terms used in the following description will be explained.

[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0016] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which the processor uses as a work memory.

[0017] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0018] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" may mean A alone, B alone, or a combination of A and B. The same interpretation applies in this specification when three or more items are linked by "and/or."

[0020] [First Embodiment]

[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0029] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0033] The system of the present invention analyzes the emotions a user enters, together with images the user supplies as a visual record, and generates a personalized digital album with music. The user accesses the system via a terminal and inputs the emotions and photos they want to record. This allows unique musical works to be created from the user's memories.

[0034] When the user enters this information, the terminal sends the data to the server. The server receives it and identifies the emotion category by analyzing the text using natural language processing. Once the emotion category is identified, the server uses generative AI to generate lyrics relevant to that category. In parallel, the server uses music generation technology to generate music corresponding to the identified emotion category, taking into account the characteristics of the chosen music genre.

[0035] Once the generated lyrics and music are complete, the server integrates them and combines them with image information to create a digital album. The final album is sent to the user's device, where the user can view, save, or share it with others. The album created in this process will have high emotional value for the user, providing a means to record and preserve important moments in life through music.

[0036] For example, if a user wants to record a memorable moment at their graduation ceremony, they would input the emotion as "memorable" and upload a graduation photo to the system. In this case, the server would generate lyrics based on the emotion category "memorable," select a classical music genre, and create music. Afterwards, the user would be provided with a completed album containing their photo. In this way, users can record and reminisce about important moments through music.
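
The genre choice in this example can be pictured as a simple lookup from emotion category to genre. The following is a minimal sketch, not the patent's method: only the pairing of "memorable" with classical music comes from the text above, and the table name, the other entries, the default genre, and the function name are hypothetical.

```python
# Hypothetical lookup table. Only "memorable" -> "classical" is taken from
# the graduation example; every other entry is an illustrative assumption.
GENRE_BY_EMOTION = {
    "memorable": "classical",
    "fun": "pop",          # assumption
    "calm": "ambient",     # assumption
}

DEFAULT_GENRE = "ambient"  # assumption: fallback for unknown categories


def select_genre(emotion_category: str) -> str:
    """Return the genre associated with an emotion category."""
    key = emotion_category.strip().lower()
    return GENRE_BY_EMOTION.get(key, DEFAULT_GENRE)


if __name__ == "__main__":
    print(select_genre("memorable"))
```

A table like this keeps the genre decision separate from the music generator itself, so the mapping could be tuned without touching the generation step.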

[0037] The following describes the processing flow.

[0038] Step 1:

[0039] The user enters an emotion and related photos through a dedicated interface on the terminal. The terminal receives this input and sends it to the server as formatted data.

[0040] Step 2:

[0041] The server receives emotion and image information transmitted from the terminal. It performs validation to ensure the data is correctly interpretable, particularly verifying that the emotion information is in the expected format.

[0042] Step 3:

[0043] The server analyzes the received emotion information using natural language processing (NLP) techniques. It identifies the main emotional themes in the input text and assigns the corresponding emotion category.

[0044] Step 4:

[0045] The server utilizes a generative AI model based on identified emotion categories to automatically generate lyrics that match the user's emotions. This process pays attention to the nuances and message of the words.

[0046] Step 5:

[0047] The server determines the emotion category and associated music genre, and then executes a music generation algorithm based on that. This creates an original melody and accompaniment that harmonizes with the emotion.

[0048] Step 6:

[0049] The server integrates the generated lyrics and music, and further adds user-provided image information as visual elements of the album. This creates a digital album that appeals to the user both visually and aurally.

[0050] Step 7:

[0051] The server sends the completed album to the device. The device displays the received album data so that the user can view it and supports actions such as saving and sharing.
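
Step 2 above says the server validates the received data but does not say which checks it runs. The sketch below shows one way such validation might look. The `AlbumRequest` type, the allowed image types, and the length limit are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed limits; the source does not specify any of these values.
ALLOWED_IMAGE_TYPES = {"image/jpeg", "image/png"}
MAX_EMOTION_CHARS = 500


@dataclass(frozen=True)
class AlbumRequest:
    """Hypothetical payload sent from the terminal to the server."""
    emotion_text: str
    image_bytes: bytes
    image_mime: str


def validate_request(req: AlbumRequest) -> list[str]:
    """Return a list of problems; an empty list means the request is usable."""
    problems = []
    text = req.emotion_text.strip()
    if not text:
        problems.append("emotion text is empty")
    elif len(text) > MAX_EMOTION_CHARS:
        problems.append("emotion text is too long")
    if not req.image_bytes:
        problems.append("image is empty")
    if req.image_mime not in ALLOWED_IMAGE_TYPES:
        problems.append(f"unsupported image type: {req.image_mime}")
    return problems


if __name__ == "__main__":
    ok = AlbumRequest("memorable", b"\xff\xd8\xff", "image/jpeg")
    print(validate_request(ok))
```

Returning all problems at once, rather than stopping at the first, lets the terminal show the user everything that needs fixing in one round trip.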

[0052] (Example 1)

[0053] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0054] In today's information society, there is a growing need to express and preserve personal memories and emotions in a richer way. However, there is a lack of readily available systems that allow users to easily generate content that reflects their emotions at any given time and save it in a personalized manner. Therefore, there is a need for means to express users' own emotions and memories in an integrated visual and auditory way.

[0055] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0056] In this invention, the server includes a device for inputting emotion data, a device for inputting image data, and a processing device for analyzing the emotion data and identifying emotion categories. This makes it possible to generate content based on the individual emotions of the user.

[0057] A "device for inputting emotional data" is a device that provides an interface for users to input their own emotions in text or voice.

[0058] A "device for inputting image data" is a device that has an interface for users to upload images and photographs.

[0059] A "processing device that analyzes emotional data and identifies emotional categories" is a device that analyzes input emotional data using natural language processing technology and other methods to determine a specific emotional category.

[0060] A "processing device that generates lyrics using generative AI technology" is a device that generates prompts based on specific emotional categories, inputs them into a generative AI model, and automatically generates lyrics.

[0061] A "processing device that generates music using a music generation algorithm" is a device that has the function of selecting a suitable musical style from a specific emotional category and creating music based on that style.

[0062] A "processing device that integrates generated lyrics and music to create a digital album" is a device that combines generated lyrics and music, and further integrates them with image data to create a digital album.

[0063] A "processing device for incorporating image data into a digital album" is a device that integrates image data provided by the user along with generated music and lyrics, and incorporates visual elements into the completed digital content.

[0064] This invention is a system that generates personalized digital albums based on the user's emotions and images. First, the user inputs their emotions as text and uploads images to the system via a terminal. The terminal sends the input data to a server, where the data is analyzed and processed.

[0065] The server analyzes the received emotional data using natural language processing libraries (such as NLTK and spaCy) to identify a specific emotion category. Based on the analysis result, it generates lyrics using generative AI. Specifically, the server composes a prompt as input to the generative AI model, instructing it, for example, "Please generate lyrics that reflect deeply moving emotions."
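
Prompt construction of this kind can be sketched as a small template function. The function name and the optional `user_text` parameter are hypothetical; the base sentence is the example prompt quoted in the paragraph above. The call to the generative AI model is deliberately left out, since the source does not specify it here.

```python
def build_lyrics_prompt(emotion_descriptor: str, user_text: str = "") -> str:
    """Build a lyric-generation prompt from an emotion descriptor.

    `user_text` is an optional, hypothetical extra: the user's own words,
    appended so the model has more context than the category alone.
    """
    prompt = f"Please generate lyrics that reflect {emotion_descriptor} emotions."
    detail = user_text.strip()
    if detail:
        prompt += f" The user described the moment as follows: {detail}"
    return prompt


if __name__ == "__main__":
    print(build_lyrics_prompt("deeply moving"))
```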

[0066] Next, the server uses a music generation algorithm to create music corresponding to the identified emotion category. For example, it might select a classical music genre to match a "deeply moving" emotion and then generate a melody using a music generation tool. An example of such a tool is music generation software.

[0067] Furthermore, the server integrates the generated lyrics and music, and combines them with images uploaded by the user. This creates a digital album in which visual information, music, and lyrics are unified. This album is rendered using a multimedia editing library such as FFmpeg.
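
The source names FFmpeg but gives no command line. As one possible illustration, the sketch below builds an `ffmpeg` invocation that loops a single still image over an audio track to produce a video file. The function name and file names are hypothetical, the codec choices are assumptions, and embedding the lyrics is left out. The command is only constructed here, not executed.

```python
import shlex


def build_album_command(image_path: str, audio_path: str, output_path: str) -> list[str]:
    """Build an ffmpeg argument list: one still image shown for the whole song."""
    return [
        "ffmpeg",
        "-loop", "1",           # repeat the still image indefinitely (input option)
        "-i", image_path,       # input 0: the user's photo
        "-i", audio_path,       # input 1: the generated music
        "-c:v", "libx264",      # assumed video codec
        "-tune", "stillimage",  # x264 tuning for static content
        "-c:a", "aac",          # assumed audio codec
        "-pix_fmt", "yuv420p",  # widely compatible pixel format
        "-shortest",            # stop when the audio (the finite input) ends
        output_path,
    ]


if __name__ == "__main__":
    cmd = build_album_command("photo.jpg", "song.mp3", "album.mp4")
    # Print a shell-safe version; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when user-supplied file names contain spaces.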

[0068] The completed digital album is sent from the server to the user's device, where the user can view, save, and even share it with other users. For example, if a user wants to record a "memorable" moment at their graduation ceremony, an album will be generated that combines lyrics based on those emotions with classical music. This system allows users to emotionally record important moments and easily relive them.

[0069] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0070] Step 1:

[0071] The user inputs emotion data and image data through their device. Specifically, they input emotions in text format and upload image files. The system receives text data of emotions and image files as input.

[0072] Step 2:

[0073] The terminal sends the input emotion data and image data to the server. The data is transmitted in a secure format using the HTTP protocol. The output of this step is the emotion text and image data received by the server.

[0074] Step 3:

[0075] The server analyzes the received emotion data using natural language processing (NLP) techniques to identify the emotion category. This process uses an NLP library, such as NLTK or spaCy, to analyze the input text. The output of this analysis is the identified emotion category.

[0076] Step 4:

[0077] The server generates lyrics using a generative AI model based on the identified emotion categories. To do this, it creates a prompt and sends it to the model. An example of a prompt is "Please generate lyrics that express a deeply moving emotion." The output of the AI processing is the lyrics generated based on the emotion.

[0078] Step 5:

[0079] The server uses a music generation algorithm to produce music corresponding to the emotion category. For example, it might generate classical music for a "deeply moving" emotion. This step uses a music generation tool, and the output is the generated music data.

[0080] Step 6:

[0081] The server integrates the generated lyrics and music and combines them with image data. This integration process utilizes multimedia editing libraries such as FFmpeg. The output is a completed digital album.

[0082] Step 7:

[0083] The server sends the completed digital album to the device. The user can view, save, and share the album with other users via the device. The output is the completed album data accessible to the user.
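
Step 3 of this flow names NLTK or spaCy but does not describe the classification logic. As a stand-in, the sketch below uses a plain keyword lexicon with the Python standard library only. It is not the method in the source, and the lexicon, category names, and default value are all invented for illustration.

```python
import re

# Hypothetical keyword lexicon; a real system would use an NLP library
# or a trained classifier instead of this hand-written table.
EMOTION_KEYWORDS = {
    "fun": {"fun", "enjoyed", "laughed", "exciting"},
    "memorable": {"memorable", "unforgettable", "graduation", "proud"},
    "sad": {"sad", "lonely", "miss", "cried"},
}


def identify_emotion_category(text: str, default: str = "neutral") -> str:
    """Pick the category whose keywords appear most often in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = {
        category: sum(1 for token in tokens if token in keywords)
        for category, keywords in EMOTION_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default when no keyword matched at all.
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    print(identify_emotion_category("It was so much fun, we laughed all day"))
```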

[0084] (Application Example 1)

[0085] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0086] A challenge lies in the lack of means to emotionally and visually record important family and personal moments and memories. Furthermore, there is a need to ensure that recorded digital content is preserved not merely as data, but in a way that imbues it with emotional value.

[0087] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0088] In this invention, the server includes means for receiving emotional information, means for receiving image information, and means for converting speech to text. This enables the automatic generation of digital albums with music based on the user's emotions.

[0089] "Means for receiving emotional information" refers to a device or method for electronically receiving emotional data entered by a user.

[0090] "Means for receiving image information" refers to a device or method for receiving and recording image data provided by a user.

[0091] "Means for converting speech to text" refers to a technology or device that analyzes speech data and converts it into corresponding text data.

[0092] "Means for analyzing emotional information and identifying emotional categories" refers to a technology or process for analyzing input emotional data and classifying it into a specific emotional category.

[0093] "Means for generating lyrics" refers to a technology or device for automatically producing appropriate lyrics based on a specific emotional category.

[0094] "Means for generating music" refers to technologies or devices for composing or synthesizing music that corresponds to emotional categories.

[0095] "Means for integrating generated lyrics and music to create an album" refers to a technology or method for combining created lyrics and music and structuring them in an album format along with visual elements.

[0096] "Means for incorporating image information into an album" refers to a method or apparatus for incorporating user-provided images into a digital album to create a completed album.

[0097] To implement this invention, the user begins by inputting emotional and image information into the system via a terminal. The terminal receives voice input and utilizes speech recognition technology to convert the speech into text. This technology utilizes the Google® Cloud Speech-to-Text API.

[0098] Data sent from the terminal is aggregated on the server. The server analyzes the received text data using natural language processing technology and identifies the emotion category. Python and the natural language processing library NLTK are used for the analysis. Once the emotion category is identified, the server automatically generates the corresponding lyrics using a generative AI model. OpenAI's GPT model and others are used for lyric generation.

[0099] Next, the server generates appropriate music using music generation technology. Music generation libraries such as Magenta are useful for genre selection. This creates music that matches the specified emotional category.

[0100] The generated lyrics and music are integrated on a server, and combined with image information to complete the digital album. Media manipulation libraries such as FFmpeg are used for this integration. Finally, the album is sent to the user's device, where it can be viewed or shared.

[0101] For example, if a user inputs something like, "I want to record the fun memories of the family picnic I had today," the audio is converted to text, and the emotion category "fun" is identified. Based on this, cheerful and upbeat lyrics and music are generated, and a digital album is created along with photos. A prompt like the following might be used: "I had a picnic with my family today. It was so much fun. I want to record those memories. The emotion category is 'fun,' and I want lyrics and music that convey the feeling of the picnic."

[0102] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0103] Step 1:

[0104] The device receives voice input from the user. The user provides image information they want to record along with a specific emotion. The voice data is converted into text data using the Google Cloud Speech-to-Text API. The input is voice data, and the output is text data.

[0105] Step 2:

[0106] The text data is sent to the server, which analyzes it using natural language processing techniques. Python and the NLTK library are used to identify the emotion category. The input is text data, and the output is the emotion category.

[0107] Step 3:

[0108] The server generates relevant lyrics based on emotion categories using an OpenAI GPT model. The input is an emotion category, and the output is the generated lyrics. In this process, the generative AI model understands the emotion category used as a prompt and creates appropriate lyrics.

[0109] Step 4:

[0110] The server generates music based on emotion categories. It uses music generation libraries such as Magenta to compose music that matches the emotion. The input is the emotion category, and the output is the generated music.

[0111] Step 5:

[0112] The generated lyrics and music are integrated on the server. Image information is also incorporated as part of the album. Media manipulation libraries such as FFmpeg are used to integrate them into a slideshow format. The input consists of lyrics, music, and image information, and the output is a digital album.

[0113] Step 6:

[0114] The server sends the completed digital album to the user's device. The user can view and save the digital album on their device. The input is the digital album, and the output is the album displayed on the user's device.
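
Because each step above is described by its input and output, the flow can be sketched as a chain of pluggable stages. In the sketch below, the `Album` and `AlbumPipeline` names and all field names are hypothetical, and each stage is an injected callable, so the real services named in the text (speech recognition, NLTK, a GPT model, Magenta) are assumed to sit behind these callables and are not called here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Album:
    """Hypothetical container for the finished digital album."""
    emotion_category: str
    lyrics: str
    music: bytes
    images: tuple[bytes, ...]


@dataclass(frozen=True)
class AlbumPipeline:
    """Chains the stages; each callable stands in for an external service."""
    transcribe: Callable[[bytes], str]    # Step 1: voice data -> text
    classify: Callable[[str], str]        # Step 2: text -> emotion category
    write_lyrics: Callable[[str], str]    # Step 3: category -> lyrics
    compose: Callable[[str], bytes]       # Step 4: category -> music data

    def run(self, voice: bytes, images: list[bytes]) -> Album:
        text = self.transcribe(voice)
        category = self.classify(text)
        lyrics = self.write_lyrics(category)
        music = self.compose(category)
        # Step 5: bundle lyrics, music, and images into one album object.
        return Album(category, lyrics, music, tuple(images))


if __name__ == "__main__":
    # Stub stages for demonstration only.
    pipeline = AlbumPipeline(
        transcribe=lambda audio: audio.decode("utf-8"),
        classify=lambda text: "fun" if "fun" in text else "neutral",
        write_lyrics=lambda category: f"A song about a {category} day",
        compose=lambda category: category.encode("utf-8"),
    )
    print(pipeline.run(b"It was so much fun", [b"photo-bytes"]))
```

Injecting the stages keeps the flow testable with stubs and lets any single service be swapped without changing the others.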

[0115] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0116] The present invention aims to recognize a user's emotions in more detail and generate music that aligns with those emotions. To this end, the system incorporates an emotion engine that can recognize emotions by analyzing not only the emotional information input by the user, but also the user's facial expressions, voice tone, and biometric information. Because the emotion engine draws on multiple input modalities, it can recognize emotions more accurately and reliably.

[0117] Users access the system using their devices and input their emotions and photos. Furthermore, the emotion engine uses the camera and microphone to capture and analyze the user's facial expressions and voice in real time. The device then sends this data to a server. The server aggregates and processes the transmitted data, using natural language processing techniques and machine learning algorithms to identify emotion categories. This process makes it possible to create music that takes into account not only the user's temporary emotions but also emotional changes and complex emotional patterns.

[0118] The server uses a generative AI to create appropriate lyrics based on the emotion category identified by the emotion engine. The music generation process considers the emotion category and corresponding music genre to generate a melody and rhythm that best fits the current emotion. The generated content is integrated into a single album along with image information provided by the user.

[0119] For example, if a user feels happy at a sporting event, the emotion engine analyzes not just the emotion of "happiness," but also the smile on their face and the excited tone of their voice, which are expressed in real time. As a result, the server generates a music album containing lively pop music and lyrics like an anthem, and provides it to the user's device. In this way, it is possible to create a music album that is more deeply intertwined with the user's experience and emotions.
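
The combination of text, facial expression, and voice tone described above can be illustrated with a simple late-fusion step. The following is a minimal sketch and not the method of the disclosure: the weights in `MODALITY_WEIGHTS` and the function name `fuse_emotion_scores` are hypothetical, and it assumes each modality has already been reduced to per-category scores between 0 and 1.

```python
from typing import Dict

# Hypothetical per-modality weights; the source does not specify any weighting.
MODALITY_WEIGHTS = {"text": 0.5, "face": 0.3, "voice": 0.2}


def fuse_emotion_scores(scores_by_modality: Dict[str, Dict[str, float]]) -> str:
    """Return the emotion category with the highest weighted score sum."""
    totals: Dict[str, float] = {}
    weight_sum = 0.0
    for modality, scores in scores_by_modality.items():
        # Modalities without a configured weight are ignored.
        weight = MODALITY_WEIGHTS.get(modality, 0.0)
        weight_sum += weight
        for category, score in scores.items():
            totals[category] = totals.get(category, 0.0) + weight * score
    if not totals or weight_sum == 0.0:
        raise ValueError("no usable emotion scores were provided")
    return max(totals, key=totals.get)


# Illustrative scores for a user at a sporting event (made-up values).
observed = {
    "text": {"joy": 0.7, "excitement": 0.3},
    "face": {"joy": 0.6, "excitement": 0.4},
    "voice": {"joy": 0.2, "excitement": 0.8},
}
overall_category = fuse_emotion_scores(observed)
```

A weighted sum keeps each modality's contribution easy to inspect; a trained classifier over the combined features would be an equally plausible alternative.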

[0120] This invention enhances the user experience and enables music generation that closely responds to individual emotions. This system enriches everyday emotions and allows for more immersive memory recording.

[0121] The following describes the processing flow.

[0122] Step 1:

[0123] The user accesses the system using a device. The user inputs their emotions as text together with photos related to the situation at the time, and activates the camera and microphone as needed. The device sends the input text data and photo data to the server.

[0124] Step 2:

[0125] With the user's permission, the device captures the user's facial expressions and voice tone in real time via the camera and microphone. This data is processed by an emotion engine for emotion recognition and then sent to a server.

[0126] Step 3:

[0127] The server analyzes incoming text, photos, and real-time data from the emotion engine. It uses natural language processing techniques to analyze text data and identify emotion categories. It also extracts additional emotion information from facial expression data and voice tone to determine an overall emotion category.

[0128] Step 4:

[0129] The server uses AI to automatically generate appropriate lyrics based on identified emotion categories. At this stage, appropriate words that fit the emotional tone are selected.

[0130] Step 5:

[0131] The server takes into account the emotion category and selected music genre, and uses music generation AI to create original melodies and accompaniments. The generated melodies are designed to best harmonize with the user's emotions.

[0132] Step 6:

[0133] The server integrates the generated lyrics and music, and creates a visual album that includes photos submitted by the user. This results in digital content that integrates auditory and visual elements.

[0134] Step 7:

[0135] The server sends the completed album to the user's device. The device then displays this digital album so that the user can view, save, or share it with other users.
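
The album produced in Steps 6 and 7 can be pictured as a small record that ties the lyrics, the music file, and the photos together. The sketch below is an assumption made for illustration: the `DigitalAlbum` fields and the JSON manifest layout are hypothetical and are not defined in the source.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DigitalAlbum:
    """Hypothetical container for one generated album."""
    emotion_category: str
    lyrics: str
    music_file: str
    image_files: List[str] = field(default_factory=list)


def to_manifest(album: DigitalAlbum) -> str:
    """Serialize the album to a JSON manifest a device could display or save."""
    return json.dumps(asdict(album), ensure_ascii=False, indent=2)


album = DigitalAlbum(
    emotion_category="happiness",
    lyrics="We cheered until the final whistle",
    music_file="track01.mp3",
    image_files=["stadium.jpg"],
)
manifest = to_manifest(album)
```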

[0136] (Example 2)

[0137] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0138] Traditional music generation systems evaluate user emotions based on simple text input, making it difficult to generate personalized music that reflects subtle nuances and changes in those emotions. This limits the user experience and fails to provide a musical environment that fully resonates with their feelings.

[0139] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0140] In this invention, the server includes means for receiving emotional information, means for capturing facial expression data and recording voice tone, means for analyzing the emotional information, facial expression data, and voice tone to identify an emotional category, and means for generating lyrics based on the emotional category using a generative AI model. This makes it possible to recognize the user's diverse and subtle emotions with high accuracy and generate music and lyrics individually to match them.

[0141] "Emotional information" refers to input data in which a user expresses their internal emotional state in text or other forms.

[0142] "Image information" refers to images and photographic data related to the user, and is used as an element in interpreting emotions.

[0143] "Facial expression data" refers to data about the user's facial expressions captured through a camera.

[0144] "Voice tone" refers to data related to audio information obtained from the user's speaking style and voice quality.

[0145] An "emotional category" is a classification that indicates the emotional state of a user, identified from the analyzed emotional information.

[0146] A "generative AI model" is an artificial intelligence system used to generate music and lyrics based on emotional information.

[0147] "Musical style" refers to a specific musical genre or characteristic that corresponds to an emotional category.

[0148] This invention relates to a system that analyzes a user's emotions in a multidimensional manner and generates music based on that analysis. The user inputs their emotional information into the system through a terminal, and also provides facial expression data and voice tone using a camera and microphone. The terminal collects this data and transmits it to a server.

[0149] The server analyzes the received data using an emotion engine. Natural language processing techniques and machine learning algorithms are employed to identify emotion categories from the input emotion information, facial expression data, and voice tone. These emotion categories reflect specific emotional states (e.g., "joy," "excitement").

[0150] Next, the server uses a generative AI model to generate lyrics and music based on this emotion category. The generation prompt is an instruction to the AI such as, "Create a song that matches the user's emotions." The generative AI model considers the musical style corresponding to the emotion category and determines the melody and rhythm of the music.
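
A generation prompt of this kind can be assembled from the identified emotion category with a short template. This is a minimal sketch: the function name `build_lyrics_prompt` and the optional style sentence are hypothetical, while the base wording follows the example prompts given for this embodiment.

```python
def build_lyrics_prompt(emotion_category: str, musical_style: str = "") -> str:
    """Compose an instruction for a generative AI model (hypothetical template)."""
    category = emotion_category.strip()
    if not category:
        raise ValueError("emotion_category must not be empty")
    prompt = f"Create a song that matches the user's feelings of {category}."
    if musical_style:
        # The style sentence is an illustrative addition, not taken from the source.
        prompt += f" Use a {musical_style} style."
    return prompt


prompt_for_joy = build_lyrics_prompt("joy")
```

Keeping the template in one function makes it straightforward to review exactly what text is sent to the model for each emotion category.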

[0151] The generated music and lyrics are integrated and combined with user-provided image information to create a digital album. This album is delivered to the user's device, providing an emotionally rich experience through both sight and sound. For example, if a user feels "excitement" at a sporting event, the server creates an energetic pop song and anthem-like lyrics, which are then delivered as a music album.

[0152] This system generates personalized music content in real time that is fully responsive to the user's emotions, allowing them to enjoy a deeper emotional experience.

[0153] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0154] Step 1:

[0155] Users access the system using a device and input their emotions as text or upload photos. The device collects data by capturing the user's facial expressions with its camera and recording their voice tone with its microphone. This process yields emotional information, facial expression data, and voice data as input data. This input data is packaged and sent to the server.

[0156] Step 2:

[0157] The server receives emotional information, facial expression data, and voice data transmitted from the terminal. This data is analyzed by an emotion engine to comprehensively evaluate the user's emotions. This analysis uses natural language processing techniques to extract emotions from text and machine learning algorithms to identify emotion categories from facial expression and voice data. The analysis results output an emotion category (e.g., "joy," "excitement").

[0158] Step 3:

[0159] The server generates appropriate lyrics using a generative AI model based on the emotion categories identified through analysis. Here, a prompt sentence based on the emotion category (e.g., "Create a song that matches the user's feelings of joy") is supplied to the generative AI model. Based on this prompt, the AI model generates appropriate lyrics, and those lyrics are output.

[0160] Step 4:

[0161] The server generates music based on the same emotion category. The generative AI model links emotion categories with musical styles to create the most appropriate melody and rhythm. This process outputs a music file that matches the emotion.

[0162] Step 5:

[0163] The server integrates the generated lyrics and music, thereby forming the music content. Simultaneously, it integrates image information provided by the user to create a personalized digital album. This album data becomes the final output.

[0164] Step 6:

[0165] The server delivers a personalized digital album to the user's device. The user can then play this digital album and enjoy their emotional experience visually and aurally.
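
The link between emotion categories and musical styles used in Step 4 can be sketched as a lookup table. The table contents below are assumptions: only the pairings of excitement with pop, "memorable" with classical, and tiredness with relaxing music are suggested by the examples in this document, and the tempo values, the `MusicalStyle` type, and the default style are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MusicalStyle:
    genre: str
    tempo_bpm: int  # hypothetical tempo, not specified in the source


# Hypothetical mapping from emotion category to musical style.
STYLE_BY_EMOTION = {
    "joy": MusicalStyle("pop", 120),
    "excitement": MusicalStyle("pop", 140),
    "memorable": MusicalStyle("classical", 80),
    "tired": MusicalStyle("ambient", 60),
}
DEFAULT_STYLE = MusicalStyle("acoustic", 90)


def select_style(emotion_category: str) -> MusicalStyle:
    """Look up the style for a category, falling back to a default."""
    return STYLE_BY_EMOTION.get(emotion_category.strip().lower(), DEFAULT_STYLE)


style = select_style("Excitement")
```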

[0166] (Application Example 2)

[0167] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0168] This invention aims to improve the user experience by generating music based on the user's emotions. However, existing systems have difficulty precisely identifying the user's facial expressions and tone of voice and providing music that matches their emotions in real time. Furthermore, these systems lack means to deepen interaction with the user when the robot plays music.

[0169] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0170] In this invention, the server includes means for receiving emotional information, means for receiving image information, and means for receiving audio information. This makes it possible to recognize the user's complex emotions with high accuracy and generate music and lyrics that match those emotions. Furthermore, by having the robot generate actions in accordance with the generated music, it becomes possible to enrich the interaction with the user.

[0171] "Emotional information" refers to data about emotions obtained based on the user's facial expressions, tone of voice, and other biometric data.

[0172] "Image information" refers to digital image or video data that includes the user's visual characteristics and environmental information.

[0173] "Audio information" refers to audio data that includes the tone and rhythm of the user's voice.

[0174] An "emotion category" is a type of emotion classified based on analyzed emotional information.

[0175] "Means for generating lyrics" refers to techniques for constructing appropriate words and phrases based on emotion categories.

[0176] "Means for generating music" refers to techniques for creating melodies and rhythms that correspond to emotion categories.

[0177] "Means for creating an album" refers to the technology that integrates the generated lyrics and music into a single cohesive whole.

[0178] "Means for generating motion" refers to the technology for designing robot movements that are synchronized with music.

[0179] The system used to realize this application features the ability to recognize the user's emotions in detail and generate music accordingly. Users access the system via a smartphone or home robot terminal and input their emotions and image information. The terminal uses a camera and microphone to capture the user's facial expressions and voice tone in real time. This data is then transmitted from the terminal to a server.

[0180] The server identifies the emotion category from the received emotional information, image information, and audio information. Natural language processing technology and emotion analysis algorithms are used for the analysis. This analysis makes it possible to recognize the complex patterns of the user's emotions with high accuracy.

[0181] Furthermore, the server uses an AI model to generate appropriate lyrics based on the emotional category and then produces music. The music genre is automatically selected based on the emotional category. The generated lyrics and music are integrated into a single album and provided to the user along with image information.

[0182] In addition, the home robot generates movements that match the music being played. The robot can move in sync with the generated music to provide an immersive experience similar to a live performance.
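
One simple way to keep robot movements in time with the music is to schedule one movement per beat. The sketch below assumes the tempo of the generated track is known in beats per minute; the motion names in `MOTIONS` and the function `motion_schedule` are hypothetical and do not correspond to any robot API named in the source.

```python
from typing import List, Tuple

# Hypothetical motion primitives that alternate on successive beats.
MOTIONS = ("sway_left", "sway_right")


def motion_schedule(tempo_bpm: float, duration_s: float) -> List[Tuple[float, str]]:
    """Return (start time in seconds, motion name) pairs, one per beat."""
    if tempo_bpm <= 0 or duration_s < 0:
        raise ValueError("tempo must be positive and duration non-negative")
    beat_interval = 60.0 / tempo_bpm
    schedule: List[Tuple[float, str]] = []
    beat = 0
    while beat * beat_interval < duration_s:
        start = round(beat * beat_interval, 3)
        schedule.append((start, MOTIONS[beat % len(MOTIONS)]))
        beat += 1
    return schedule


# A slow, relaxing track at 60 beats per minute, three seconds long.
slow_schedule = motion_schedule(60, 3)
```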

[0183] As a concrete example, imagine a scenario where, upon a user returning from work, the system recognizes their tired expression, generates relaxing music and lyrics, and a household robot performs soothing movements in time with the music. In this way, the individual user experience can be improved.

[0184] An example of a prompt to input into a generative AI model is: "Build a neural network that generates optimal relaxation music based on the user's emotions. Take into account the user's facial expressions and tone of voice."

[0185] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0186] Step 1:

[0187] The device uses a camera and microphone to capture the user's facial expressions and voice. The input data consists of image and audio information, which is collected in real time. This allows the device to acquire biometric information such as the user's facial expressions and voice tone.

[0188] Step 2:

[0189] The device sends the acquired image and audio information to the server as a dataset. The input here is the unprocessed image and audio data sent from the device. The server receives this and passes it on to the next analysis step.

[0190] Step 3:

[0191] The server analyzes the received image and audio information to extract emotional information. Specifically, it uses natural language processing and facial recognition algorithms to identify emotional categories. The input is image and audio data, and the output is emotional categories. This operation identifies the emotional state of the user.

[0192] Step 4:

[0193] The server generates lyrics using a generative AI model based on emotion categories. Emotion categories are used as input, and corresponding lyric data is generated as output. The AI model creates linguistic expressions appropriate to the emotions expressed.

[0194] Step 5:

[0195] The server similarly generates music based on emotion categories. The input is an emotion category, and the output is a corresponding music track. The music generation algorithm automatically creates melodies and rhythms that match the emotion.

[0196] Step 6:

[0197] The server integrates the generated lyrics and music to create an album. Image information is also incorporated, completing a single listening experience. The inputs here are lyrics, music, and image information, and the output is the integrated album. This integration enables a personalized music experience for each user.

[0198] Step 7:

[0199] The robotic device plays the music from the album and generates actions. Emotional information received as user input is reused here, and the robot performs interactive actions synchronized with the music. This synchronizes the music and movement, providing a more immersive experience.

[0200] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0201] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0202] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0203] [Second Embodiment]

[0204] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0205] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0206] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0207] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0208] The microphone 238 receives instructions from the user 20 by capturing the user's voice. It converts the captured voice into audio data and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0209] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0210] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0211] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0212] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0213] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0214] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0215] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0216] The present invention's system analyzes emotional information based on user input of emotions and images as visual records, and generates personalized digital albums with music. Users access the system via a terminal and input emotions and photos according to their intention to record their experiences. This allows for the creation of unique musical works based on the user's memories.

[0217] When a user enters information, the device sends that data to the server. The server receives this information and identifies the emotion category by analyzing the text data using natural language processing technology. Once the emotion category is identified, the server uses generative AI to generate highly relevant lyrics. Simultaneously, the server uses music generation technology to generate music corresponding to the selected emotion category. This process is carried out while taking into account the characteristics of the chosen music genre.

[0218] Once the generated lyrics and music are complete, the server integrates them and combines them with image information to create a digital album. The final album is sent to the user's device, where the user can view, save, or share it with others. The album created in this process will have high emotional value for the user, providing a means to record and preserve important moments in life through music.

[0219] For example, if a user wants to record a memorable moment at their graduation ceremony, they would input the emotion as "memorable" and upload a graduation photo to the system. In this case, the server would generate lyrics based on the emotion category "memorable," select a classical music genre, and create music. Afterwards, the user would be provided with a completed album containing their photo. In this way, users can record and reminisce about important moments through music.

[0220] The following describes the processing flow.

[0221] Step 1:

[0222] The user enters emotions and related photos into the device through a dedicated interface. The device receives this user input and sends it to the server as formatted data.

[0223] Step 2:

[0224] The server receives emotion and image information transmitted from the terminal. It performs validation to ensure the data is correctly interpretable, particularly verifying that the emotion information is in the expected format.

[0225] Step 3:

[0226] The server analyzes the received emotion information using natural language processing (NLP) techniques. This identifies the main emotional themes from the input text and assigns corresponding emotion categories.

[0227] Step 4:

[0228] The server utilizes a generative AI model based on identified emotion categories to automatically generate lyrics that match the user's emotions. This process pays attention to the nuances and message of the words.

[0229] Step 5:

[0230] The server determines the emotion category and associated music genre, and then executes a music generation algorithm based on that. This creates an original melody and accompaniment that harmonizes with the emotion.

[0231] Step 6:

[0232] The server integrates the generated lyrics and music, and further adds user-provided image information as visual elements of the album. This creates a digital album that appeals to the user both visually and aurally.

[0233] Step 7:

[0234] The server sends the completed album to the device. The device displays the received album data so that the user can view it and supports actions such as saving and sharing.
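
The validation described in Step 2 can be sketched as a function that checks the received data before analysis. The payload layout is an assumption: the keys `emotion` and `images`, the allowed image types, the length limit, and the rule that at least one image is required are all hypothetical and are not specified in the source.

```python
from typing import Any, Dict, List

ALLOWED_IMAGE_TYPES = {"image/jpeg", "image/png"}  # hypothetical whitelist
MAX_EMOTION_CHARS = 500  # hypothetical length limit


def validate_submission(payload: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the payload is usable."""
    errors: List[str] = []
    emotion = payload.get("emotion")
    if not isinstance(emotion, str) or not emotion.strip():
        errors.append("emotion must be a non-empty string")
    elif len(emotion) > MAX_EMOTION_CHARS:
        errors.append("emotion text is too long")
    images = payload.get("images")
    if not isinstance(images, list) or not images:
        errors.append("at least one image is required")
    else:
        for index, image in enumerate(images):
            content_type = image.get("content_type") if isinstance(image, dict) else None
            if content_type not in ALLOWED_IMAGE_TYPES:
                errors.append(f"image {index} has an unsupported content type")
    return errors


valid_payload = {
    "emotion": "memorable",
    "images": [{"name": "graduation.jpg", "content_type": "image/jpeg"}],
}
problems = validate_submission(valid_payload)
```

Returning every error at once, rather than stopping at the first, lets the terminal show the user everything that needs correcting in a single round trip.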

[0235] (Example 1)

[0236] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0237] In today's information society, there is a growing need to express and preserve personal memories and emotions in a richer way. However, there is a lack of readily available systems that allow users to easily generate content that reflects their emotions at any given time and save it in a personalized manner. Therefore, there is a need for means to express users' own emotions and memories in an integrated visual and auditory way.

[0238] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0239] In this invention, the server includes a device for inputting emotion data, a device for inputting image data, and a processing device for analyzing the emotion data and identifying emotion categories. This makes it possible to generate content based on the individual emotions of the user.

[0240] A "device for inputting emotional data" is a device that provides an interface for users to input their own emotions in text or voice.

[0241] A "device for inputting image data" is a device that has an interface for users to upload images and photographs.

[0242] A "processing device that analyzes emotional data and identifies emotional categories" is a device that analyzes input emotional data using natural language processing technology and other methods to determine a specific emotional category.

[0243] A "processing device that generates lyrics using generative AI technology" is a device that generates prompts based on specific emotional categories, inputs them into a generative AI model, and automatically generates lyrics.

[0244] A "processing device that generates music using a music generation algorithm" is a device that has the function of selecting a suitable musical style from a specific emotional category and creating music based on that style.

[0245] A "processing device that integrates generated lyrics and music to create a digital album" is a device that combines generated lyrics and music, and further integrates them with image data to create a digital album.

[0246] A "processing device for incorporating image data into a digital album" is a device that integrates image data provided by the user along with generated music and lyrics, and incorporates visual elements into the completed digital content.

[0247] This invention is a system that generates personalized digital albums based on the user's emotions and images. First, the user inputs their emotions as text and uploads images to the system via a terminal. The terminal sends the input data to a server, where the data is analyzed and processed.

[0248] The server analyzes the received emotional data using natural language processing technologies (such as NLTK and SpaCy) to identify specific emotional categories. Based on these analysis results, it generates lyrics using generative AI technology. Specifically, prompt sentences are created as input to the generative AI model, giving instructions to the AI in the form of, for example, "Please generate lyrics that reflect deeply moving emotions."
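
The mapping from input text to an emotion category can be illustrated without any external library. The sketch below is a deliberately simple keyword lookup that stands in for the NLTK or SpaCy based analysis named above and does not use either library; the lexicon contents, the category names, and the function `identify_emotion_category` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword lexicon; a real system would use a trained model.
EMOTION_LEXICON = {
    "joy": {"happy", "glad", "joyful", "smile"},
    "excitement": {"excited", "thrilled", "cheering"},
    "deeply moved": {"moved", "touched", "memorable", "grateful"},
    "tired": {"tired", "exhausted", "sleepy"},
}


def identify_emotion_category(text: str, default: str = "neutral") -> str:
    """Return the category whose keywords occur most often in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts: Counter = Counter()
    for token in tokens:
        for category, keywords in EMOTION_LEXICON.items():
            if token in keywords:
                counts[category] += 1
    if not counts:
        return default
    return counts.most_common(1)[0][0]


category = identify_emotion_category("I was so moved and grateful at my graduation")
```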

[0249] Next, the server uses a music generation algorithm to create music corresponding to the identified emotion category. For example, it might select a classical music genre to match the emotion category "deeply moved," and then generate a melody using a music generation tool such as music generation software.

[0250] Furthermore, the server integrates the generated lyrics and music, and combines them with images uploaded by the user. This creates a digital album in which visual information, music, and lyrics are unified. This album is rendered using a multimedia editing library such as FFmpeg.
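
As an illustration of the rendering step, the following sketch builds an FFmpeg command line that turns one still image and one audio file into a video file. It is one plausible invocation, not the one used in the disclosure: the function name and file names are hypothetical, it assumes an FFmpeg build with the libx264 and aac encoders, and it does not embed the lyrics.

```python
from typing import List


def build_album_command(image_path: str, audio_path: str, output_path: str) -> List[str]:
    """Build an argument list for FFmpeg; the command is not executed here."""
    return [
        "ffmpeg",
        "-loop", "1",            # repeat the single still image as video frames
        "-i", image_path,
        "-i", audio_path,
        "-c:v", "libx264",       # assumes libx264 is available in the FFmpeg build
        "-tune", "stillimage",
        "-c:a", "aac",
        "-pix_fmt", "yuv420p",   # widely compatible pixel format
        "-shortest",             # stop when the audio, the shorter input, ends
        output_path,
    ]


command = build_album_command("graduation.jpg", "song.mp3", "album.mp4")
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where FFmpeg is installed. Passing an argument list rather than a shell string avoids quoting problems with user-supplied file names.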

[0251] The completed digital album is sent from the server to the user's device, where the user can view, save, and even share it with other users. For example, if a user wants to record a "memorable" moment at their graduation ceremony, an album will be generated that combines lyrics based on those emotions with classical music. This system allows users to emotionally record important moments and easily relive them.

[0252] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0253] Step 1:

[0254] The user inputs emotion data and image data through their device. Specifically, they input emotions in text format and upload image files. The system receives text data of emotions and image files as input.

[0255] Step 2:

[0256] The terminal sends the input emotion data and image data to the server. The data is transmitted in a secure format using the HTTP protocol. The output of this step is the emotion text and image data received by the server.

[0257] Step 3:

[0258] The server analyzes the received emotion data using natural language processing (NLP) techniques to identify emotion categories. This process utilizes NLP libraries, such as NLTK or SpaCy, to analyze the input text data. The output of this analysis is the identified emotion category.

[0259] Step 4:

[0260] The server generates lyrics using a generative AI model based on the identified emotion categories. To do this, it creates a prompt and sends it to the model. An example of a prompt is "Please generate lyrics that express a deeply moving emotion." The output of the AI processing is the lyrics generated based on the emotion.

[0261] Step 5:

[0262] The server utilizes music generation algorithms to produce music corresponding to emotion categories. For example, it might generate classical music corresponding to the emotion category "deeply moved." This step requires a music generation tool, and the output is the generated music data.

[0263] Step 6:

[0264] The server integrates the generated lyrics and music and combines them with image data. This integration process utilizes multimedia editing libraries such as FFmpeg. The output is a completed digital album.

[0265] Step 7:

[0266] The server sends the completed digital album to the device. The user can view, save, and share the album with other users via the device. The output is the completed album data accessible to the user.
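
The transmission in Step 2 can be sketched with the Python standard library. The example only constructs the request and does not send it. The endpoint URL, the JSON field names, and the function `build_upload_request` are hypothetical, and reading "secure format" as HTTP over TLS (an `https` URL) is an assumption rather than something the source states.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint; the source does not name a URL.
UPLOAD_URL = "https://album.example.com/api/submissions"


def build_upload_request(emotion_text: str, image_bytes: bytes,
                         url: str = UPLOAD_URL) -> urllib.request.Request:
    """Package emotion text and one image as a JSON POST request."""
    body = json.dumps({
        "emotion": emotion_text,
        # Binary image data is base64-encoded so it can travel inside JSON.
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_upload_request("memorable", b"\xff\xd8\xff")
```

Sending the request would be a call to `urllib.request.urlopen(request)` against a real server.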

[0267] (Application Example 1)

[0268] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0269] A challenge lies in the lack of means to emotionally and visually record important family and personal moments and memories. Furthermore, there is a need to ensure that recorded digital content is preserved not merely as data, but in a way that imbues it with emotional value.

[0270] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0271] In this invention, the server includes means for receiving emotional information, means for receiving image information, and means for converting speech to text. This enables the automatic generation of digital albums with music based on the user's emotions.

[0272] "Means for receiving emotional information" refers to a device or method for electronically receiving emotional data entered by a user.

[0273] "Means for receiving image information" refers to a device or method for receiving and recording image data provided by a user.

[0274] "Means for converting speech to text" refers to a technology or device that analyzes speech data and converts it into corresponding text data.

[0275] "Means for analyzing emotional information and identifying emotional categories" refers to a technology or process for analyzing input emotional data and classifying it into a specific emotional category.

[0276] "Means for generating lyrics" refers to a technology or device for automatically producing appropriate lyrics based on a specific emotional category.

[0277] "Means for generating music" refers to technologies or devices for composing or synthesizing music that corresponds to emotional categories.

[0278] "Means for integrating generated lyrics and music to create an album" refers to a technology or method for combining created lyrics and music and structuring them in an album format along with visual elements.

[0279] "Means for incorporating image information into an album" refers to a method or apparatus for incorporating user-provided images into a digital album to create a completed album.

[0280] To implement this invention, the user begins by inputting emotional information and image information into the system via a terminal. The terminal receives voice input and uses speech recognition technology to convert the speech into text. The Google Cloud Speech-to-Text API is used for this conversion.

[0281] Data sent from the terminal is aggregated on the server. The server analyzes the received text data using natural language processing technology and identifies the emotion category. Python and the natural language processing library NLTK are used for the analysis. Once the emotion category is identified, the server automatically generates the corresponding lyrics using a generative AI model. OpenAI's GPT model is used for lyric generation.
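The text analysis described above can be illustrated with a minimal sketch. The source names Python and NLTK; to stay self-contained, this sketch swaps NLTK for a plain regular-expression tokenizer and a small hand-written keyword lexicon. The lexicon `EMOTION_KEYWORDS`, the function `identify_emotion_category`, and the fallback category "neutral" are assumptions made for illustration and do not appear in the disclosure.

```python
import re
from collections import Counter

# Hypothetical keyword lexicon (an assumption for illustration only).
# A production system would more likely rely on a trained classifier.
EMOTION_KEYWORDS = {
    "happy": {"happy", "enjoyable", "fun", "joy", "glad"},
    "sad": {"sad", "lonely", "miss", "cry"},
    "excited": {"excited", "thrilled", "amazing"},
}


def identify_emotion_category(text: str, default: str = "neutral") -> str:
    """Return the category whose keywords occur most often in the text."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for category, keywords in EMOTION_KEYWORDS.items():
        counts[category] = sum(1 for token in tokens if token in keywords)
    category, hits = counts.most_common(1)[0]
    # Fall back to the default when no keyword matched at all.
    return category if hits > 0 else default
```

With this lexicon, the utterance "I want to record the happy memories of having a family picnic today" would be assigned the category "happy", which could then be passed to the lyric generation step.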

[0282] Next, the server generates appropriate music using music generation technology. Music generation libraries such as Magenta can be used for this purpose, including for selecting a genre. This produces music that matches the identified emotion category.

[0283] The generated lyrics and music are integrated on the server and combined with the image information to complete the digital album. Media manipulation libraries such as FFmpeg are used for this integration. Finally, the album is sent to the user's device, where it can be viewed or shared.
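As one possible illustration of the FFmpeg-based integration, the sketch below assembles an `ffmpeg` command line that turns a numbered image sequence and an audio file into a slideshow video. The function name `build_slideshow_command`, the file names, and the five-second display time per image are assumptions; the disclosure does not specify the actual options. The command is only constructed here, not executed, and embedding the lyrics (for example as a subtitle track) is left out.

```python
def build_slideshow_command(image_pattern, audio_path, output_path,
                            seconds_per_image=5):
    """Build (but do not run) an ffmpeg command for an image slideshow."""
    return [
        "ffmpeg",
        "-y",                                    # overwrite existing output
        "-framerate", f"1/{seconds_per_image}",  # one image every N seconds
        "-i", image_pattern,                     # e.g. photo%03d.jpg
        "-i", audio_path,                        # generated music file
        "-c:v", "libx264",                       # H.264 video
        "-pix_fmt", "yuv420p",                   # widely playable pixel format
        "-c:a", "aac",                           # AAC audio
        "-shortest",                             # stop at the shorter input
        output_path,
    ]


command = build_slideshow_command("photo%03d.jpg", "music.wav", "album.mp4")
```

On a machine with FFmpeg installed, the resulting list could be passed to `subprocess.run(command, check=True)`.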

[0284] As a specific example, when a user says, "I want to record the happy memories of having a family picnic today," the speech is converted into text and the emotion category "happy" is identified. Based on this, lyrics and music with a bright and happy theme are generated, and a digital album is created together with the photos. The following prompt sentence is used: "I had a picnic with my family today. It was very enjoyable. I want to record that memory. The emotion category is 'happy', and I want lyrics and music that convey the scene of the picnic."

[0285] The flow of the specific processing in Application Example 1 will be described using Figure 12.

[0286] Step 1:

[0287] The terminal receives voice input from the user. The user provides image information to be recorded together with a specific emotion. The voice data is converted into text data using the Google Cloud Speech-to-Text API. The input is voice data, and the output is text data.

[0288] Step 2:

[0289] The text data is sent to the server, and the server analyzes the text data using natural language processing technology. The Python and NLTK libraries are utilized to identify the emotion category. The input is text data, and the output is the emotion category.

[0290] Step 3:

[0291] Based on the emotion category, the server uses the OpenAI GPT model to generate relevant lyrics. The input is the emotion category, and the output is the generated lyrics. In this process, the generative AI model understands the emotion category used as a prompt and creates appropriate lyrics.

[0292] Step 4:

[0293] The server generates music based on emotion categories. It uses music generation libraries such as Magenta to compose music that matches the emotion. The input is the emotion category, and the output is the generated music.

[0294] Step 5:

[0295] The generated lyrics and music are integrated on the server. Image information is also incorporated as part of the album. Media manipulation libraries such as FFmpeg are used to integrate them into a slideshow format. The input consists of lyrics, music, and image information, and the output is a digital album.

[0296] Step 6:

[0297] The server sends the completed digital album to the user's device. The user can view and save the digital album on their device. The input is the digital album, and the output is the album displayed on the user's device.
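Steps 2 through 5 above can be sketched as a small orchestration function. Because the actual analysis, lyric, and music components are external services or libraries, they are passed in as callables; the names `DigitalAlbum`, `create_album`, `classify`, `write_lyrics`, and `compose` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DigitalAlbum:
    emotion_category: str
    lyrics: str
    music: bytes
    images: List[str]


def create_album(text: str,
                 images: List[str],
                 classify: Callable[[str], str],
                 write_lyrics: Callable[[str], str],
                 compose: Callable[[str], bytes]) -> DigitalAlbum:
    """Run the server-side flow: text -> category -> lyrics and music -> album."""
    category = classify(text)        # Step 2: identify the emotion category
    lyrics = write_lyrics(category)  # Step 3: generate lyrics
    music = compose(category)        # Step 4: generate music
    # Step 5: bundle lyrics, music, and images into one album object
    return DigitalAlbum(category, lyrics, music, list(images))


# Usage with stand-in components (stubs, not real models):
album = create_album(
    "I had a very enjoyable picnic with my family",
    ["picnic.jpg"],
    classify=lambda text: "happy",
    write_lyrics=lambda category: f"lyrics about feeling {category}",
    compose=lambda category: b"\x00\x01",
)
```

Injecting the components keeps the flow testable without network access and allows each step to be replaced independently.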

[0298] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0299] The present invention aims to recognize a user's emotions in more detail and generate music that aligns with those emotions. To this end, the system incorporates an emotion engine that can recognize emotions by analyzing not only the emotional information input by the user, but also the user's facial expressions, voice tone, and biometric information. Because this emotion engine handles multiple sensory input data, it achieves highly accurate and reliable emotion recognition.

[0300] Users access the system using their devices and input their emotions and photos. Furthermore, the emotion engine uses the camera and microphone to capture and analyze the user's facial expressions and voice in real time. The device then sends this data to a server. The server aggregates and processes the transmitted data, using natural language processing techniques and machine learning algorithms to identify emotion categories. This process makes it possible to create music that takes into account not only the user's temporary emotions but also emotional changes and complex emotional patterns.

[0301] The server uses a generative AI to create appropriate lyrics based on the emotion category identified by the emotion engine. The music generation process considers the emotion category and corresponding music genre to generate a melody and rhythm that best fits the current emotion. The generated content is integrated into a single album along with image information provided by the user.

[0302] For example, if a user feels happy at a sporting event, the emotion engine analyzes not just the emotion of "happiness," but also the smile on their face and the excited tone of their voice, which are expressed in real time. As a result, the server generates a music album containing lively pop music and lyrics like an anthem, and provides it to the user's device. In this way, it is possible to create a music album that is more deeply intertwined with the user's experience and emotions.

[0303] This invention enhances the user experience and enables music generation that closely responds to individual emotions. This system enriches everyday emotions and allows for more immersive memory recording.

[0304] The following describes the processing flow.

[0305] Step 1:

[0306] The user accesses the system using a terminal. The user inputs their emotions and photos related to the situation at that time, and activates the camera and microphone if necessary. The terminal sends the input text data and photo data to the server.

[0307] Step 2:

[0308] With the user's permission, the terminal captures the user's facial expressions and voice tones in real time through the camera and microphone. These data are processed by the emotion engine for emotion recognition and sent to the server.

[0309] Step 3:

[0310] The server analyzes the received text, photos, and real-time data from the emotion engine. It analyzes the text data using natural language processing techniques to identify emotion categories. Additionally, it extracts additional emotion information from the facial expression data and voice tones to determine a comprehensive emotion category.

[0311] Step 4:

[0312] The server uses generative AI to automatically generate appropriate lyrics based on the identified emotion category. At this stage, words are selected to fit the tone of the emotion.

[0313] Step 5:

[0314] The server creates an original melody and accompaniment using music generation AI, taking into account the emotion category and the selected music genre. The generated melody is designed to best harmonize with the user's emotions.

[0315] Step 6:

[0316] The server integrates the generated lyrics and music to create a visual album that includes the photos sent by the user. This forms digital content that integrates hearing and vision.

[0317] Step 7:

[0318] The server sends the completed album to the user's device. The device then displays this digital album so that the user can view, save, or share it with other users.

[0319] (Example 2)

[0320] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0321] Traditional music generation systems evaluate user emotions based on simple text input, making it difficult to generate personalized music that reflects subtle nuances and changes in those emotions. This limits the user experience and fails to provide a musical environment that fully resonates with their feelings.

[0322] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0323] In this invention, the server includes means for receiving emotional information, means for capturing facial expression data and recording voice tone, means for analyzing the emotional information, facial expression data, and voice tone to identify an emotional category, and means for generating lyrics based on the emotional category using a generative AI model. This makes it possible to recognize the user's diverse and subtle emotions with high accuracy and generate music and lyrics individually to match them.

[0324] "Emotional information" refers to input data in which a user expresses their internal emotional state in text or other forms.

[0325] "Image information" refers to images and photographic data related to the user, and is used as an element in interpreting emotions.

[0326] "Facial expression data" refers to data about the user's facial expressions captured through a camera.

[0327] "Voice tone" refers to data related to audio information obtained from the user's speaking style and voice quality.

[0328] An "emotional category" is a classification that indicates the emotional state of a user, identified from the analyzed emotional information.

[0329] A "generative AI model" is an artificial intelligence system used to generate music and lyrics based on emotional information.

[0330] "Musical style" refers to a specific musical genre or characteristic that corresponds to an emotional category.

[0331] This invention relates to a system that analyzes a user's emotions in a multidimensional manner and generates music based on that analysis. The user inputs their emotional information into the system through a terminal, and also provides facial expression data and voice tone using a camera and microphone. The terminal collects this data and transmits it to a server.

[0332] The server analyzes the received data using an emotion engine. Natural language processing techniques and machine learning algorithms are employed to identify emotion categories from the input emotion information, facial expression data, and voice tone. These emotion categories reflect specific emotional states (e.g., "joy," "excitement").
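One simple way to combine the three inputs is a weighted average of per-modality scores, sketched below. The disclosure does not state how the emotion engine fuses its inputs, so the weights in `DEFAULT_WEIGHTS`, the function `fuse_emotion_scores`, and the assumption that each modality yields a score per category are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical modality weights (an assumption, not from the disclosure).
DEFAULT_WEIGHTS = {"text": 0.5, "face": 0.3, "voice": 0.2}


def fuse_emotion_scores(
    scores_by_modality: Dict[str, Dict[str, float]],
    weights: Dict[str, float] = DEFAULT_WEIGHTS,
) -> Tuple[str, Dict[str, float]]:
    """Return the top emotion category and the fused score per category."""
    totals: Dict[str, float] = {}
    weight_sum = 0.0
    for modality, scores in scores_by_modality.items():
        weight = weights.get(modality, 0.0)  # unknown modalities are ignored
        weight_sum += weight
        for category, score in scores.items():
            totals[category] = totals.get(category, 0.0) + weight * score
    if weight_sum == 0 or not totals:
        raise ValueError("no usable modality scores")
    # Normalize so that missing modalities do not shrink the scores.
    fused = {category: total / weight_sum for category, total in totals.items()}
    return max(fused, key=fused.get), fused


category, fused = fuse_emotion_scores({
    "text": {"joy": 0.6, "excitement": 0.4},
    "face": {"joy": 0.2, "excitement": 0.8},
    "voice": {"joy": 0.1, "excitement": 0.9},
})
```

In this example the text alone leans toward "joy", but the facial expression and voice tone shift the fused result to "excitement".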

[0333] Next, the server uses a generative AI model to generate lyrics and music based on this emotion category. The generation prompt is an instruction to the AI such as, "Create a song that matches the user's emotions." The generative AI model considers the musical style corresponding to the emotion category and determines the melody and rhythm of the music.
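The link between an emotion category and a musical style can be pictured as a lookup table. The genre pairings below follow examples given in the text ("excitement" with energetic pop, "deep emotion" with classical music); the tempo and mode values, the `MusicalStyle` type, and the default style are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MusicalStyle:
    genre: str
    tempo_bpm: int  # assumed value, not specified in the disclosure
    mode: str       # assumed value, not specified in the disclosure


# Hypothetical mapping from emotion category to musical style.
STYLE_BY_EMOTION = {
    "excitement": MusicalStyle("pop", 128, "major"),
    "joy": MusicalStyle("pop", 110, "major"),
    "deep emotion": MusicalStyle("classical", 72, "major"),
}

DEFAULT_STYLE = MusicalStyle("pop", 100, "major")


def select_style(emotion_category: str) -> MusicalStyle:
    """Look up the style for a category, falling back to a default."""
    return STYLE_BY_EMOTION.get(emotion_category.strip().lower(), DEFAULT_STYLE)
```

The selected style could then be passed to the music generation model as conditioning parameters alongside the prompt.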

[0334] The generated music and lyrics are integrated and combined with user-provided image information to create a digital album. This album is delivered to the user's device, providing an emotionally rich experience through both sight and sound. For example, if a user feels "excitement" at a sporting event, the server creates an energetic pop song and anthem-like lyrics, which are then delivered as a music album.

[0335] This system generates personalized music content in real time that is fully responsive to the user's emotions, allowing them to enjoy a deeper emotional experience.

[0336] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0337] Step 1:

[0338] Users access the system using a device and input their emotions as text or upload photos. The device collects data by capturing the user's facial expressions with its camera and recording their voice tone with its microphone. This process yields emotional information, facial expression data, and voice data as input data. This input data is packaged and sent to the server.

[0339] Step 2:

[0340] The server receives the emotional information, facial expression data, and voice data transmitted from the terminal. This data is analyzed by an emotion engine to comprehensively evaluate the user's emotions. The analysis uses natural language processing techniques to extract emotions from text and machine learning algorithms to identify emotion categories from the facial expression and voice data. The analysis outputs an emotion category (e.g., "joy," "excitement").

[0341] Step 3:

[0342] The server generates appropriate lyrics using a generative AI model based on the emotion categories identified through analysis. Here, a prompt sentence based on the emotion category (e.g., "Create a song that matches the user's feelings of joy") is supplied to the generative AI model. Based on this prompt, the AI model generates appropriate lyrics, and those lyrics are output.

[0343] Step 4:

[0344] The server generates music based on the same emotion category. The generative AI model links emotion categories with musical styles to create the most appropriate melody and rhythm. This process outputs a music file that matches the emotion.

[0345] Step 5:

[0346] The server integrates the generated lyrics and music, thereby forming the music content. Simultaneously, it integrates image information provided by the user to create a personalized digital album. This album data becomes the final output.

[0347] Step 6:

[0348] The server delivers a personalized digital album to the user's device. The user can then play this digital album and enjoy their emotional experience visually and aurally.

[0349] (Application Example 2)

[0350] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0351] This invention aims to improve the user experience by generating music based on the user's emotions. However, existing systems have difficulty precisely identifying the user's facial expressions and tone of voice and providing music that matches their emotions in real time. Furthermore, these systems lack means to deepen interaction with the user when the robot plays music.

[0352] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0353] In this invention, the server includes means for receiving emotional information, means for receiving image information, and means for receiving audio information. This makes it possible to recognize the user's complex emotions with high accuracy and generate music and lyrics that match those emotions. Furthermore, by having the robot generate actions in accordance with the generated music, it becomes possible to enrich the interaction with the user.

[0354] "Emotional information" refers to data about emotions obtained based on the user's facial expressions, tone of voice, and other biometric data.

[0355] "Image information" refers to digital image or video data that includes the user's visual characteristics and environmental information.

[0356] "Audio information" refers to audio data that includes the tone and rhythm of the user's voice.

[0357] An "emotion category" is a type of emotion classified based on analyzed emotional information.

[0358] "Means for generating lyrics" refers to techniques for constructing appropriate words and phrases based on emotion categories.

[0359] "Means for generating music" refers to techniques for creating melodies and rhythms that correspond to emotion categories.

[0360] "Means for creating an album" refers to the technology that integrates the generated lyrics and music into a single cohesive whole.

[0361] "Means for generating motion" refers to the technology for designing robot movements that are synchronized with music.

[0362] The system used to realize this application features the ability to recognize the user's emotions in detail and generate music accordingly. Users access the system via a smartphone or home robot terminal and input their emotions and image information. The terminal uses a camera and microphone to capture the user's facial expressions and voice tone in real time. This data is then transmitted from the terminal to a server.

[0363] The server analyzes the emotion category based on the received emotion and voice information. Natural language processing technology and emotion analysis algorithms are used for the analysis. This analysis makes it possible to recognize the complex patterns of the user's emotions with high accuracy.

[0364] Furthermore, the server uses an AI model to generate appropriate lyrics based on the emotional category and then produces music. The music genre is automatically selected based on the emotional category. The generated lyrics and music are integrated into a single album and provided to the user along with image information.

[0365] In addition, the home robot generates movements that match the music being played. The robot can move in sync with the generated music to provide an immersive experience similar to a live performance.
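Synchronizing robot movements with music can be sketched by deriving beat times from the tempo and assigning one motion per beat. The disclosure does not describe how motions are timed, so the function `schedule_motions`, the motion names, and the one-motion-per-beat rule are assumptions made for illustration.

```python
from typing import List, Tuple


def schedule_motions(tempo_bpm: float,
                     duration_s: float,
                     motions: List[str]) -> List[Tuple[float, str]]:
    """Return (time in seconds, motion name) pairs, one per beat."""
    if tempo_bpm <= 0:
        raise ValueError("tempo_bpm must be positive")
    if not motions:
        raise ValueError("at least one motion is required")
    interval = 60.0 / tempo_bpm  # seconds between consecutive beats
    schedule = []
    beat = 0
    while beat * interval < duration_s:
        # Cycle through the available motions beat by beat.
        motion = motions[beat % len(motions)]
        schedule.append((round(beat * interval, 3), motion))
        beat += 1
    return schedule


plan = schedule_motions(120, 2.0, ["sway_left", "sway_right"])
```

At 120 beats per minute, a beat falls every 0.5 seconds, so a two-second clip yields four alternating motions starting at 0.0 seconds.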

[0366] As a concrete example, imagine a scenario where, upon a user returning from work, the system recognizes their tired expression, generates relaxing music and lyrics, and a household robot performs soothing movements in time with the music. In this way, the individual user experience can be improved.

[0367] An example of a prompt to input into a generative AI model is: "Build a neural network that generates optimal relaxation music based on the user's emotions. Take into account the user's facial expressions and tone of voice."

[0368] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0369] Step 1:

[0370] The device uses a camera and microphone to capture the user's facial expressions and voice. The input data consists of image and audio information, which is collected in real time. This allows the device to acquire biometric information such as the user's facial expressions and voice tone.

[0371] Step 2:

[0372] The device sends the acquired image and audio information to the server as a dataset. The input here is the unprocessed image and audio data sent from the device. The server receives this and passes it on to the next analysis step.

[0373] Step 3:

[0374] The server analyzes the received image and audio information to extract emotional information. Specifically, it uses natural language processing and facial recognition algorithms to identify emotional categories. The input is image and audio data, and the output is emotional categories. This operation identifies the emotional state of the user.

[0375] Step 4:

[0376] The server generates lyrics using a generative AI model based on emotion categories. Emotion categories are used as input, and corresponding lyric data is generated as output. The AI model creates linguistic expressions appropriate to the emotions expressed.

[0377] Step 5:

[0378] The server similarly generates music based on emotion categories. The input is an emotion category, and the output is a corresponding music track. The music generation algorithm automatically creates melodies and rhythms that match the emotion.

[0379] Step 6:

[0380] The server integrates the generated lyrics and music to create an album. Image information is also incorporated, completing a single listening experience. The inputs here are lyrics, music, and image information, and the output is the integrated album. This integration enables a personalized music experience for each user.

[0381] Step 7:

[0382] The robotic device plays the music from the album and generates actions. Emotional information received as user input is reused here, and the robot performs interactive actions synchronized with the music. This synchronizes the music and movement, providing a more immersive experience.

[0383] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0384] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0385] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0386] [Third Embodiment]

[0387] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0388] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0389] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0390] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0391] The microphone 238 accepts instructions and the like from the user 20 by receiving the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0392] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0393] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0394] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0395] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0396] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0397] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0398] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0399] The system of the present invention analyzes emotional information entered by the user, together with images serving as a visual record, and generates a personalized digital album with music. Users access the system via a terminal and input emotions and photos that they wish to record. This allows for the creation of unique musical works based on the user's memories.

[0400] When a user enters information, the terminal sends that data to the server. The server receives this information and identifies the emotion category by analyzing the text data using natural language processing technology. Once the emotion category is identified, the server uses generative AI to generate highly relevant lyrics. At the same time, the server uses music generation technology to generate music corresponding to the identified emotion category. This process takes into account the characteristics of the chosen music genre.

[0401] Once the generated lyrics and music are complete, the server integrates them and combines them with image information to create a digital album. The final album is sent to the user's device, where the user can view, save, or share it with others. The album created in this process will have high emotional value for the user, providing a means to record and preserve important moments in life through music.

[0402] For example, if a user wants to record a memorable moment at their graduation ceremony, they would input the emotion as "memorable" and upload a graduation photo to the system. In this case, the server would generate lyrics based on the emotion category "memorable," select a classical music genre, and create music. Afterwards, the user would be provided with a completed album containing their photo. In this way, users can record and reminisce about important moments through music.

[0403] The following describes the processing flow.

[0404] Step 1:

[0405] The user enters their emotions and related photos into the device through a dedicated interface. The device receives this input and sends it to the server as formatted data.

[0406] Step 2:

[0407] The server receives emotion and image information transmitted from the terminal. It performs validation to ensure the data is correctly interpretable, particularly verifying that the emotion information is in the expected format.

[0408] Step 3:

[0409] The server analyzes the received emotion information using natural language processing (NLP) techniques. This identifies the main emotional themes in the input text and assigns the corresponding emotion categories.

[0410] Step 4:

[0411] The server utilizes a generative AI model based on identified emotion categories to automatically generate lyrics that match the user's emotions. This process pays attention to the nuances and message of the words.

[0412] Step 5:

[0413] The server determines the music genre associated with the emotion category and then executes a music generation algorithm based on that genre. This creates an original melody and accompaniment that harmonize with the emotion.

[0414] Step 6:

[0415] The server integrates the generated lyrics and music, and further adds user-provided image information as visual elements of the album. This creates a digital album that appeals to the user both visually and aurally.

[0416] Step 7:

[0417] The server sends the completed album to the device. The device displays the received album data so that the user can view it and supports actions such as saving and sharing.
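The validation mentioned in Step 2 of this flow could take a form like the following sketch, which checks that the emotion information is non-empty text and that at least one image with a recognized extension is attached. The payload keys `emotion` and `images`, the list of allowed extensions, and the function `validate_submission` are assumptions, since the disclosure does not define the data format.

```python
from typing import List

# Hypothetical set of accepted image file extensions.
ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png")


def validate_submission(payload: dict) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    emotion = payload.get("emotion")
    if not isinstance(emotion, str) or not emotion.strip():
        errors.append("emotion must be a non-empty string")
    images = payload.get("images")
    if not isinstance(images, list) or not images:
        errors.append("images must be a non-empty list")
    elif not all(isinstance(name, str)
                 and name.lower().endswith(ALLOWED_EXTENSIONS)
                 for name in images):
        errors.append("every image must be a .jpg, .jpeg, or .png file name")
    return errors
```

Returning all errors at once, rather than failing on the first, would let the terminal show the user everything that needs correcting in a single round trip.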

[0418] (Example 1)

[0419] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0420] In today's information society, there is a growing need to express and preserve personal memories and emotions in a richer way. However, there is a lack of readily available systems that allow users to easily generate content that reflects their emotions at any given time and save it in a personalized manner. Therefore, there is a need for means to express users' own emotions and memories in an integrated visual and auditory way.

[0421] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0422] In this invention, the server includes a device for inputting emotion data, a device for inputting image data, and a processing device for analyzing the emotion data and identifying emotion categories. This makes it possible to generate content based on the individual emotions of the user.

[0423] A "device for inputting emotional data" is a device that provides an interface for users to input their own emotions in text or voice.

[0424] A "device for inputting image data" is a device that has an interface for users to upload images and photographs.

[0425] A "processing device that analyzes emotional data and identifies emotional categories" is a device that analyzes input emotional data using natural language processing technology and other methods to determine a specific emotional category.

[0426] A "processing device that generates lyrics using generative AI technology" is a device that generates prompts based on specific emotional categories, inputs them into a generative AI model, and automatically generates lyrics.

[0427] A "processing device that generates music using a music generation algorithm" is a device that has the function of selecting a suitable musical style from a specific emotional category and creating music based on that style.

[0428] A "processing device that integrates generated lyrics and music to create a digital album" is a device that combines generated lyrics and music, and further integrates them with image data to create a digital album.

[0429] A "processing device for incorporating image data into a digital album" is a device that integrates image data provided by the user along with generated music and lyrics, and incorporates visual elements into the completed digital content.

[0430] This invention is a system that generates personalized digital albums based on the user's emotions and images. First, the user inputs their emotions as text and uploads images to the system via a terminal. The terminal sends the input data to a server, where the data is analyzed and processed.

[0431] The server analyzes the received emotional data using natural language processing libraries (such as NLTK and spaCy) to identify a specific emotional category. Based on the analysis result, it generates lyrics using generative AI technology. Specifically, it creates a prompt as input to the generative AI model, instructing the model with text such as "Please generate lyrics that reflect deeply moving emotions."

[0432] Next, the server uses a music generation algorithm to create music corresponding to the identified emotion category. For example, it might select the classical genre to match the emotion category "emotional" and then generate a melody using music generation software.

[0433] Furthermore, the server integrates the generated lyrics and music, and combines them with images uploaded by the user. This creates a digital album in which visual information, music, and lyrics are unified. This album is rendered using a multimedia editing library such as FFmpeg.

[0434] The completed digital album is sent from the server to the user's device, where the user can view, save, and even share it with other users. For example, if a user wants to record a "memorable" moment at their graduation ceremony, an album will be generated that combines lyrics based on those emotions with classical music. This system allows users to emotionally record important moments and easily relive them.

[0435] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0436] Step 1:

[0437] The user inputs emotion data and image data through their device. Specifically, they input emotions in text format and upload image files. The system receives text data of emotions and image files as input.

[0438] Step 2:

[0439] The terminal sends the input emotion data and image data to the server. The data is transmitted in a secure format using the HTTP protocol. The output of this step is the emotion text and image data received by the server.
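
As a rough sketch of Step 2, the terminal could package the emotion text and the image into a single JSON request body. The endpoint URL, the field names, the base64 encoding, and the use of an HTTPS URL are assumptions made for illustration; the description only states that the data is transmitted securely over HTTP.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint; the description does not name one.
ALBUM_ENDPOINT = "https://example.com/api/albums"


def build_upload_request(emotion_text: str, image_bytes: bytes,
                         url: str = ALBUM_ENDPOINT) -> urllib.request.Request:
    """Package emotion text and an image into a JSON POST request (not sent)."""
    payload = {
        "emotion_text": emotion_text,
        # Binary image data is base64-encoded so it can travel inside JSON.
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_upload_request("I was deeply moved at graduation.", b"\x89PNG...")
# Sending the request would be: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the packaging step easy to inspect before any network call is made.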

[0440] Step 3:

[0441] The server analyzes the received emotion data using natural language processing (NLP) techniques to identify an emotion category. This process uses NLP libraries, such as NLTK or spaCy, to analyze the input text data. The output of this analysis is the identified emotion category.
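
The idea behind Step 3 can be illustrated with a deliberately simple keyword lookup. This is a stand-in, not the method of the description, which names NLP libraries such as NLTK or spaCy; the lexicon, the category names, and the "neutral" fallback are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword lexicon. A real system would use a trained model
# or an NLP library, as the description indicates.
EMOTION_KEYWORDS = {
    "fun": {"fun", "enjoyed", "laughed", "picnic"},
    "deeply moved": {"moved", "touched", "tears", "graduation"},
    "excitement": {"excited", "thrilled", "cheered"},
}


def identify_emotion_category(text: str, default: str = "neutral") -> str:
    """Return the category whose keywords occur most often in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for category, keywords in EMOTION_KEYWORDS.items():
        counts[category] = sum(1 for token in tokens if token in keywords)
    category, hits = counts.most_common(1)[0]
    # Fall back to the default when no keyword matched at all.
    return category if hits > 0 else default
```

For instance, `identify_emotion_category("The picnic was so much fun")` matches two keywords of the "fun" category and returns that category.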

[0442] Step 4:

[0443] The server generates lyrics using a generative AI model based on the identified emotion categories. To do this, it creates a prompt and sends it to the model. An example of a prompt is "Please generate lyrics that express a deeply moving emotion." The output of the AI processing is the lyrics generated based on the emotion.
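
A minimal sketch of how such a prompt might be assembled from the identified emotion category is shown below. The function name, the optional context parameter, and the exact wording are assumptions; the description gives only the example prompt quoted above.

```python
def build_lyrics_prompt(emotion_category: str, context: str = "") -> str:
    """Compose a lyrics-generation prompt from an emotion category."""
    category = emotion_category.strip()
    if not category:
        raise ValueError("emotion_category must not be empty")
    prompt = f"Please generate lyrics that express the emotion '{category}'."
    # Optional free-text context (e.g. the user's own words) is appended
    # so the model can tie the lyrics to the actual situation.
    if context.strip():
        prompt += f" The lyrics should reflect this situation: {context.strip()}"
    return prompt


prompt = build_lyrics_prompt("deeply moved", "my graduation ceremony")
```

The resulting string would then be passed to whichever generative AI model the server uses.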

[0444] Step 5:

[0445] The server utilizes music generation algorithms to produce music corresponding to emotional categories. For example, it might generate classical music corresponding to the emotion of "deep emotion." The use of a music generation tool is required, and the output is the generated music data.
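
The selection of a musical style from an emotion category can be sketched as a table lookup. The classical and pop pairings follow the examples in the description; the tempo values, the modes, the "ambient" default, and all names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MusicStyle:
    genre: str
    tempo_bpm: int
    mode: str  # "major" or "minor"


# Hypothetical table; tempo and mode values are illustrative only.
STYLE_BY_EMOTION = {
    "deep emotion": MusicStyle("classical", 72, "major"),
    "fun": MusicStyle("pop", 120, "major"),
    "excitement": MusicStyle("pop", 132, "major"),
}
DEFAULT_STYLE = MusicStyle("ambient", 90, "major")


def select_music_style(emotion_category: str) -> MusicStyle:
    """Look up the style for a category, falling back to a default."""
    return STYLE_BY_EMOTION.get(emotion_category.strip().lower(), DEFAULT_STYLE)
```

The selected style would then parameterize whatever music generation tool is used in this step.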

[0446] Step 6:

[0447] The server integrates the generated lyrics and music and combines them with image data. This integration process utilizes multimedia editing libraries such as FFmpeg. The output is a completed digital album.
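
One way the integration in Step 6 might be driven is by assembling an FFmpeg command line that pairs a still image with the generated audio track. The file names and the codec choices (libx264, AAC) are assumptions, and the lyrics are not rendered here; the sketch only builds the argument list and does not run FFmpeg.

```python
def build_album_video_command(image_path: str, audio_path: str,
                              output_path: str) -> list[str]:
    """Build an FFmpeg argument list for a still-image video with audio."""
    return [
        "ffmpeg", "-y",                    # overwrite the output if it exists
        "-loop", "1", "-i", image_path,    # repeat the still image as video
        "-i", audio_path,                  # generated music track
        "-c:v", "libx264", "-tune", "stillimage",
        "-c:a", "aac",
        "-pix_fmt", "yuv420p",             # widely compatible pixel format
        "-shortest",                       # stop when the audio ends
        output_path,
    ]


command = build_album_video_command("photo.jpg", "song.wav", "album.mp4")
# To render: subprocess.run(command, check=True)  (requires FFmpeg installed)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with user-supplied file names.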

[0448] Step 7:

[0449] The server sends the completed digital album to the device. The user can view, save, and share the album with other users via the device. The output is the completed album data accessible to the user.

[0450] (Application Example 1)

[0451] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0452] A challenge lies in the lack of means to emotionally and visually record important family and personal moments and memories. Furthermore, there is a need to ensure that recorded digital content is preserved not merely as data, but in a way that imbues it with emotional value.

[0453] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0454] In this invention, the server includes means for receiving emotional information, means for receiving image information, and means for converting speech to text. This enables the automatic generation of digital albums with music based on the user's emotions.

[0455] "Means for receiving emotional information" refers to a device or method for electronically receiving emotional data entered by a user.

[0456] "Means for receiving image information" refers to a device or method for receiving and recording image data provided by a user.

[0457] "Means for converting speech to text" refers to a technology or device that analyzes speech data and converts it into corresponding text data.

[0458] "Means for analyzing emotional information and identifying emotional categories" refers to a technology or process for analyzing input emotional data and classifying it into a specific emotional category.

[0459] "Means for generating lyrics" refers to a technology or device for automatically producing appropriate lyrics based on a specific emotional category.

[0460] "Means for generating music" refers to technologies or devices for composing or synthesizing music that corresponds to emotional categories.

[0461] "Means for integrating generated lyrics and music to create an album" refers to a technology or method for combining created lyrics and music and structuring them in an album format along with visual elements.

[0462] "Means for incorporating image information into an album" refers to a method or apparatus for incorporating user-provided images into a digital album to create a completed album.

[0463] To implement this invention, the user begins by inputting emotional and image information into the system via a terminal. The terminal receives voice input and utilizes speech recognition technology to convert the speech into text. This technology utilizes the Google Cloud Speech-to-Text API.

[0464] Data sent from the terminal is aggregated on the server. The server analyzes the received text data using natural language processing technology and identifies the emotion category. Python and the natural language processing library NLTK are used for the analysis. Once the emotion category is identified, the server automatically generates the corresponding lyrics using a generative AI model. OpenAI's GPT model is used for lyric generation.

[0465] Next, the server generates appropriate music using music generation technology. Music generation libraries such as Magenta can be used, with the genre selected to suit the emotion category. This creates music that matches the identified emotional category.

[0466] The generated lyrics and music are integrated on a server, and combined with image information to complete the digital album. Media manipulation libraries such as FFmpeg are used for this integration. Finally, the album is sent to the user's device, where it can be viewed or shared.

[0467] For example, if a user inputs something like, "I want to record the fun memories of the family picnic I had today," the audio is converted to text, and the emotion category "fun" is identified. Based on this, cheerful and upbeat lyrics and music are generated, and a digital album is created along with photos. A prompt like the following might be used: "I had a picnic with my family today. It was so much fun. I want to record those memories. The emotion category is 'fun,' and I want lyrics and music that convey the feeling of the picnic."

[0468] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0469] Step 1:

[0470] The device receives voice input from the user. The user provides image information they want to record along with a specific emotion. The voice data is converted into text data using the Google Cloud Speech-to-Text API. The input is voice data, and the output is text data.

[0471] Step 2:

[0472] The text data is sent to the server, which analyzes it using natural language processing techniques. Python and the NLTK library are used to identify the emotion category. The input is text data, and the output is the emotion category.

[0473] Step 3:

[0474] The server generates relevant lyrics based on emotion categories using an OpenAI GPT model. The input is an emotion category, and the output is the generated lyrics. In this process, the generative AI model understands the emotion category used as a prompt and creates appropriate lyrics.

[0475] Step 4:

[0476] The server generates music based on emotion categories. It uses music generation libraries such as Magenta to compose music that matches the emotion. The input is the emotion category, and the output is the generated music.

[0477] Step 5:

[0478] The generated lyrics and music are integrated on the server. Image information is also incorporated as part of the album. Media manipulation libraries such as FFmpeg are used to integrate them into a slideshow format. The input consists of lyrics, music, and image information, and the output is a digital album.
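
For the slideshow format mentioned in Step 5, a possible approach is to write an input list for FFmpeg's concat demuxer, one entry per photo. The function name, the per-image duration, and the file names are assumptions; the sketch only produces the list text.

```python
def build_concat_list(image_paths: list[str],
                      seconds_per_image: float = 3.0) -> str:
    """Return concat-demuxer list text showing each image for a fixed time."""
    if not image_paths:
        raise ValueError("at least one image is required")

    def quote(path: str) -> str:
        # Inside single quotes, a literal quote is written as '\''
        return "'" + path.replace("'", "'\\''") + "'"

    lines = []
    for path in image_paths:
        lines.append(f"file {quote(path)}")
        lines.append(f"duration {seconds_per_image:g}")
    # Commonly recommended workaround: list the last image once more
    # without a duration so that its display time is honored.
    lines.append(f"file {quote(image_paths[-1])}")
    return "\n".join(lines) + "\n"


listing = build_concat_list(["picnic1.jpg", "picnic2.jpg"], 3)
# A typical invocation with this list saved as list.txt would be:
#   ffmpeg -f concat -safe 0 -i list.txt -i song.wav -shortest album.mp4
```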

[0479] Step 6:

[0480] The server sends the completed digital album to the user's device. The user can view and save the digital album on their device. The input is the digital album, and the output is the album displayed on the user's device.

[0481] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0482] The present invention aims to recognize a user's emotions in more detail and generate music that aligns with those emotions. To this end, the system incorporates an emotion engine that recognizes emotions by analyzing not only the emotional information input by the user, but also the user's facial expressions, voice tone, and biometric information. Because this emotion engine combines several kinds of input data, it can recognize emotions more accurately and reliably than from text input alone.

[0483] Users access the system using their devices and input their emotions and photos. Furthermore, the emotion engine uses the camera and microphone to capture and analyze the user's facial expressions and voice in real time. The device then sends this data to a server. The server aggregates and processes the transmitted data, using natural language processing techniques and machine learning algorithms to identify emotion categories. This process makes it possible to create music that takes into account not only the user's temporary emotions but also emotional changes and complex emotional patterns.

[0484] The server uses a generative AI to create appropriate lyrics based on the emotion category identified by the emotion engine. The music generation process considers the emotion category and corresponding music genre to generate a melody and rhythm that best fits the current emotion. The generated content is integrated into a single album along with image information provided by the user.

[0485] For example, if a user feels happy at a sporting event, the emotion engine analyzes not just the emotion of "happiness," but also the smile on their face and the excited tone of their voice, which are expressed in real time. As a result, the server generates a music album containing lively pop music and lyrics like an anthem, and provides it to the user's device. In this way, it is possible to create a music album that is more deeply intertwined with the user's experience and emotions.

[0486] This invention enhances the user experience and enables music generation that closely responds to individual emotions. This system enriches everyday emotions and allows for more immersive memory recording.

[0487] The following describes the processing flow.

[0488] Step 1:

[0489] The user accesses the system using a device. The user inputs their emotions as text together with photos of the situation at the time, and activates the camera and microphone as needed. The device sends the input text data and photo data to the server.

[0490] Step 2:

[0491] With the user's permission, the device captures the user's facial expressions and voice tone in real time via the camera and microphone. This data is processed by an emotion engine for emotion recognition and then sent to a server.

[0492] Step 3:

[0493] The server analyzes incoming text, photos, and real-time data from the emotion engine. It uses natural language processing techniques to analyze text data and identify emotion categories. It also extracts additional emotion information from facial expression data and voice tone to determine an overall emotion category.
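
The description does not specify how the text, facial expression, and voice tone results are combined into an overall emotion category. One simple possibility, shown here purely as an assumption, is a weighted average of per-modality scores (late fusion); the modality names, weights, and score values are hypothetical.

```python
def fuse_emotion_scores(
    modality_scores: dict[str, dict[str, float]],
    weights: dict[str, float],
) -> tuple[str, dict[str, float]]:
    """Combine per-modality emotion scores by weighted average."""
    totals: dict[str, float] = {}
    weight_sum = 0.0
    for modality, scores in modality_scores.items():
        weight = weights.get(modality, 0.0)
        if weight <= 0.0:
            continue  # modalities without a positive weight are ignored
        weight_sum += weight
        for category, score in scores.items():
            totals[category] = totals.get(category, 0.0) + weight * score
    if weight_sum == 0.0:
        raise ValueError("no modality has a positive weight")
    fused = {category: total / weight_sum for category, total in totals.items()}
    # The overall category is the one with the highest fused score.
    return max(fused, key=fused.get), fused


overall, fused = fuse_emotion_scores(
    {
        "text": {"happiness": 0.6, "calm": 0.4},
        "face": {"happiness": 0.9, "calm": 0.1},
        "voice": {"happiness": 0.3, "calm": 0.7},
    },
    {"text": 0.5, "face": 0.3, "voice": 0.2},
)
```

With these illustrative numbers, the fused score favors "happiness" even though the voice modality alone leans toward "calm".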

[0494] Step 4:

[0495] The server uses AI to automatically generate appropriate lyrics based on identified emotion categories. At this stage, appropriate words that fit the emotional tone are selected.

[0496] Step 5:

[0497] The server takes into account the emotion category and selected music genre, and uses music generation AI to create original melodies and accompaniments. The generated melodies are designed to best harmonize with the user's emotions.

[0498] Step 6:

[0499] The server integrates the generated lyrics and music, and creates a visual album that includes photos submitted by the user. This results in digital content that integrates auditory and visual elements.

[0500] Step 7:

[0501] The server sends the completed album to the user's device. The device then displays this digital album so that the user can view, save, or share it with other users.

[0502] (Example 2)

[0503] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0504] Traditional music generation systems evaluate user emotions based on simple text input, making it difficult to generate personalized music that reflects subtle nuances and changes in those emotions. This limits the user experience and fails to provide a musical environment that fully resonates with their feelings.

[0505] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0506] In this invention, the server includes means for receiving emotional information, means for capturing facial expression data and recording voice tone, means for analyzing the emotional information, facial expression data, and voice tone to identify an emotional category, and means for generating lyrics based on the emotional category using a generative AI model. This makes it possible to recognize the user's diverse and subtle emotions with high accuracy and generate music and lyrics individually to match them.

[0507] "Emotional information" refers to input data in which a user expresses their internal emotional state in text or other forms.

[0508] "Image information" refers to images and photographic data related to the user, and is used as an element in interpreting emotions.

[0509] "Facial expression data" refers to data about the user's facial expressions captured through a camera.

[0510] "Voice tone" refers to data related to audio information obtained from the user's speaking style and voice quality.

[0511] An "emotional category" is a classification that indicates the emotional state of a user, identified from the analyzed emotional information.

[0512] A "generative AI model" is an artificial intelligence system used to generate music and lyrics based on emotional information.

[0513] "Musical style" refers to a specific musical genre or characteristic that corresponds to an emotional category.

[0514] This invention relates to a system that analyzes a user's emotions in a multidimensional manner and generates music based on that analysis. The user inputs their emotional information into the system through a terminal, and also provides facial expression data and voice tone using a camera and microphone. The terminal collects this data and transmits it to a server.

[0515] The server analyzes the received data using an emotion engine. Natural language processing techniques and machine learning algorithms are employed to identify emotion categories from the input emotion information, facial expression data, and voice tone. These emotion categories reflect specific emotional states (e.g., "joy," "excitement").

[0516] Next, the server uses a generative AI model to generate lyrics and music based on this emotion category. The generation prompt is an instruction to the AI such as, "Create a song that matches the user's emotions." The generative AI model considers the musical style corresponding to the emotion category and determines the melody and rhythm of the music.

[0517] The generated music and lyrics are integrated and combined with user-provided image information to create a digital album. This album is delivered to the user's device, providing an emotionally rich experience through both sight and sound. For example, if a user feels "excitement" at a sporting event, the server creates an energetic pop song and anthem-like lyrics, which are then delivered as a music album.

[0518] This system generates personalized music content in real time that is fully responsive to the user's emotions, allowing them to enjoy a deeper emotional experience.

[0519] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0520] Step 1:

[0521] Users access the system using a device and input their emotions as text or upload photos. The device collects data by capturing the user's facial expressions with its camera and recording their voice tone with its microphone. This process yields emotional information, facial expression data, and voice data as input data. This input data is packaged and sent to the server.

[0522] Step 2:

[0523] The server receives the emotional information, facial expression data, and voice data transmitted from the terminal. This data is analyzed by an emotion engine to comprehensively evaluate the user's emotions. The analysis uses natural language processing techniques to extract emotions from text and machine learning algorithms to identify emotion categories from the facial expression and voice data. The analysis outputs an emotion category (e.g., "joy," "excitement").

[0524] Step 3:

[0525] The server generates appropriate lyrics using a generative AI model based on the emotion categories identified through analysis. Here, a prompt sentence based on the emotion category (e.g., "Create a song that matches the user's feelings of joy") is supplied to the generative AI model. Based on this prompt, the AI model generates appropriate lyrics, and those lyrics are output.

[0526] Step 4:

[0527] The server generates music based on the same emotion category. The generative AI model links the emotion category to a musical style to create the most appropriate melody and rhythm. This process outputs a music file that matches the emotion.
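
To make the link between an emotion category and a melody concrete, the following toy generator maps a category to a scale and walks up and down that scale. It is a rule-based stand-in, not the generative AI model of the description; the category-to-scale table (including the "sadness" entry), the MIDI note numbering, and the random-walk rule are all assumptions.

```python
import random

MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]   # semitone offsets from the root
MINOR_SCALE = [0, 2, 3, 5, 7, 8, 10]

# Hypothetical mapping from emotion category to scale.
SCALE_BY_EMOTION = {
    "joy": MAJOR_SCALE,
    "excitement": MAJOR_SCALE,
    "sadness": MINOR_SCALE,
}


def generate_melody(emotion_category: str, length: int = 8,
                    root_note: int = 60, seed=None) -> list[int]:
    """Return MIDI note numbers from a random walk over the chosen scale."""
    scale = SCALE_BY_EMOTION.get(emotion_category, MAJOR_SCALE)
    rng = random.Random(seed)  # a fixed seed gives a repeatable melody
    degree = 0
    melody = []
    for _ in range(length):
        melody.append(root_note + scale[degree])
        step = rng.choice([-1, 0, 1])
        # Clamp so the walk stays within one octave of the scale.
        degree = max(0, min(len(scale) - 1, degree + step))
    return melody
```

A production system would replace this with a trained music generation model, but the interface (emotion category in, note sequence out) would be similar.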

[0528] Step 5:

[0529] The server integrates the generated lyrics and music, thereby forming the music content. Simultaneously, it integrates image information provided by the user to create a personalized digital album. This album data becomes the final output.

[0530] Step 6:

[0531] The server delivers a personalized digital album to the user's device. The user can then play this digital album and enjoy their emotional experience visually and aurally.

[0532] (Application Example 2)

[0533] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0534] This invention aims to improve the user experience by generating music based on the user's emotions. However, existing systems have difficulty precisely identifying the user's facial expressions and tone of voice and providing music that matches their emotions in real time. Furthermore, these systems lack means to deepen interaction with the user when the robot plays music.

[0535] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0536] In this invention, the server includes means for receiving emotional information, means for receiving image information, and means for receiving audio information. This makes it possible to recognize the user's complex emotions with high accuracy and generate music and lyrics that match those emotions. Furthermore, by having the robot generate actions in accordance with the generated music, it becomes possible to enrich the interaction with the user.

[0537] "Emotional information" refers to data about emotions obtained based on the user's facial expressions, tone of voice, and other biometric data.

[0538] "Image information" refers to digital image or video data that includes the user's visual characteristics and environmental information.

[0539] "Audio information" refers to audio data that includes the tone and rhythm of the user's voice.

[0540] An "emotion category" is a type of emotion classified based on analyzed emotional information.

[0541] "Means for generating lyrics" refers to techniques for constructing appropriate words and phrases based on emotional categories.

[0542] "Means for generating music" refers to techniques for creating melodies and rhythms that correspond to emotional categories.

[0543] "Means for creating an album" refers to the technology that integrates the generated lyrics and music into a single cohesive whole.

[0544] "Means for generating motion" refers to the technology for designing robot movements that are synchronized with music.

[0545] The system used to realize this application features the ability to recognize the user's emotions in detail and generate music accordingly. Users access the system via a smartphone or home robot terminal and input their emotions and image information. The terminal uses a camera and microphone to capture the user's facial expressions and voice tone in real time. This data is then transmitted from the terminal to a server.

[0546] The server analyzes the emotion category based on the received emotion and voice information. Natural language processing technology and emotion analysis algorithms are used for the analysis. This analysis makes it possible to recognize the complex patterns of the user's emotions with high accuracy.

[0547] Furthermore, the server uses an AI model to generate appropriate lyrics based on the emotional category and then produces music. The music genre is automatically selected based on the emotional category. The generated lyrics and music are integrated into a single album and provided to the user along with image information.

[0548] In addition, the home robot generates movements that match the music being played. The robot can move in sync with the generated music to provide an immersive experience similar to a live performance.

[0549] As a concrete example, imagine a scenario where, upon a user returning from work, the system recognizes their tired expression, generates relaxing music and lyrics, and a household robot performs soothing movements in time with the music. In this way, the individual user experience can be improved.

[0550] An example of a prompt to input into a generative AI model is: "Build a neural network that generates optimal relaxation music based on the user's emotions. Take into account the user's facial expressions and tone of voice."

[0551] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0552] Step 1:

[0553] The device uses a camera and microphone to capture the user's facial expressions and voice. The input data consists of image and audio information, which is collected in real time. This allows the device to acquire biometric information such as the user's facial expressions and voice tone.

[0554] Step 2:

[0555] The device sends the acquired image and audio information to the server as a dataset. The input here is the unprocessed image and audio data sent from the device. The server receives this and passes it on to the next analysis step.

[0556] Step 3:

[0557] The server analyzes the received image and audio information to extract emotional information. Specifically, it uses natural language processing and facial recognition algorithms to identify emotional categories. The input is image and audio data, and the output is emotional categories. This operation identifies the emotional state of the user.

[0558] Step 4:

[0559] The server generates lyrics using a generative AI model based on emotion categories. Emotion categories are used as input, and corresponding lyric data is generated as output. The AI model creates linguistic expressions appropriate to the emotions expressed.

[0560] Step 5:

[0561] The server similarly generates music based on emotion categories. The input is an emotion category, and the output is a corresponding music track. The music generation algorithm automatically creates melodies and rhythms that match the emotion.

[0562] Step 6:

[0563] The server integrates the generated lyrics and music to create an album. Image information is also incorporated, completing a single listening experience. The inputs here are lyrics, music, and image information, and the output is the integrated album. This integration enables a personalized music experience for each user.

[0564] Step 7:

[0565] The robotic device plays the music from the album and generates actions. Emotional information received as user input is reused here, and the robot performs interactive actions synchronized with the music. This synchronizes the music and movement, providing a more immersive experience.
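
How the robot's actions might be synchronized with the music is not detailed in the description. A minimal sketch, assuming the track has a known constant tempo, is to place one motion on each beat; the motion names and function name are hypothetical.

```python
def schedule_motions(tempo_bpm: float, duration_s: float,
                     motions: list[str]) -> list[tuple[float, str]]:
    """Assign one motion to each beat, cycling through the motion list."""
    if tempo_bpm <= 0:
        raise ValueError("tempo_bpm must be positive")
    if not motions:
        raise ValueError("at least one motion is required")
    beat_interval = 60.0 / tempo_bpm  # seconds between beats
    schedule = []
    beat = 0
    while beat * beat_interval < duration_s:
        start_time = round(beat * beat_interval, 3)
        schedule.append((start_time, motions[beat % len(motions)]))
        beat += 1
    return schedule


plan = schedule_motions(120, 2.0, ["sway_left", "sway_right"])
```

At 120 beats per minute the interval is 0.5 seconds, so a 2-second clip yields four scheduled motions that alternate between the two entries.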

[0566] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0567] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompt and outputs the inference result in a data format such as audio data or text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0568] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0569] [Fourth Embodiment]

[0570] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0571] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0572] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0573] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0574] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0575] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0576] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0577] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0578] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0579] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0580] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0581] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0582] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0583] The system of the present invention analyzes emotional information from the emotions a user enters, together with images supplied as visual records, and generates a personalized digital album with music. Users access the system via a terminal and input emotions and photos of the experiences they want to record. This allows the creation of unique musical works based on the user's memories.

[0584] When a user enters information, the device sends that data to the server. The server receives this information and identifies the emotion category by analyzing the text data using natural language processing technology. Once the emotion category is identified, the server uses generative AI to generate highly relevant lyrics. Simultaneously, the server uses music generation technology to generate music corresponding to the selected emotion category. This process is carried out while taking into account the characteristics of the chosen music genre.

[0585] Once the generated lyrics and music are complete, the server integrates them and combines them with image information to create a digital album. The final album is sent to the user's device, where the user can view, save, or share it with others. The album created in this process will have high emotional value for the user, providing a means to record and preserve important moments in life through music.

[0586] For example, if a user wants to record a memorable moment at their graduation ceremony, they would input the emotion as "memorable" and upload a graduation photo to the system. In this case, the server would generate lyrics based on the emotion category "memorable," select a classical music genre, and create music. Afterwards, the user would be provided with a completed album containing their photo. In this way, users can record and reminisce about important moments through music.

[0587] The following describes the processing flow.

[0588] Step 1:

[0589] The user enters emotions and related photos into the device through a dedicated interface. The device receives this user input and sends it to the server as formatted data.

[0590] Step 2:

[0591] The server receives emotion and image information transmitted from the terminal. It performs validation to ensure the data is correctly interpretable, particularly verifying that the emotion information is in the expected format.

[0592] Step 3:

[0593] The server analyzes the received emotion information using natural language processing (NLP) techniques. This identifies the main emotional themes in the input text and assigns the corresponding emotion categories.

[0594] Step 4:

[0595] The server utilizes a generative AI model based on identified emotion categories to automatically generate lyrics that match the user's emotions. This process pays attention to the nuances and message of the words.

[0596] Step 5:

[0597] The server determines the emotion category and associated music genre, and then executes a music generation algorithm based on that. This creates an original melody and accompaniment that harmonizes with the emotion.

[0598] Step 6:

[0599] The server integrates the generated lyrics and music, and further adds user-provided image information as visual elements of the album. This creates a digital album that appeals to the user both visually and aurally.

[0600] Step 7:

[0601] The server sends the completed album to the device. The device displays the received album data so that the user can view it and supports actions such as saving and sharing.
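The seven steps above can be sketched as a single server-side function. This is a minimal illustration, not the implementation of this disclosure: the names `AlbumRequest`, `Album`, `validate`, and `build_album` are hypothetical, and the analysis and generation stages are passed in as callables so that any concrete model or tool could be substituted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlbumRequest:
    emotion_text: str   # free-text emotion entered on the terminal (Step 1)
    image_bytes: bytes  # photo uploaded with it (Step 1)


@dataclass
class Album:
    emotion_category: str
    genre: str
    lyrics: str
    music: bytes
    image_bytes: bytes


def validate(req: AlbumRequest) -> None:
    """Step 2: reject input that the later steps could not interpret."""
    if not isinstance(req.emotion_text, str) or not req.emotion_text.strip():
        raise ValueError("emotion text must be a non-empty string")
    if not req.image_bytes:
        raise ValueError("image data is missing")


def build_album(
    req: AlbumRequest,
    classify: Callable[[str], str],       # Step 3: text -> emotion category
    write_lyrics: Callable[[str], str],   # Step 4: category -> lyrics
    pick_genre: Callable[[str], str],     # Step 5: category -> genre
    compose: Callable[[str, str], bytes], # Step 5: (category, genre) -> music
) -> Album:
    validate(req)
    category = classify(req.emotion_text)
    lyrics = write_lyrics(category)
    genre = pick_genre(category)
    music = compose(category, genre)
    # Step 6: bundle lyrics, music, and the user's image into one album object.
    return Album(category, genre, lyrics, music, req.image_bytes)


# Usage with placeholder stages (all stand-ins, for illustration only).
album = build_album(
    AlbumRequest("memorable", b"\xff\xd8fake-jpeg"),
    classify=lambda text: text.strip().lower(),
    write_lyrics=lambda cat: f"(lyrics about a {cat} day)",
    pick_genre=lambda cat: "classical",
    compose=lambda cat, genre: b"fake-audio",
)
```

Step 7, sending the album back to the device, is left out because it depends on the transport the terminal and server share.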

[0602] (Example 1)

[0603] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0604] In today's information society, there is a growing need to express and preserve personal memories and emotions in a richer way. However, there is a lack of readily available systems that allow users to easily generate content that reflects their emotions at any given time and save it in a personalized manner. Therefore, there is a need for means to express users' own emotions and memories in an integrated visual and auditory way.

[0605] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0606] In this invention, the server includes a device for inputting emotion data, a device for inputting image data, and a processing device for analyzing the emotion data and identifying emotion categories. This makes it possible to generate content based on the individual emotions of the user.

[0607] A "device for inputting emotional data" is a device that provides an interface for users to input their own emotions in text or voice.

[0608] A "device for inputting image data" is a device that has an interface for users to upload images and photographs.

[0609] A "processing device that analyzes emotional data and identifies emotional categories" is a device that analyzes input emotional data using natural language processing technology and other methods to determine a specific emotional category.

[0610] A "processing device that generates lyrics using generative AI technology" is a device that generates prompts based on specific emotional categories, inputs them into a generative AI model, and automatically generates lyrics.

[0611] A "processing device that generates music using a music generation algorithm" is a device that has the function of selecting a suitable musical style from a specific emotional category and creating music based on that style.

[0612] A "processing device that integrates generated lyrics and music to create a digital album" is a device that combines generated lyrics and music, and further integrates them with image data to create a digital album.

[0613] A "processing device for incorporating image data into a digital album" is a device that integrates image data provided by the user along with generated music and lyrics, and incorporates visual elements into the completed digital content.

[0614] This invention is a system that generates personalized digital albums based on the user's emotions and images. First, the user inputs their emotions as text and uploads images to the system via a terminal. The terminal sends the input data to a server, where the data is analyzed and processed.

[0615] The server analyzes the received emotional data using natural language processing technologies (such as NLTK and SpaCy) to identify specific emotional categories. Based on these analysis results, it generates lyrics using generative AI technology. Specifically, prompt sentences are created as input to the generative AI model, giving instructions to the AI in the form of, for example, "Please generate lyrics that reflect deeply moving emotions."
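As a rough sketch of this step, the following uses a small keyword lexicon in place of an NLTK or SpaCy pipeline so that it runs with the standard library alone. The lexicon contents, the function names, and the prompt template are all assumptions made for illustration; a production system would more likely use a trained classifier.

```python
import re
from collections import Counter

# Hypothetical lexicon: emotion category -> cue words.
EMOTION_LEXICON = {
    "deeply moved": {"moved", "memorable", "touching", "graduation", "proud"},
    "fun": {"fun", "happy", "picnic", "laugh", "enjoyed"},
    "calm": {"calm", "relaxed", "quiet", "peaceful"},
}


def identify_emotion_category(text: str, default: str = "neutral") -> str:
    """Return the category whose cue words occur most often in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for category, cues in EMOTION_LEXICON.items():
        counts[category] = sum(1 for token in tokens if token in cues)
    category, hits = counts.most_common(1)[0]
    return category if hits > 0 else default


def build_lyrics_prompt(category: str) -> str:
    """Turn the identified category into an instruction for a generative model."""
    return f"Please generate lyrics that reflect the emotion '{category}'."


category = identify_emotion_category("The graduation ceremony was so memorable")
prompt = build_lyrics_prompt(category)
```

The prompt string would then be passed to whichever generative AI model the server uses; that call is omitted here because the model and its API are not specified.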

[0616] Next, the server uses a music generation algorithm to create music corresponding to the identified emotion category. For example, it might select a classical music genre to match the emotion category "deeply moved," and then generate a melody using a music generation tool. Any suitable music generation software may serve as this tool.

[0617] Furthermore, the server integrates the generated lyrics and music, and combines them with images uploaded by the user. This creates a digital album in which visual information, music, and lyrics are unified. This album is rendered using a multimedia editing library such as FFmpeg.
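One plausible way to perform this rendering is to have FFmpeg loop a single still image for the length of the generated audio track. The sketch below only builds the argument list; the helper name and file paths are hypothetical, and embedding the lyrics (for example as subtitles) is not covered.

```python
from typing import List


def ffmpeg_still_image_args(image_path: str, audio_path: str, out_path: str) -> List[str]:
    """Build an FFmpeg command that pairs one image with one audio track."""
    return [
        "ffmpeg", "-y",                   # overwrite the output if it exists
        "-loop", "1", "-i", image_path,   # repeat the single image as the video stream
        "-i", audio_path,                 # generated music
        "-c:v", "libx264", "-tune", "stillimage",
        "-pix_fmt", "yuv420p",            # widely playable pixel format
        "-c:a", "aac",
        "-shortest",                      # stop when the audio ends
        out_path,
    ]


args = ffmpeg_still_image_args("graduation.jpg", "song.wav", "album.mp4")
```

The list can be passed to `subprocess.run(args, check=True)` on a machine where FFmpeg is installed with libx264 support.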

[0618] The completed digital album is sent from the server to the user's device, where the user can view, save, and even share it with other users. For example, if a user wants to record a "memorable" moment at their graduation ceremony, an album will be generated that combines lyrics based on those emotions with classical music. This system allows users to emotionally record important moments and easily relive them.

[0619] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0620] Step 1:

[0621] The user inputs emotion data and image data through their device. Specifically, they input emotions in text format and upload image files. The system receives text data of emotions and image files as input.

[0622] Step 2:

[0623] The terminal sends the input emotion data and image data to the server. The data is transmitted in a secure format using the HTTP protocol. The output of this step is the emotion text and image data received by the server.

[0624] Step 3:

[0625] The server analyzes the received emotion data using natural language processing (NLP) techniques to identify emotion categories. This process utilizes NLP libraries, such as NLTK or SpaCy, to analyze the input text data. The output of this analysis is the identified emotion category.

[0626] Step 4:

[0627] The server generates lyrics using a generative AI model based on the identified emotion categories. To do this, it creates a prompt and sends it to the model. An example of a prompt is "Please generate lyrics that express a deeply moving emotion." The output of the AI processing is the lyrics generated based on the emotion.

[0628] Step 5:

[0629] The server uses a music generation algorithm to produce music corresponding to the emotion category. For example, it might generate classical music for the emotion category "deeply moved." This step requires a music generation tool, and the output is the generated music data.

[0630] Step 6:

[0631] The server integrates the generated lyrics and music and combines them with image data. This integration process utilizes multimedia editing libraries such as FFmpeg. The output is a completed digital album.

[0632] Step 7:

[0633] The server sends the completed digital album to the device. The user can view, save, and share the album with other users via the device. The output is the completed album data accessible to the user.

[0634] (Application Example 1)

[0635] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0636] A challenge lies in the lack of means to emotionally and visually record important family and personal moments and memories. Furthermore, there is a need to ensure that recorded digital content is preserved not merely as data, but in a way that imbues it with emotional value.

[0637] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0638] In this invention, the server includes means for receiving emotional information, means for receiving image information, and means for converting speech to text. This enables the automatic generation of digital albums with music based on the user's emotions.

[0639] "Means for receiving emotional information" refers to a device or method for electronically receiving emotional data entered by a user.

[0640] "Means for receiving image information" refers to a device or method for receiving and recording image data provided by a user.

[0641] "Means for converting speech to text" refers to a technology or device that analyzes speech data and converts it into corresponding text data.

[0642] "Means for analyzing emotional information and identifying emotional categories" refers to a technology or process for analyzing input emotional data and classifying it into a specific emotional category.

[0643] "Means for generating lyrics" refers to a technology or device for automatically producing appropriate lyrics based on a specific emotional category.

[0644] "Means for generating music" refers to technologies or devices for composing or synthesizing music that corresponds to emotional categories.

[0645] "Means for integrating generated lyrics and music to create an album" refers to a technology or method for combining created lyrics and music and structuring them in an album format along with visual elements.

[0646] "Means for incorporating image information into an album" refers to a method or apparatus for incorporating user-provided images into a digital album to create a completed album.

[0647] To implement this invention, the user begins by inputting emotional and image information into the system via a terminal. The terminal receives voice input and utilizes speech recognition technology to convert the speech into text. This technology utilizes the Google Cloud Speech-to-Text API.

[0648] Data sent from the terminal is aggregated on the server. The server analyzes the received text data using natural language processing technology and identifies the emotion category. Python and the natural language processing library NLTK are used for the analysis. Once the emotion category is identified, the server automatically generates the corresponding lyrics using a generative AI model. OpenAI's GPT model is used for lyric generation.

[0649] Next, the server generates appropriate music using music generation technology. Music generation libraries such as Magenta are useful for genre selection. This creates music that matches the specified emotional category.

[0650] The generated lyrics and music are integrated on a server, and combined with image information to complete the digital album. Media manipulation libraries such as FFmpeg are used for this integration. Finally, the album is sent to the user's device, where it can be viewed or shared.

[0651] For example, if a user inputs something like, "I want to record the fun memories of the family picnic I had today," the audio is converted to text, and the emotion category "fun" is identified. Based on this, cheerful and upbeat lyrics and music are generated, and a digital album is created along with photos. A prompt like the following might be used: "I had a picnic with my family today. It was so much fun. I want to record those memories. The emotion category is 'fun,' and I want lyrics and music that convey the feeling of the picnic."

[0652] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0653] Step 1:

[0654] The device receives voice input from the user. The user provides image information they want to record along with a specific emotion. The voice data is converted into text data using the Google Cloud Speech-to-Text API. The input is voice data, and the output is text data.

[0655] Step 2:

[0656] Text data is sent to a server, which then uses natural language processing techniques to analyze it. Python and the NLTK library are utilized to identify emotion categories. The input is text data, and the output is emotion categories.

[0657] Step 3:

[0658] The server generates relevant lyrics based on emotion categories using an OpenAI GPT model. The input is an emotion category, and the output is the generated lyrics. In this process, the generative AI model understands the emotion category used as a prompt and creates appropriate lyrics.

[0659] Step 4:

[0660] The server generates music based on emotion categories. It uses music generation libraries such as Magenta to compose music that matches the emotion. The input is the emotion category, and the output is the generated music.

[0661] Step 5:

[0662] The generated lyrics and music are integrated on the server. Image information is also incorporated as part of the album. Media manipulation libraries such as FFmpeg are used to integrate them into a slideshow format. The input consists of lyrics, music, and image information, and the output is a digital album.

[0663] Step 6:

[0664] The server sends the completed digital album to the user's device. The user can view and save the digital album on their device. The input is the digital album, and the output is the album displayed on the user's device.
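For the slideshow format mentioned in Step 5, one option is FFmpeg's concat demuxer, which reads a text file listing each image and how long to show it. The sketch below generates that list and splits the track length evenly across the photos; the function names and the even split are assumptions, not something the source specifies.

```python
from typing import List


def _quote(path: str) -> str:
    # Escape single quotes for a single-quoted entry in the list file.
    return path.replace("'", "'\\''")


def slideshow_concat_list(image_paths: List[str], track_seconds: float) -> str:
    """Return concat-demuxer text showing each image for an equal share of the track."""
    if not image_paths:
        raise ValueError("at least one image is required")
    per_image = track_seconds / len(image_paths)
    lines = ["ffconcat version 1.0"]
    for path in image_paths:
        lines.append(f"file '{_quote(path)}'")
        lines.append(f"duration {per_image:.3f}")
    # The final image is listed once more without a duration, a commonly
    # recommended workaround so that the last duration is honored.
    lines.append(f"file '{_quote(image_paths[-1])}'")
    return "\n".join(lines) + "\n"


listing = slideshow_concat_list(["a.jpg", "b.jpg"], 10.0)
```

The resulting text would be written to a file and supplied to FFmpeg with `-f concat`, together with the generated music as a second input.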

[0665] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0666] The present invention aims to recognize a user's emotions in more detail and generate music that aligns with those emotions. To this end, the system incorporates an emotion engine that can recognize emotions by analyzing not only the emotional information input by the user, but also the user's facial expressions, voice tone, and biometric information. Because this emotion engine handles multiple sensory input data, it achieves highly accurate and reliable emotion recognition.

[0667] Users access the system using their devices and input their emotions and photos. Furthermore, the emotion engine uses the camera and microphone to capture and analyze the user's facial expressions and voice in real time. The device then sends this data to a server. The server aggregates and processes the transmitted data, using natural language processing techniques and machine learning algorithms to identify emotion categories. This process makes it possible to create music that takes into account not only the user's temporary emotions but also emotional changes and complex emotional patterns.

[0668] The server uses a generative AI to create appropriate lyrics based on the emotion category identified by the emotion engine. The music generation process considers the emotion category and corresponding music genre to generate a melody and rhythm that best fits the current emotion. The generated content is integrated into a single album along with image information provided by the user.

[0669] For example, if a user feels happy at a sporting event, the emotion engine analyzes not just the emotion of "happiness," but also the smile on their face and the excited tone of their voice, which are expressed in real time. As a result, the server generates a music album containing lively pop music and lyrics like an anthem, and provides it to the user's device. In this way, it is possible to create a music album that is more deeply intertwined with the user's experience and emotions.

[0670] This invention enhances the user experience and enables music generation that closely responds to individual emotions. This system enriches everyday emotions and allows for more immersive memory recording.

[0671] The following describes the processing flow.

[0672] Step 1:

[0673] The user accesses the system using a device. The user inputs photos related to their emotions and the situation at the time, and activates the camera and microphone as needed. The device sends the input text data and photo data to the server.

[0674] Step 2:

[0675] With the user's permission, the device captures the user's facial expressions and voice tone in real time via the camera and microphone. This data is processed by an emotion engine for emotion recognition and then sent to a server.

[0676] Step 3:

[0677] The server analyzes incoming text, photos, and real-time data from the emotion engine. It uses natural language processing techniques to analyze text data and identify emotion categories. It also extracts additional emotion information from facial expression data and voice tone to determine an overall emotion category.

[0678] Step 4:

[0679] The server uses AI to automatically generate appropriate lyrics based on identified emotion categories. At this stage, appropriate words that fit the emotional tone are selected.

[0680] Step 5:

[0681] The server takes into account the emotion category and selected music genre, and uses music generation AI to create original melodies and accompaniments. The generated melodies are designed to best harmonize with the user's emotions.

[0682] Step 6:

[0683] The server integrates the generated lyrics and music, and creates a visual album that includes photos submitted by the user. This results in digital content that integrates auditory and visual elements.

[0684] Step 7:

[0685] The server sends the completed album to the user's device. The device then displays this digital album so that the user can view, save, or share it with other users.
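Step 3 combines text, facial expression, and voice tone into one overall emotion category. The source does not say how the modalities are combined, so the following is only one possible approach: a weighted average of per-modality scores, with hypothetical weights and names, that renormalizes when a modality is missing.

```python
from typing import Dict, Mapping

# Hypothetical relative trust in each modality.
MODALITY_WEIGHTS = {"text": 0.5, "face": 0.3, "voice": 0.2}


def fuse_emotion_scores(
    per_modality: Mapping[str, Mapping[str, float]],
    weights: Mapping[str, float] = MODALITY_WEIGHTS,
) -> str:
    """Return the category with the highest weighted score across modalities."""
    present = {
        name: scores
        for name, scores in per_modality.items()
        if scores and weights.get(name, 0.0) > 0.0
    }
    if not present:
        raise ValueError("no usable modality scores")
    total_weight = sum(weights[name] for name in present)
    fused: Dict[str, float] = {}
    for name, scores in present.items():
        share = weights[name] / total_weight  # renormalize over available inputs
        for category, score in scores.items():
            fused[category] = fused.get(category, 0.0) + share * score
    return max(fused, key=lambda category: fused[category])


overall = fuse_emotion_scores({
    "text": {"happiness": 0.6, "calm": 0.4},
    "face": {"happiness": 0.9, "calm": 0.1},
    "voice": {"excitement": 0.8, "happiness": 0.2},
})
```

Renormalizing over the modalities that are present lets the same function handle the case where the user has not granted camera or microphone access.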

[0686] (Example 2)

[0687] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0688] Traditional music generation systems evaluate user emotions based on simple text input, making it difficult to generate personalized music that reflects subtle nuances and changes in those emotions. This limits the user experience and fails to provide a musical environment that fully resonates with their feelings.

[0689] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0690] In this invention, the server includes means for receiving emotional information, means for capturing facial expression data and recording voice tone, means for analyzing the emotional information, facial expression data, and voice tone to identify an emotional category, and means for generating lyrics based on the emotional category using a generative AI model. This makes it possible to recognize the user's diverse and subtle emotions with high accuracy and generate music and lyrics individually to match them.

[0691] "Emotional information" refers to input data in which a user expresses their internal emotional state in text or other forms.

[0692] "Image information" refers to images and photographic data related to the user, and is used as an element in interpreting emotions.

[0693] "Facial expression data" refers to data about the user's facial expressions captured through a camera.

[0694] "Voice tone" refers to data related to audio information obtained from the user's speaking style and voice quality.

[0695] An "emotional category" is a classification that indicates the emotional state of a user, identified from the analyzed emotional information.

[0696] A "generative AI model" is an artificial intelligence system used to generate music and lyrics based on emotional information.

[0697] "Musical style" refers to a specific musical genre or characteristic that corresponds to an emotional category.

[0698] This invention relates to a system that analyzes a user's emotions in a multidimensional manner and generates music based on that analysis. The user inputs their emotional information into the system through a terminal, and also provides facial expression data and voice tone using a camera and microphone. The terminal collects this data and transmits it to a server.

[0699] The server analyzes the received data using an emotion engine. Natural language processing techniques and machine learning algorithms are employed to identify emotion categories from the input emotion information, facial expression data, and voice tone. These emotion categories reflect specific emotional states (e.g., "joy," "excitement").

[0700] Next, the server uses a generative AI model to generate lyrics and music based on this emotion category. The generation prompt is an instruction to the AI such as, "Create a song that matches the user's emotions." The generative AI model considers the musical style corresponding to the emotion category and determines the melody and rhythm of the music.

[0701] The generated music and lyrics are integrated and combined with user-provided image information to create a digital album. This album is delivered to the user's device, providing an emotionally rich experience through both sight and sound. For example, if a user feels "excitement" at a sporting event, the server creates an energetic pop song and anthem-like lyrics, which are then delivered as a music album.

[0702] This system generates personalized music content in real time that is fully responsive to the user's emotions, allowing them to enjoy a deeper emotional experience.

[0703] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0704] Step 1:

[0705] Users access the system using a device and input their emotions as text or upload photos. The device collects data by capturing the user's facial expressions with its camera and recording their voice tone with its microphone. This process yields emotional information, facial expression data, and voice data as input data. This input data is packaged and sent to the server.

[0706] Step 2:

[0707] The server receives emotional information, facial expression data, and voice data transmitted from the terminal. This data is analyzed by an emotion engine to comprehensively evaluate the user's emotions. This analysis uses natural language processing techniques to extract emotions from text and machine learning algorithms to identify emotion categories from facial expression and voice data. The analysis results output an emotion category (e.g., "joy," "excitement").

[0708] Step 3:

[0709] The server generates appropriate lyrics using a generative AI model based on the emotion categories identified through analysis. Here, a prompt sentence based on the emotion category (e.g., "Create a song that matches the user's feelings of joy") is supplied to the generative AI model. Based on this prompt, the AI model generates appropriate lyrics, and those lyrics are output.

[0710] Step 4:

[0711] The server generates music based on the same emotion category. The generative AI model links emotion categories with musical styles to create the most appropriate melody and rhythm. This process outputs a music file that matches the emotion.

[0712] Step 5:

[0713] The server integrates the generated lyrics and music, thereby forming the music content. Simultaneously, it integrates image information provided by the user to create a personalized digital album. This album data becomes the final output.

[0714] Step 6:

[0715] The server delivers a personalized digital album to the user's device. The user can then play this digital album and enjoy their emotional experience visually and aurally.
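The link between emotion categories and musical styles in Step 4 could be represented as a simple lookup table. The categories follow the examples in the text, but the tempo and mode values, the default style, and all names below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MusicalStyle:
    genre: str
    tempo_bpm: int
    mode: str  # "major" or "minor"


# Hypothetical table; the numeric values are placeholders.
STYLE_BY_EMOTION = {
    "joy": MusicalStyle("pop", 120, "major"),
    "excitement": MusicalStyle("pop", 140, "major"),
    "deeply moved": MusicalStyle("classical", 72, "major"),
    "tired": MusicalStyle("ambient", 60, "major"),
}
DEFAULT_STYLE = MusicalStyle("acoustic", 90, "major")


def style_for(category: str) -> MusicalStyle:
    """Look up the musical style for a category, falling back to a default."""
    return STYLE_BY_EMOTION.get(category.strip().lower(), DEFAULT_STYLE)


style = style_for("Excitement")
```

The selected style would then condition the music generation model, for example by being included in its prompt or passed as generation parameters.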

[0716] (Application Example 2)

[0717] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0718] This invention aims to improve the user experience by generating music based on the user's emotions. However, existing systems have difficulty precisely identifying the user's facial expressions and tone of voice and providing music that matches their emotions in real time. Furthermore, these systems lack means to deepen interaction with the user when the robot plays music.

[0719] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0720] In this invention, the server includes means for receiving emotional information, means for receiving image information, and means for receiving audio information. This makes it possible to recognize the user's complex emotions with high accuracy and generate music and lyrics that match those emotions. Furthermore, by having the robot generate actions in accordance with the generated music, it becomes possible to enrich the interaction with the user.

[0721] "Emotional information" refers to data about emotions obtained based on the user's facial expressions, tone of voice, and other biometric data.

[0722] "Image information" refers to digital image or video data that includes the user's visual characteristics and environmental information.

[0723] "Audio information" refers to audio data that includes the tone and rhythm of the user's voice.

[0724] An "emotion category" is a type of emotion classified based on analyzed emotional information.

[0725] "Means for generating lyrics" refers to techniques for constructing appropriate words and phrases based on emotional categories.

[0726] "Means for generating music" refers to techniques for creating melodies and rhythms that correspond to emotional categories.

[0727] "Means for creating an album" refers to the technology that integrates the generated lyrics and music into a single cohesive whole.

[0728] "Means for generating motion" refers to the technology for designing robot movements that are synchronized with music.

[0729] The system used to realize this application features the ability to recognize the user's emotions in detail and generate music accordingly. Users access the system via a smartphone or home robot terminal and input their emotions and image information. The terminal uses a camera and microphone to capture the user's facial expressions and voice tone in real time. This data is then transmitted from the terminal to a server.

[0730] The server analyzes the emotion category based on the received emotion and voice information. Natural language processing technology and emotion analysis algorithms are used for the analysis. This analysis makes it possible to recognize the complex patterns of the user's emotions with high accuracy.

[0731] Furthermore, the server uses an AI model to generate appropriate lyrics based on the emotional category and then produces music. The music genre is automatically selected based on the emotional category. The generated lyrics and music are integrated into a single album and provided to the user along with image information.

[0732] In addition, the home robot generates movements that match the music being played. The robot can move in sync with the generated music to provide an immersive experience similar to a live performance.

[0733] As a concrete example, imagine a scenario where, upon a user returning from work, the system recognizes their tired expression, generates relaxing music and lyrics, and a household robot performs soothing movements in time with the music. In this way, the individual user experience can be improved.

[0734] An example of a prompt to input into a generative AI model is: "Build a neural network that generates optimal relaxation music based on the user's emotions. Take into account the user's facial expressions and tone of voice."

[0735] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0736] Step 1:

[0737] The device uses a camera and microphone to capture the user's facial expressions and voice. The input data consists of image and audio information, which is collected in real time. This allows the device to acquire biometric information such as the user's facial expressions and voice tone.

[0738] Step 2:

[0739] The device sends the acquired image and audio information to the server as a dataset. The input here is the unprocessed image and audio data sent from the device. The server receives this and passes it on to the next analysis step.

[0740] Step 3:

[0741] The server analyzes the received image and audio information to extract emotional information. Specifically, it uses natural language processing and facial recognition algorithms to identify emotional categories. The input is image and audio data, and the output is emotional categories. This operation identifies the emotional state of the user.

[0742] Step 4:

[0743] The server generates lyrics using a generative AI model based on emotion categories. Emotion categories are used as input, and corresponding lyric data is generated as output. The AI model creates linguistic expressions appropriate to the emotions expressed.

[0744] Step 5:

[0745] The server similarly generates music based on emotion categories. The input is an emotion category, and the output is a corresponding music track. The music generation algorithm automatically creates melodies and rhythms that match the emotion.

[0746] Step 6:

[0747] The server integrates the generated lyrics and music to create an album. Image information is also incorporated, completing a single listening experience. The inputs here are lyrics, music, and image information, and the output is the integrated album. This integration enables a personalized music experience for each user.

[0748] Step 7:

[0749] The robotic device plays the music from the album and generates actions. Emotional information received as user input is reused here, and the robot performs interactive actions synchronized with the music. This synchronizes the music and movement, providing a more immersive experience.
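Steps 3 to 7 above can be sketched as a single server-side function that chains the stages. This is a hedged outline rather than the disclosed implementation: the `Album` type, the function `run_pipeline`, and the four callables passed to it are hypothetical placeholders for the emotion analysis, lyric generation, music generation, and robot motion planning components.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Album:
    """Hypothetical container for the integrated result of Step 6."""
    lyrics: str
    music: bytes
    image: bytes
    emotion_category: str


def run_pipeline(
    image: bytes,
    audio: bytes,
    identify_emotion: Callable[[bytes, bytes], str],
    generate_lyrics: Callable[[str], str],
    generate_music: Callable[[str], bytes],
    plan_motion: Callable[[bytes, str], List[str]],
) -> Tuple[Album, List[str]]:
    # Step 3: image and audio data -> emotion category
    category = identify_emotion(image, audio)
    # Step 4: emotion category -> lyrics
    lyrics = generate_lyrics(category)
    # Step 5: emotion category -> music track
    music = generate_music(category)
    # Step 6: lyrics + music + image -> album
    album = Album(lyrics=lyrics, music=music, image=image, emotion_category=category)
    # Step 7: the emotion information is reused to plan robot motions for playback
    motions = plan_motion(album.music, category)
    return album, motions


# Usage with stand-in components (real models would replace these lambdas).
album, motions = run_pipeline(
    b"image-bytes",
    b"audio-bytes",
    identify_emotion=lambda img, aud: "fatigue",
    generate_lyrics=lambda cat: f"lyrics for {cat}",
    generate_music=lambda cat: b"track-bytes",
    plan_motion=lambda track, cat: ["slow sway"],
)
```

Passing the stages in as callables keeps the sketch independent of any particular model or robot API.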

[0750] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0751] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, and inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
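The input to the data generation model 58 described in [0751], a prompt plus inference data in one or more modalities, can be modeled as a small request structure. The sketch below is an assumption for illustration: the class `InferenceRequest`, its field names, and the rule that at least one kind of inference data must be present are not specified in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InferenceRequest:
    """Hypothetical bundle of a prompt and its inference data."""
    prompt: str
    text: Optional[str] = None
    audio: Optional[bytes] = None
    image: Optional[bytes] = None

    def modalities(self) -> List[str]:
        """Names of the inference data fields that are actually present."""
        return [
            name
            for name in ("text", "audio", "image")
            if getattr(self, name) is not None
        ]

    def validate(self) -> None:
        """Raise ValueError if the request is unusable (assumed rules)."""
        if not self.prompt.strip():
            raise ValueError("prompt must contain an instruction")
        if not self.modalities():
            raise ValueError("at least one kind of inference data is required")
```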

[0752] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0753] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping. Specifically, the specific mapping may be an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0754] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0755] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0756] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion is located, the more visible it becomes (that is, the more it is expressed in actions).
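The geometry of the emotion map 400 described in [0754] to [0756] can be illustrated with plain coordinates centered on the map. This is a simplified sketch under stated assumptions: the function `describe_position`, the coordinate convention, and the `inner_radius` threshold separating inner from outer emotions are hypothetical and are not values given in the disclosure.

```python
import math


def describe_position(x: float, y: float, inner_radius: float = 0.5) -> dict:
    """Describe a point on the map; (0, 0) is the center of the circles.

    Assumed convention: x < 0 is the left side (reaction), x > 0 is the
    right side (situation), y > 0 is pleasure, y < 0 is displeasure.
    """
    radius = math.hypot(x, y)
    return {
        "radius": radius,
        # Inner emotions are more primitive; outer ones are expressed in actions.
        "layer": "inner" if radius < inner_radius else "outer",
        # Points on the vertical axis involve both reaction and situation.
        "side": "reaction" if x < 0 else "situation" if x > 0 else "both",
        "valence": "pleasure" if y > 0 else "displeasure" if y < 0 else "neutral",
    }
```

For instance, `describe_position(0.8, 0.0)` describes a point at the 3 o'clock position: on the situation side, in the outer layer, with neutral valence.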

[0757] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
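The balance-based rule in [0757], discomfort when a balance moves away from its ideal and pleasure when it moves toward it, can be sketched as a comparison of distances. The function name `balance_feeling` and the use of a single scalar value such as battery level are assumptions made for illustration.

```python
def balance_feeling(previous: float, current: float, ideal: float) -> str:
    """Classify a change in one balance value relative to its ideal.

    Moving closer to the ideal yields "pleasure", moving away yields
    "discomfort", and no change in distance yields "neutral".
    """
    before = abs(previous - ideal)
    after = abs(current - ideal)
    if after < before:
        return "pleasure"
    if after > before:
        return "discomfort"
    return "neutral"


# Example: a robot whose ideal battery level is 1.0 (fully charged).
charging = balance_feeling(previous=0.4, current=0.7, ideal=1.0)
draining = balance_feeling(previous=0.7, current=0.4, ideal=1.0)
```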

[0758] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0759] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
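The last step of [0759], turning the emotion values output by the neural network into a determined emotion, can be sketched as follows. The network itself is not reproduced here: the emotion values in the example, the `tolerance` parameter, and the function names `dominant_emotion` and `similar_emotions` are hypothetical.

```python
from typing import Dict, List


def dominant_emotion(values: Dict[str, float]) -> str:
    """Return the emotion with the highest emotion value."""
    return max(values, key=values.get)


def similar_emotions(values: Dict[str, float], tolerance: float = 0.1) -> List[str]:
    """Return emotions whose value is within `tolerance` of the highest
    value, ordered from highest to lowest value."""
    top = max(values.values())
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    return [name for name, value in ranked if top - value <= tolerance]


# Assumed output of the pre-trained network for one user input.
emotion_values = {"reassured": 0.81, "calm": 0.78, "confident": 0.75, "anxious": 0.10}
```

With these assumed values, emotions that sit close together on the map ("reassured," "calm," and "confident") receive similar values and are returned together by `similar_emotions`, while `dominant_emotion` selects "reassured."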

[0760] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0761] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0762] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0763] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0764] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0765] The following types of processors can be used as hardware resources to perform specific processing. One example is a CPU, which is a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which are processors having circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0766] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0767] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0768] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0769] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0770] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0771] The following is further disclosed regarding the embodiments described above.

[0772] (Claim 1)

[0773] Means of receiving emotional information,

[0774] Means of receiving image information,

[0775] A means for analyzing the aforementioned emotional information and identifying emotional categories,

[0776] Means for generating lyrics based on the aforementioned emotion categories,

[0777] Means for generating music based on the aforementioned emotional categories,

[0778] A means of integrating generated lyrics and music to create an album,

[0779] A means for incorporating image information into the aforementioned album,

[0780] A system comprising the above means.

[0781] (Claim 2)

[0782] The system according to claim 1, which utilizes natural language processing technology for analyzing the aforementioned emotional information.

[0783] (Claim 3)

[0784] The system according to claim 1, which selects a music genre corresponding to an emotional category in the generation of the aforementioned music.

[0785] "Example 1"

[0786] (Claim 1)

[0787] A device for inputting emotional data,

[0788] A device for inputting image data,

[0789] A processing device that analyzes the aforementioned emotion data and identifies emotion categories,

[0790] A processing device that generates lyrics using generation AI technology based on the aforementioned emotion categories,

[0791] A processing device that generates music using a music generation algorithm based on the aforementioned emotion categories,

[0792] A processing device that integrates generated lyrics and music to produce a digital album,

[0793] A processing device for incorporating image data into the aforementioned digital album,

[0794] A system comprising the above devices.

[0795] (Claim 2)

[0796] The system according to claim 1, wherein natural language processing technology is applied to the analysis of the aforementioned emotion data.

[0797] (Claim 3)

[0798] The system according to claim 1, which selects a musical style that fits the emotional category in the generation of the aforementioned music.

[0799] "Application Example 1"

[0800] (Claim 1)

[0801] Means of receiving emotional information,

[0802] Means of receiving image information,

[0803] A means of converting speech to text,

[0804] A means for analyzing the aforementioned emotional information and identifying emotional categories,

[0805] Means for generating lyrics based on the aforementioned emotion categories,

[0806] Means for generating music based on the aforementioned emotional categories,

[0807] A means of integrating generated lyrics and music to create an album,

[0808] A means for incorporating image information into the aforementioned album,

[0809] A system comprising the above means.

[0810] (Claim 2)

[0811] The system according to claim 1, which utilizes natural language processing technology for analyzing the aforementioned emotional information.

[0812] (Claim 3)

[0813] The system according to claim 1, which selects a music genre corresponding to an emotional category in the generation of the aforementioned music.

[0814] "Example 2 when combined with an emotion engine"

[0815] (Claim 1)

[0816] Means of receiving emotional information,

[0817] Means of receiving image information,

[0818] A means of capturing facial expression data and recording voice tone,

[0819] A means for analyzing the aforementioned emotional information, facial expression data, and voice tone to identify an emotional category,

[0820] A means for generating lyrics based on the aforementioned emotion category using a generative AI model,

[0821] Means for generating music based on the aforementioned emotional categories,

[0822] A means of integrating generated lyrics and music to create an album,

[0823] A means for incorporating image information into the aforementioned album,

[0824] A means of distributing the generated album to users,

[0825] A system comprising the above means.

[0826] (Claim 2)

[0827] The system according to claim 1, which utilizes natural language processing technology for analyzing the aforementioned emotional information and facial expression data.

[0828] (Claim 3)

[0829] The system according to claim 1, wherein in generating the aforementioned music, a musical style corresponding to an emotional category is selected.

[0830] "Application Example 2 when combined with an emotion engine"

[0831] (Claim 1)

[0832] Means of receiving emotional information,

[0833] Means of receiving image information,

[0834] Means of receiving audio information,

[0835] A means for analyzing the aforementioned emotional information and identifying emotional categories,

[0836] Means for generating lyrics based on the aforementioned emotion categories,

[0837] Means for generating music based on the aforementioned emotional categories,

[0838] A means of integrating generated lyrics and music to create an album,

[0839] A means for incorporating image information into the aforementioned album,

[0840] A means for generating motion when the robot device plays the music,

[0841] A system comprising the above means.

[0842] (Claim 2)

[0843] The system according to claim 1, which utilizes natural language processing technology for analyzing the aforementioned emotional information.

[0844] (Claim 3)

[0845] The system according to claim 1, which selects a music genre corresponding to an emotional category in the generation of the aforementioned music.

[Explanation of symbols]

[0846]
10, 210, 310, 410 Data processing systems
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: means for receiving emotional information; means for receiving image information; means for analyzing the emotional information and identifying an emotional category; means for generating lyrics based on the emotional category; means for generating music based on the emotional category; means for integrating the generated lyrics and music to create an album; and means for incorporating the image information into the album.

2. The system according to claim 1, which utilizes natural language processing technology for analyzing the aforementioned emotional information.

3. The system according to claim 1, wherein in generating the aforementioned music, a music genre corresponding to an emotional category is selected.