system

The system addresses the lack of personalized and emotional feedback in voice training by analyzing user data to generate optimized practice plans and provide real-time feedback, improving singing ability and emotional expression.

JP2026105371A · Pending · Publication Date: 2026-06-26 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-16
Publication Date
2026-06-26

Abstract

[Problem] To provide a system. [Solution] A system including: a data acquisition means for acquiring user voice data; an analysis means for analyzing the voice data and evaluating the user's acoustic characteristics and vocalization characteristics; a plan generation means for generating a practice plan customized for the user based on the results evaluated by the analysis means; a feedback provision means for providing real-time feedback based on the practice plan generated by the plan generation means; a monitoring means for monitoring the user's practice history and growth trends and visualizing progress data; and a home-use mechanical device operating means for providing the user with immediate feedback based on sound and rhythm, using a home-use mechanical device, in a scenario where the user practices vocal exercises.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a character of the chatbot, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] An object of the present invention is to address the problem of insufficient objective evaluation in voice pronunciation training, and to provide a technology that enables the provision of a practice plan optimized for individual users, real-time feedback, and long-term growth monitoring. Another object is to continuously update the plan based on the latest trends in the music industry to support the health management and emotional expression of users' voices.

Means for Solving the Problems

[0005] The system of this invention collects user voice data and evaluates the user's voice quality and vocal characteristics through AI analysis. Based on the evaluation results, it generates a personalized practice plan and provides real-time feedback to the user. It also monitors the user's practice history and growth trends, and visualizes progress to improve practice efficiency. Furthermore, it acquires the user's heart rate and breathing data in real time and updates the practice plan considering trends in the music industry, thereby supporting long-term vocal health management and improvement of emotional expression.

[0006] A "user" refers to a person who provides voice data and uses the system.

[0007] "Voice data" refers to data obtained by capturing the voice spoken by a user as an electrical signal.

[0008] "Data collection means" refers to a device or process for acquiring voice data from a user.

[0009] "Analysis means" refers to a device or process for analyzing the content of audio data and evaluating the user's voice quality and vocal characteristics.

[0010] "Generation means" refers to a device or process that creates a personalized practice plan based on the analysis results.

[0011] "Feedback means" refers to a device or process that provides users with real-time evaluations and suggestions for improvement regarding their vocalizations.

[0012] "Monitoring means" refers to a device or process that tracks a user's practice history and growth trends, and collects and analyzes data.

[0013] "Real-time" refers to the characteristic where audio data is processed and feedback is given simultaneously with its generation.

[0014] The "practice plan" refers to a series of practice contents and exercises designed individually for the purpose of improving the user's voice.

[0015] "Trend" refers to the latest trends and styles that are common in the music industry at a specific time.

Brief Explanation of Drawings

[0016]
[Figure 1] It is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] It is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] It is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] It is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] It is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] It is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] It is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] It is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 10] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 11] It is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] It is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] It is a sequence diagram showing the processing flow of the data processing system in Example 2 when the emotion engine is combined.
[Figure 14] It is a sequence diagram showing the processing flow of the data processing system in Application Example 2 when the emotion engine is combined.

Mode for Carrying Out the Invention

[0017] Hereinafter, an example of an embodiment of the system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0018] First, the terms used in the following description will be explained.

[0019] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0020] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0021] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0022] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0023] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."

[0024] [First Embodiment]

[0025] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0026] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0027] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0028] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0029] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0030] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0031] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0032] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0033] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0034] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0035] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0036] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0037] The system according to the present invention provides highly personalized practice plans and feedback to support the user's improvement of singing ability. This enables efficient practice based on the user's vocal characteristics, supporting long-term growth and vocal health management.

[0038] The system first collects the user's voice data via the terminal. When the user speaks into the terminal, the voice data is sent to the server in real time. On the server, the received voice data is analyzed by an AI voice analysis engine, and vocal characteristics such as voice quality, pitch, and rhythm are evaluated in detail. Based on this evaluation result, an optimal practice plan is generated for the user.

[0039] The generated practice plan is sent from the server to the terminal and presented to the user through the terminal. This practice plan includes specific exercises and goals that the user should work on, and the degree of achievement and areas for improvement are shown through real-time feedback. Feedback is provided immediately, for example, when the user sings off-key, and is communicated to the user in visual or audible form.
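
As an illustration of the off-key feedback described above, the following is a minimal sketch that expresses the gap between a sung pitch and a target pitch in cents (1200 times the base-2 logarithm of the frequency ratio) and turns it into a short message. The function names and the 20-cent tolerance are assumptions made for this example; the source does not specify how the deviation is measured or what threshold is used.

```python
import math


def cents_off(sung_hz: float, target_hz: float) -> float:
    """Signed deviation of the sung pitch from the target in cents.

    100 cents correspond to one equal-tempered semitone.
    """
    return 1200.0 * math.log2(sung_hz / target_hz)


def pitch_feedback(sung_hz: float, target_hz: float,
                   tolerance_cents: float = 20.0) -> str:
    """Turn a pitch deviation into a short feedback message.

    tolerance_cents is a hypothetical threshold, not a value from the source.
    """
    deviation = cents_off(sung_hz, target_hz)
    if abs(deviation) <= tolerance_cents:
        return "on pitch"
    direction = "high" if deviation > 0 else "low"
    return f"{abs(deviation):.0f} cents {direction}"


if __name__ == "__main__":
    # A4 is 440 Hz; 430 Hz is noticeably flat against it.
    print(pitch_feedback(430.0, 440.0))
```

A message of this kind could be shown on the display or spoken through the speaker, matching the visual or audible form mentioned in the text.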

[0040] In addition, the system monitors the user's heart rate and respiratory data in real time and suggests relaxation techniques and warm-up exercises as needed. This information is acquired via sensors, analyzed on a server, and then transmitted to the terminal.

[0041] The system also accumulates the user's practice history and analyzes long-term growth trends. Based on this analysis, the server generates a dashboard that visualizes the user's progress. This dashboard is presented to the user via their device, letting them see their own progress and helping to sustain motivation.
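
One simple way to quantify a long-term growth trend is the least-squares slope of a per-session score over the session index. The sketch below assumes that each practice session has already been reduced to a single numeric score; that scoring, and the function name, are hypothetical and not described in the source.

```python
from statistics import mean


def growth_trend(scores: list[float]) -> float:
    """Least-squares slope of score versus session index.

    The result is in score points per session: positive means the
    user is improving, negative means the scores are declining.
    """
    if len(scores) < 2:
        return 0.0  # a trend needs at least two sessions
    xs = range(len(scores))
    x_bar = mean(xs)
    y_bar = mean(scores)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical pitch-accuracy scores from four sessions.
    print(growth_trend([60.0, 62.0, 64.0, 66.0]))
```

A dashboard could plot the raw scores and report this slope as a single summary figure.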

[0042] For example, in the case of a user practicing opera, the system creates a practice plan focused on a specific part of an aria based on the results of the previous practice session. If there are any problems with pitch or rhythm during practice, feedback is provided instantly, allowing for immediate correction. Through this process, the user can continue practicing while checking their own progress.

[0043] The following describes the processing flow.

[0044] Step 1:

[0045] The user speaks into the device and selects the song or scale they want to practice. The device captures this voice and collects the audio data in real time.

[0046] Step 2:

[0047] The terminal sends the collected audio data to the server. The server inputs the received audio data into the AI voice analysis engine.

[0048] Step 3:

[0049] The server uses an AI voice analysis engine to analyze the voice data. This analysis evaluates vocal characteristics such as voice quality, pitch, volume, and rhythm.

[0050] Step 4:

[0051] The server generates a practice plan optimized for the user based on the analysis results. This practice plan includes specific exercises and areas for improvement.

[0052] Step 5:

[0053] The server sends the generated practice plan to the device. The device receives the practice plan and presents it to the user. This may be displayed visually using icons and text, and voice guidance is also possible.

[0054] Step 6:

[0055] The user follows the instructions on the device to perform the exercises, and heart rate and respiratory data are recorded during the process. This data is acquired by sensors and sent to the device.

[0056] Step 7:

[0057] The device sends practice status to the server in real time. The server generates feedback information and points out areas where the user needs to improve.

[0058] Step 8:

[0059] The device provides the user with feedback received from the server. Specifically, it visually or audibly points out issues such as pitch discrepancies or rhythmic inconsistencies.

[0060] Step 9:

[0061] The server accumulates user practice data and analyzes long-term growth trends. Based on this analysis, it generates a dashboard that visualizes progress.

[0062] Step 10:

[0063] Users can check the dashboard through their device to understand their progress and achievements, and use that information to practice towards their next goals.
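
Steps 3 and 4 of the flow above (evaluation followed by plan generation) can be sketched with a simple rule: practice the lowest-scoring characteristics first. The source states that a plan is generated from the analysis results but does not give the rule, so the 0 to 100 scoring scale, the `Evaluation` type, and the exercise catalogue below are all assumptions for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass
class Evaluation:
    """Scores from 0 to 100 for each characteristic (assumed scale)."""
    voice_quality: float
    pitch: float
    volume: float
    rhythm: float


# Hypothetical exercise catalogue keyed by characteristic name.
EXERCISES = {
    "voice_quality": "lip trills and humming, 5 minutes",
    "pitch": "slow scales against a reference tone, 10 minutes",
    "volume": "crescendo and decrescendo on sustained notes, 5 minutes",
    "rhythm": "clapping and singing with a metronome, 10 minutes",
}


def generate_plan(evaluation: Evaluation, max_items: int = 2) -> list[str]:
    """Select exercises for the weakest characteristics, weakest first."""
    scores = asdict(evaluation)
    weakest = sorted(scores, key=scores.get)[:max_items]
    return [f"{name}: {EXERCISES[name]}" for name in weakest]


if __name__ == "__main__":
    for item in generate_plan(Evaluation(80.0, 55.0, 70.0, 60.0)):
        print(item)
```

The generative model described in the text would replace the fixed catalogue in practice; the rule here only shows how evaluation results could drive the choice of exercises.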

[0064] (Example 1)

[0065] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0066] Conventional singing practice systems lacked sufficient practice planning and feedback based on individual user characteristics, making them ineffective in supporting growth. Furthermore, real-time evaluation necessary for improving musical expression and optimization of practice based on physiological states were difficult.

[0067] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0068] In this invention, the server includes means for collecting user voice data and transmitting the digitized voice information to the server; means for an AI voice analysis engine to evaluate voice characteristics and musical indicators using the voice information; and means for creating a user-specific practice strategy based on the evaluation results obtained by the analysis means. This enables the user to receive a personalized practice plan and real-time feedback, resulting in effective improvement of singing ability.

[0069] "Data collection means" refers to a device or software that has the function of capturing the user's voice, converting it into a digital format, and transmitting it to a server.

[0070] "Analysis means" refers to a system that uses a voice analysis engine on the server side to evaluate the voice quality and musical indicators of voice data.

[0071] "Generation means" refers to a system that has the function of automatically creating practice strategies tailored to individual users based on analyzed data.

[0072] A "feedback mechanism" is a device or function that provides real-time evaluation and suggestions for improvement of the user's practice and performance based on the generated practice strategy.

[0073] A "monitoring system" is a system that monitors the user's physiological data and recommends appropriate warm-up exercises according to their physical and mental state.

[0074] A "visualization tool" is a system that visually displays a user's practice progress and long-term growth, and has functions to boost motivation.

[0075] In this invention, the user operates a system that provides a voice practice environment. The user begins by collecting voice data using the microphone of a dedicated terminal. This terminal is equipped with high-performance voice recognition technology that converts the collected voice into digital data in real time and transmits it to a server over a network.

[0076] Upon receiving audio data, the server utilizes an AI voice analysis engine to analyze the speech. This analysis engine evaluates musical indicators such as voice quality, pitch, and rhythm, and based on these, understands the user's vocal characteristics. Based on the analysis results, the server generates a practice strategy optimized for the user's characteristics. This generation process uses voice processing algorithms and generative AI models.

[0077] The generated practice strategy is sent from the server to the terminal and presented to the user through the terminal's interface. This strategy includes specific exercises and goals, allowing the user to practice while receiving real-time feedback. This feedback is provided as visual and auditory instructions to help with pitch adjustments and rhythm improvements.
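
For the rhythm side of this feedback, one possible measure is how far each note onset falls from the nearest beat of a fixed tempo. The sketch below assumes that onset times have already been detected upstream and that the tempo is known; the function names and the 50 ms tolerance are hypothetical.

```python
def rhythm_offsets_ms(onsets_s: list[float], tempo_bpm: float) -> list[float]:
    """Signed distance of each onset from the nearest beat, in milliseconds.

    Positive values are late (behind the beat); negative values are early.
    """
    beat_s = 60.0 / tempo_bpm
    offsets = []
    for onset in onsets_s:
        nearest_beat = round(onset / beat_s) * beat_s
        offsets.append((onset - nearest_beat) * 1000.0)
    return offsets


def rhythm_feedback(onsets_s: list[float], tempo_bpm: float,
                    tolerance_ms: float = 50.0) -> str:
    """Summarize the average timing error as a short message.

    tolerance_ms is a hypothetical threshold, not a value from the source.
    """
    offsets = rhythm_offsets_ms(onsets_s, tempo_bpm)
    if not offsets:
        return "no onsets detected"
    mean_offset = sum(offsets) / len(offsets)
    if abs(mean_offset) <= tolerance_ms:
        return "in time"
    return "dragging" if mean_offset > 0 else "rushing"


if __name__ == "__main__":
    # At 120 BPM a beat lasts 0.5 s; these onsets are all about 100 ms late.
    print(rhythm_feedback([0.1, 0.6, 1.1], tempo_bpm=120.0))
```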

[0078] Furthermore, the device can monitor physiological indicators such as heart rate and respiration using sensors. The server analyzes this data and recommends relaxation techniques and warm-up exercises as needed. This reduces the physical burden on the user and provides a safe and effective training environment.
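
The recommendation step could be as simple as a threshold rule on the two readings. The sketch below is only that: the thresholds, the default resting heart rate, and the suggestion texts are hypothetical placeholders, and none of them is taken from the source or intended as clinical guidance.

```python
from typing import Optional


def suggest_relaxation(heart_rate_bpm: float, breaths_per_min: float,
                       resting_heart_rate_bpm: float = 70.0) -> Optional[str]:
    """Return a suggestion when the readings look elevated, otherwise None."""
    # Hypothetical thresholds for illustration only.
    heart_rate_elevated = heart_rate_bpm > 1.3 * resting_heart_rate_bpm
    breathing_fast = breaths_per_min > 20.0
    if heart_rate_elevated and breathing_fast:
        return "Pause and do slow diaphragmatic breathing for two minutes."
    if heart_rate_elevated or breathing_fast:
        return "Do a gentle warm-up such as humming before continuing."
    return None


if __name__ == "__main__":
    print(suggest_relaxation(heart_rate_bpm=100.0, breaths_per_min=24.0))
```

Returning `None` when nothing is elevated matches the "as needed" wording: no suggestion is sent unless the readings call for one.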

[0079] Furthermore, the server stores the user's practice history in a database and generates a dashboard that visualizes their progress. Users can check this progress information through their devices, which can help them maintain their motivation.

[0080] For example, a user practicing opera could use a prompt like this: "I want to practice opera arias. For this practice session, I want to improve my pitch and rhythm accuracy, so please provide a practice plan and feedback specifically tailored to that."

[0081] This system comprehensively provides a series of processes, from collecting and analyzing audio data to generating practice strategies and providing feedback, supporting the efficient improvement of users' singing abilities.

[0082] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0083] Step 1:

[0084] The user speaks using the device's microphone. The device converts this voice into digital audio data and sends it to the server over the network. The input for this step is the user's raw voice, and the output is digital audio data. Specifically, the device's built-in A/D converter converts the analog audio into a digital signal.

[0085] Step 2:

[0086] The server processes the received digital audio data through an AI voice analysis engine. The analysis engine analyzes the audio data, extracting musical characteristics such as voice quality, pitch, and rhythm, and evaluating them as numerical data. The input for this step is digital audio data, and the output is numerical data representing the user's vocal characteristics. Specifically, a feature extraction algorithm calculates the necessary indicators from the voice waveform.

[0087] Step 3:

[0088] The server generates a personalized practice strategy for each user based on the analyzed vocal characteristics. Here, a generative AI model is used to create a practice plan that includes exercises designed to strengthen the user's weaknesses and areas for improvement. The input for this step is numerical data on the user's vocal characteristics, and the output is a specific practice plan. Specifically, the AI model recommends the most suitable practice content based on past data.

[0089] Step 4:

[0090] The generated practice plan is sent from the server to the terminal and presented to the user visually and audibly. The input here is the practice plan data, and the output is the content presented to the user on the terminal. Specifically, the user interface receives the plan information and performs actions such as screen display and audio output.

[0091] Step 5:

[0092] When the user begins practicing, the device again collects audio and sends it to the server to generate scientific feedback. The server uses an AI analysis engine to perform real-time evaluations and provide feedback on pitch deviations and rhythmic irregularities. The input for this step is the user's audio data during practice, and the output is real-time feedback to the user. Specifically, the feedback content is calculated and sent to the device immediately.

[0093] Step 6:

[0094] The device collects the user's heart rate and respiratory data using sensors and sends it to the server. The server analyzes this physiological data and suggests relaxation and warm-up exercises as needed. The input for this step is physiological data, and the output is suggested exercises. Specifically, the server assesses the stress level and selects relaxation methods.

[0095] Step 7:

[0096] The server continuously records the user's practice history and generates a dashboard that visualizes their progress. The dashboard is displayed on the device, showing the user their progress. The input for this step is the user's practice data, and the output is visualized progress information. Specifically, it extracts information from the database and generates graphs.
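
As one example of the numerical indicators mentioned in Step 2 above, the loudness of the digitized voice can be summarized as a root-mean-square level per frame. The sketch assumes samples normalized to the range -1 to 1 and a frame size of 1024 samples; both are assumptions, and the source does not name the indicators its feature extraction algorithm computes.

```python
import math


def rms_dbfs(samples: list[float]) -> float:
    """RMS level of samples in [-1, 1], in decibels relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def frame_levels(samples: list[float], frame_size: int = 1024) -> list[float]:
    """Split the signal into consecutive frames and measure each one."""
    return [
        rms_dbfs(samples[start:start + frame_size])
        for start in range(0, len(samples), frame_size)
    ]


if __name__ == "__main__":
    # A square-like test signal at half of full scale.
    signal = [0.5, -0.5] * 1024
    print(frame_levels(signal))
```

A sequence of frame levels like this gives the server a volume contour that can be stored alongside pitch and rhythm values.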

[0097] (Application Example 1)

[0098] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0099] For users to efficiently improve their singing ability, personalized practice plans and immediate feedback are crucial. However, conventional technology struggles to provide such a high level of personalization, and there is a particular lack of systems to support practice in home environments. Furthermore, the absence of mechanical devices to effectively support users during vocal practice is also a problem.

[0100] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0101] In this invention, the server includes data acquisition means for acquiring user voice data, analysis means for analyzing the voice data and evaluating the user's acoustic and vocal characteristics, and plan generation means for generating a practice plan customized for the user based on the results evaluated by the analysis means. This provides the user with a suitable practice plan and real-time feedback, making it possible to effectively improve singing ability even in a home environment.

[0102] "Data acquisition means" refers to functions or devices for acquiring user voice data.

[0103] "Analysis means" refers to functions or devices that analyze acquired audio data and evaluate the user's acoustic characteristics and vocalization characteristics.

[0104] A "plan generation means" is a function or device that generates a customized practice plan for the user based on the evaluation results obtained by the analysis means.

[0105] A "feedback provision means" refers to a function or device that provides real-time feedback to the user based on the generated practice plan.

[0106] "Monitoring means" refers to functions or devices that monitor a user's practice history and growth trends, and visualize progress data.

[0107] "Household mechanical device operating means" refers to a function or device that uses household mechanical devices to provide users with immediate feedback based on sound and rhythm when they practice vocal exercises.

[0108] This invention relates to a system that acquires and analyzes user voice data to effectively support user voice practice and generates and provides a personalized practice plan. The system includes data acquisition means for collecting user voice data, which is acquired in real time using a microphone. The acquired voice data is transmitted to a server and analyzed by an AI voice analysis engine.

[0109] The server evaluates the user's acoustic and vocal characteristics through the analysis of audio data. Specifically, it uses analysis software developed in Python or C++, as well as libraries such as TensorFlow (registered trademark) and PyTorch. Based on the analysis results, a practice plan suitable for the user is generated by the plan generation means.

[0110] The generated practice plan is presented to the user via the device, together with real-time feedback. This feedback is presented visually or audibly through the household mechanical device operating means. The user's heart rate and respiratory data are also collected via sensors and used to suggest relaxation techniques and the like.

[0111] As a concrete example, consider a child using this system to practice singing at home. When the child practices singing, the system provides immediate feedback such as, "The pitch is a little high, try lowering it a bit," if the pitch or rhythm is off.

[0112] An example of a prompt to the generative AI model is, "Analyze the user's voice data in real time, evaluate pitch and voice quality, and generate an optimal practice plan." This allows users to continuously improve their singing ability even in a home environment.
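
A prompt of this kind could be assembled from the analysis results before it is sent to the model. The sketch below only builds the text; the function name, the extra fields, and their formatting are assumptions, and no particular model API is implied.

```python
BASE_INSTRUCTION = (
    "Analyze the user's voice data in real time, evaluate pitch and "
    "voice quality, and generate an optimal practice plan."
)


def build_plan_prompt(goal: str, pitch_error_cents: float,
                      rhythm_error_ms: float) -> str:
    """Combine the base instruction with measured values (hypothetical fields)."""
    return "\n".join([
        BASE_INSTRUCTION,
        f"Practice goal: {goal}",
        f"Mean pitch deviation: {pitch_error_cents:+.0f} cents",
        f"Mean rhythm deviation: {rhythm_error_ms:+.0f} ms",
    ])


if __name__ == "__main__":
    print(build_plan_prompt("opera aria, pitch and rhythm accuracy", 35.0, -40.0))
```

Passing measured values alongside the instruction gives the model concrete figures to base the plan on, rather than leaving it to infer them.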

[0113] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0114] Step 1:

[0115] When a user begins singing using the device, the device collects audio data via its microphone. The user's voice waveform is obtained as input and sent to the server.

[0116] Step 2:

[0117] The server passes the received audio data to an AI speech analysis engine, which analyzes the acoustic and vocal characteristics. Specifically, it uses FFT (Fast Fourier Transform) to analyze the frequency components of the sound and extract pitch and rhythm. The output of this analysis is data on the quality, pitch, and rhythm of the user's voice.

[0118] Step 3:

[0119] Based on the analysis results, the server generates a customized practice plan for the user using a plan generation mechanism. The input is audio analysis data, and the output is a list of practice items and goals suitable for the user. This process is performed using a generative AI model, and prompts such as "Analyze the user's audio data in real time, evaluate pitch and voice quality, and generate the optimal practice plan" are used to determine practice content that takes the user's progress into consideration.

[0120] Step 4:

[0121] The device receives the generated practice plan and provides real-time presentation and feedback to the user. Specifically, real-time guidance is provided to the user through visual displays and voice messages. Voice data and the practice plan are used as input for feedback, and progress and areas for improvement are communicated to the user as output.

[0122] Step 5:

[0123] The device monitors the user's heart rate and respiratory data in real time and suggests relaxation techniques as needed. Input is data acquired by physiological data sensors, and output provides the user with instructions and suggestions regarding relaxation.

[0124] Step 6:

[0125] The server stores the user's practice history and monitors long-term growth trends. Past practice data is stored on the server as input, and information visualizing the growth trend is generated on a dashboard for the user as output. This allows users to visually understand their own progress.
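
The FFT-based analysis named in Step 2 above can be sketched in plain Python as follows: a radix-2 FFT is applied to one windowed frame and the strongest frequency bin is reported. This is a simplified stand-in rather than the source's analysis engine. The Hann window, the peak-picking rule, and all function names are assumptions, and in a real voice the strongest bin may be a harmonic rather than the fundamental.

```python
import cmath
import math


def fft(x: list[complex]) -> list[complex]:
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def dominant_frequency(samples: list[float], sample_rate: int) -> float:
    """Frequency in Hz of the strongest non-DC bin below the Nyquist limit."""
    n = len(samples)
    # A Hann window reduces leakage from tones that fall between bins.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, s in enumerate(samples)
    ]
    spectrum = fft(windowed)
    peak_bin = max(range(1, n // 2), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / n


def sine_wave(freq_hz: float, sample_rate: int, n: int) -> list[float]:
    """Test-signal helper: n samples of a pure tone."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


if __name__ == "__main__":
    frame = sine_wave(440.0, 8000, 1024)
    print(dominant_frequency(frame, 8000))
```

With 1024 samples at 8000 Hz the bins are 7.8125 Hz apart, so the estimate is only accurate to within about half a bin; longer frames or interpolation around the peak would be needed for finer pitch resolution.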

[0126] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform specific processing using the user's emotions.

[0127] This invention enhances the effectiveness of vocal training by adding a function to recognize the user's emotions to a system that provides users with personalized practice plans and real-time feedback during voice training. This aims to improve not only the quality of the user's voice but also their overall singing ability, including emotional expression.

[0128] The system first collects the user's voice and biometric data via the terminal. When the user speaks into the terminal, the voice data is sent to the server. Once the voice data reaches the server, it is input into an AI voice analysis engine, where the voice quality and vocal characteristics are analyzed in detail.

[0129] In addition, this system incorporates an emotion engine that analyzes the user's emotional state from voice data. The emotion engine identifies the user's emotions based on factors such as voice intonation and rhythm, as well as heart rate fluctuations obtained from biometric data.
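As an illustration only, the combination of intonation, rhythm, and heart rate fluctuation described above can be sketched as a small rule-based estimator. The feature names, the thresholds, and the reduction to a single "high"/"low" arousal label are assumptions made for this sketch; the disclosure does not specify how the emotion engine is implemented, and a learned model such as the emotion identification model 59 would replace these rules.

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class EmotionFeatures:
    pitch_hz: list[float]         # per-frame pitch estimates (intonation)
    onset_times_s: list[float]    # note or syllable onset times (rhythm)
    rr_intervals_ms: list[float]  # beat-to-beat heart intervals (biometric data)


def estimate_arousal(f: EmotionFeatures) -> str:
    """Toy rule: wide pitch movement, fast onsets, and low heart rate
    variability each count as one vote for high arousal."""
    pitch_spread = pstdev(f.pitch_hz)
    gaps = [b - a for a, b in zip(f.onset_times_s, f.onset_times_s[1:])]
    onset_rate = len(gaps) / sum(gaps)  # onsets per second
    hrv = pstdev(f.rr_intervals_ms)
    votes = 0
    votes += pitch_spread > 30.0  # hypothetical threshold in Hz
    votes += onset_rate > 3.0     # hypothetical threshold in onsets/s
    votes += hrv < 20.0           # hypothetical threshold in ms
    return "high" if votes >= 2 else "low"
```

A real emotion engine would output richer categories (for example joy, sadness, or surprise); the two-level label here only shows how voice and biometric features can be fused into one estimate.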

[0130] Based on the analysis results, the server generates a personalized practice plan tailored to the user's emotions and voice quality. This practice plan includes specific exercises that utilize the user's current emotional state, enabling them to improve their emotional expression.

[0131] The generated practice plan is sent from the server to the device and presented to the user on the device. The device also receives feedback from the server during practice and notifies the user in real time. The feedback includes not only areas for improvement in voice but also evaluations of the emotions being expressed.

[0132] For example, if a user is practicing an opera aria, the system will grasp the emotions embedded in the piece and provide a practice plan based on that. If the user struggles with expressing those emotions at a certain stage, the system can provide precise advice based on the analysis results of its emotion engine. In this way, users can hone not only their musical technique but also their expressive abilities simultaneously.

[0133] The following describes the processing flow.

[0134] Step 1:

[0135] The user vocalizes toward the device with emotional expression. The device collects biometric data, such as the user's heart rate, along with the voice data.

[0136] Step 2:

[0137] The device sends the collected voice and biometric data to the server. The server inputs the received data into its AI voice analysis engine and begins the analysis.

[0138] Step 3:

[0139] The server performs speech analysis to evaluate vocal characteristics such as voice quality, pitch, and rhythm. Simultaneously, it uses an emotion engine to analyze the user's emotional state from the speech data. This emotion analysis draws on changes in intonation and rhythm, as well as fluctuations in biometric data.

[0140] Step 4:

[0141] Based on the analysis results, the server generates a practice plan optimized for the user's emotions and vocal characteristics. This plan includes specific exercises that take the user's emotional state into account, as well as content aimed at improving expressive techniques.

[0142] Step 5:

[0143] The server sends the generated practice plan to the terminal. The terminal receives the practice plan and presents it to the user visually and audibly. The user performs vocal exercises according to the instructed practice plan.

[0144] Step 6:

[0145] While the user is practicing, the device continuously collects voice and biometric data and sends it to the server in real time. The server generates feedback based on this data.

[0146] Step 7:

[0147] The server generates feedback, including technical improvements to the user's voice and advice on emotional expression. This feedback includes specific suggestions for improvement and advice on how to express emotions effectively.

[0148] Step 8:

[0149] The device provides users with real-time feedback from the server, allowing them to immediately apply it to their practice. For example, it might suggest specific ways to emphasize areas where emotional expression is lacking.

[0150] Step 9:

[0151] The server accumulates all of the user's practice data over the long term and analyzes changes in emotional state and technical growth trends. This data is provided to the user and can be used to improve their next practice session.

[0152] Based on this feedback and progress data, users can set new goals to improve their expressive abilities and continue practicing.
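The long-term trend analysis in Step 9 above can be sketched as a least-squares slope over per-session scores. This is a minimal illustration: the function name `growth_trend` and the idea of a single numeric score per session are assumptions, not details from the disclosure.

```python
def growth_trend(scores: list[float]) -> float:
    """Least-squares slope of score versus session index
    (score points gained per session; negative means decline)."""
    n = len(scores)
    if n < 2:
        return 0.0  # not enough sessions to define a trend
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

The same function could be applied separately to a technical score series and an emotional-expression score series, so that the two growth trends mentioned in Step 9 are reported side by side.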

[0153] (Example 2)

[0154] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0155] Conventional voice training systems primarily focused on voice quality and technical aspects, limiting their ability to improve users' emotional expression. Furthermore, they faced challenges in appropriately evaluating the impact of the user's physical and mental state on the training's effectiveness and incorporating this into the training plan.

[0156] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0157] In this invention, the server includes data collection means for acquiring the user's voice data and biometric data, analysis means for analyzing voice and emotions and evaluating the user's voice quality and emotional state, and generation means for generating a practice plan that utilizes the user's emotions. This makes it possible to improve the user's voice quality and simultaneously enhance their emotional expression.

[0158] "Data collection means" refers to a device or method for acquiring a user's voice data and biometric data.

[0159] "Analysis means" refers to processes and technologies for evaluating the user's voice quality and vocal characteristics from the acquired audio data, and for further analyzing the user's emotions.

[0160] "Emotional analysis means" refers to a device or method for identifying a user's emotional state based on voice intonation and biometric data.

[0161] "Generation means" refers to an apparatus or method for generating a personalized practice plan for a user based on the analysis results.

[0162] A "feedback device" is a device or method that provides real-time evaluation of voice and emotional expression in accordance with the generated practice plan.

[0163] "Monitoring means" refers to a device or method for tracking a user's practice history and growth trends, and for visualizing their progress.

[0164] This system aims to support users' voice training and improve their overall abilities, including emotional expression. Users input voice data through a device, and this voice data and biometric data are collected. Data collection is performed using a microphone and heart rate sensor built into the device.

[0165] The device sends this data to the server, which first uses an AI voice analysis engine to analyze the quality, tone, and rhythm of the voice. This voice analysis utilizes machine learning algorithms. Furthermore, an emotion analysis engine analyzes intonation, rhythm, vital data, etc., from the voice data to identify the user's emotional state at that moment.

[0166] Based on the analysis results, the server generates a practice plan optimized for the user. This plan includes content that takes advantage of the user's emotional state during practice. For example, if relaxation is needed, vocal exercises that promote relaxation can be incorporated.
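The idea that the plan reacts to the user's state (for example, adding relaxation exercises when relaxation is needed) can be sketched with a rule-based stand-in for the generation means. The `Analysis` fields, the thresholds, and the exercise texts below are hypothetical; the disclosure describes the plan as being generated on the server and does not fix these rules.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    pitch_error_cents: float  # mean absolute pitch deviation (assumed metric)
    rhythm_error_ms: float    # mean absolute timing deviation (assumed metric)
    emotional_state: str      # label from the emotion analysis, e.g. "tense"


def build_plan(a: Analysis) -> list[str]:
    """Return an ordered list of practice items for one session."""
    plan = []
    if a.emotional_state == "tense":
        # Relaxation comes first when the user appears tense.
        plan.append("Relaxation: slow breathing and lip trills")
    if a.pitch_error_cents > 25:   # hypothetical threshold
        plan.append("Pitch: sustained-note matching against a reference tone")
    if a.rhythm_error_ms > 50:     # hypothetical threshold
        plan.append("Rhythm: clap and sing with a metronome")
    plan.append("Repertoire: sing the target phrase with the intended emotion")
    return plan
```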

[0167] The practice plan is sent from the server to the terminal and presented to the user. The terminal screen displays the practice steps and provides voice guidance. Voice and biometric data during practice are sent back to the server to provide real-time feedback.

[0168] For example, if a user enters a prompt such as, "I want to sing opera arias with more emotion. Please advise me on what kind of practice I should do," the system can instantly generate and present a practice plan that meets that request.
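One way such a user request might be combined with the analysis results before being passed to a generative AI model is sketched below. The prompt wording, the function name `build_prompt`, and the analysis keys are assumptions for illustration; no particular model API is implied.

```python
import json


def build_prompt(user_request: str, analysis: dict) -> str:
    """Assemble a text prompt from the user's request and analysis results."""
    return (
        "You are a voice training assistant.\n"
        f"User request: {user_request}\n"
        f"Analysis results (JSON): {json.dumps(analysis, sort_keys=True)}\n"
        "Return a practice plan as a numbered list of exercises with goals."
    )


prompt = build_prompt(
    "I want to sing opera arias with more emotion.",
    {"emotion": "tense", "pitch_error_cents": 32},  # hypothetical analysis output
)
```

Serializing the analysis as JSON keeps the numeric results unambiguous for the model while leaving the user's wording untouched.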

[0169] This system allows users to simultaneously improve their voice skills and emotional expression. Furthermore, by incorporating the latest trends in the music field, it allows for the continuous adoption of new training methods.

[0170] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0171] Step 1:

[0172] The user speaks into the device. The device captures the voice data with its built-in microphone and simultaneously collects biometric data such as heart rate using sensors. Both voice data and biometric data are acquired as input. This provides basic data to understand the user's vocalizations.

[0173] Step 2:

[0174] The terminal transmits the collected voice and biometric data to the server. The data is encrypted using a secure protocol and transferred safely to the server. The input here is the encrypted voice and biometric data, and the output is the same data delivered to the server in its original data format.

[0175] Step 3:

[0176] The server inputs the received audio data into an AI speech analysis engine. This analysis engine uses deep learning algorithms to analyze the quality, tone, and rhythm of the speech in detail. The input is the audio data sent to the server, and the output is the speech characteristic information as a result of the analysis.

[0177] Step 4:

[0178] The server simultaneously analyzes voice and biometric data using an emotion analysis engine to identify the user's emotional state. The input includes voice intonation, pacing, and heart rate changes, while the output provides an evaluation of the user's emotional state.

[0179] Step 5:

[0180] The server integrates the results of voice analysis and emotion analysis to generate a personalized practice plan that reflects the user's emotions. The input is the analysis results, and the output is an individually customized practice plan. This provides the user with practice content optimized for their needs.

[0181] Step 6:

[0182] The server sends the generated practice plan to the terminal, and the terminal presents the plan to the user. At this time, the practice steps are displayed on the terminal as visual and audio guides. The input is the practice plan, and the output is the training content presented to the user.

[0183] Step 7:

[0184] The user practices according to the practice plan provided on the device. The device collects voice and biometric data again and sends it to the server in real time. The server generates feedback based on this data and notifies the user in real time through the device. The input is the user's practice data, and the output is feedback including areas for improvement and evaluation.

[0185] (Application Example 2)

[0186] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0187] Conventional voice training systems focused solely on the technical aspects of voice, failing to adequately improve users' emotional expression. Furthermore, the lack of feedback functions tailored to emotional states made it difficult for users to obtain practice plans that were expressive and emotionally responsive. This left the overall improvement of users' singing abilities as a challenge.

[0188] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0189] In this invention, the server includes data collection means for acquiring the user's voice data and biometric information; analysis means for analyzing the voice data, evaluating voice quality and vocalization characteristics, and recognizing emotional states; and generation means for generating, based on the analysis results, a personalized practice plan that includes improving emotional expression. This allows the user to receive real-time feedback not only on vocal technique but also on emotional expression, enabling comprehensive improvement of singing ability.

[0190] "Data collection means" refers to a device or method for acquiring a user's voice data and biometric information.

[0191] "Analysis means" refers to a device or method that uses acquired audio data to evaluate the user's voice quality and vocalization characteristics, and further recognizes the user's emotional state from the audio.

[0192] "Generation means" refers to a device or method for creating a personalized practice plan tailored to the user's emotional state based on analysis results, with the aim of improving emotional expression skills.

[0193] A "feedback device" is a device or method that provides the user with real-time suggestions for improvement in voice and emotional expression based on the generated practice plan.

[0194] "Monitoring means" refers to a device or method that tracks a user's practice history and growth trends, visualizes progress data, and evaluates it, including changes in emotional expression.

[0195] The system implementing the present invention is configured to effectively utilize the user's voice data and biometric information to comprehensively improve vocalization and emotional expression.

[0196] First, the user speaks into a device equipped with a microphone and sensors. This device could be a smartphone, tablet, or personal computer. The device collects voice data and biometric information such as heart rate and respiration, and this data is transmitted to a server via the network.

[0197] To process speech in real time, the server uses an AI speech analysis engine to analyze voice quality and vocal characteristics in detail. Furthermore, to recognize emotional states, an emotion engine analyzes the intonation and rhythm of the speech together with the acquired biometric information. These analysis processes are generally performed using GPUs and high-performance computing environments.

[0198] Subsequently, the server's generation mechanism creates an optimal practice plan for the user based on the analysis results. This plan includes practice content aimed at improving expressiveness, taking into account the user's emotional state. The generated practice plan is immediately sent to the terminal and presented to the user. The terminal provides the user with real-time feedback, offering advice on areas for improvement in voice and emotional expression.

[0199] As a concrete example, consider a user practicing opera arias on their days off. The system grasps the emotions embedded in the piece and provides precise advice when the user struggles to express those emotions. In this way, users can simultaneously hone not only their musical technique but also their ability to express emotions richly.

[0200] A concrete example of a prompt for a generative AI model would be: "Please design an application that analyzes voice data and recognizes user emotions in real time. This application will be installed on a consumer robot and will optimize user feedback."

[0201] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0202] Step 1:

[0203] The device acquires voice data and biometric information simultaneously when the user speaks into the microphone. It collects voice signals and data such as heart rate and respiration as input, generating an initial digital voice file and biometric signal data. This information is then transmitted to a server via the network.

[0204] Step 2:

[0205] The server inputs the received audio data into an AI voice analysis engine to analyze voice quality and vocal characteristics. This process involves spectral analysis of the audio waveform to extract features such as pitch, tone, and volume. The analysis output provides detailed data on the technical characteristics of the user's voice.

[0206] Step 3:

[0207] The server uses an emotion engine to identify the user's emotional state from the voice data while simultaneously performing voice analysis. The input includes intonation, rhythm, and biometric data from the voice data. The emotion engine analyzes this data to identify emotional categories (e.g., joy, sadness, surprise). The result is output as data indicating the user's emotional state.

[0208] Step 4:

[0209] The server generates a personalized practice plan based on voice analysis results and emotional state. This planning takes into account the user's current technical ability and emotional needs, personalizing music selection and practice intensity. The generated practice plan is then sent to the device.

[0210] Step 5:

[0211] The device presents the user with a generated practice plan to apply during practice. It uses audio playback and text instructions to show the user specific practice content. The output of this step is visual or auditory information that guides the user through the next session.

[0212] Step 6:

[0213] The device monitors the user's performance during practice and provides real-time feedback from the server. It notifies the user of areas for improvement based on audio data or practice results, encouraging improvements in musical techniques, including emotional expression.

[0214] Step 7:

[0215] The server continuously records the user's practice history and analyzes growth trends. It uses past practice data as input to evaluate practice frequency and results. Based on this data, it generates progress reports that visualize progress, enabling users to self-evaluate their performance.
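The progress report in Step 7 above, which evaluates practice frequency and results, can be sketched as follows. The session representation (a date plus a 0 to 100 score) and the report fields are assumptions made for this example.

```python
from datetime import date


def progress_report(sessions: list[tuple[date, float]]) -> dict:
    """Summarize practice frequency and results.

    sessions: (practice date, score from 0 to 100), in any order.
    """
    if not sessions:
        return {"sessions": 0, "per_week": 0.0, "mean_score": 0.0, "change": 0.0}
    ordered = sorted(sessions)  # oldest first
    days = (ordered[-1][0] - ordered[0][0]).days + 1  # inclusive span
    scores = [score for _, score in ordered]
    return {
        "sessions": len(ordered),
        "per_week": len(ordered) * 7 / days,       # practice frequency
        "mean_score": sum(scores) / len(scores),   # overall result
        "change": scores[-1] - scores[0],          # latest minus earliest
    }
```

The resulting dictionary is the kind of data a terminal could render as a chart or summary card so that users can self-evaluate.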

[0216] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0217] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0218] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0219] [Second Embodiment]

[0220] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0221] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0222] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0223] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0224] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0225] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0226] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
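The disclosure does not name the protocol used for this secure exchange. As one common possibility, a terminal-side TLS configuration could look like the sketch below; the choice of TLS and the minimum version are assumptions, and the function name is hypothetical.

```python
import ssl


def make_client_tls_context() -> ssl.SSLContext:
    """Client-side TLS settings that verify the server before sending data."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2 (an assumed policy).
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

A context like this would be passed to whatever HTTP or socket client the terminal uses, so that voice and biometric data are only sent over a verified, encrypted connection.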

[0227] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0228] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0229] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0230] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0231] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0232] The system according to the present invention provides highly personalized practice plans and feedback to support the user's improvement of singing ability. This enables efficient practice based on the user's vocal characteristics, supporting long-term growth and vocal health management.

[0233] The system first collects the user's voice data via the terminal. When the user speaks into the terminal, the voice data is sent to the server in real time. On the server, the received voice data is analyzed by an AI voice analysis engine, and vocal characteristics such as voice quality, pitch, and rhythm are evaluated in detail. Based on this evaluation result, an optimal practice plan is generated for the user.

[0234] The generated practice plan is sent from the server to the terminal and presented to the user through the terminal. This practice plan includes specific exercises and goals that the user should work on, and the degree of achievement and areas for improvement are shown through real-time feedback. Feedback is provided immediately, for example, when the user sings off-key, and is communicated to the user in visual or audible form.
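Off-key feedback of this kind is commonly expressed in cents, where the deviation of a sung frequency f from a target frequency f_ref is 1200 · log2(f / f_ref) and 100 cents equal one semitone. The sketch below applies that formula; the tolerance value and the message wording are assumptions.

```python
import math


def cents_off(sung_hz: float, target_hz: float) -> float:
    """Signed pitch deviation in cents (positive = sharp, negative = flat)."""
    return 1200.0 * math.log2(sung_hz / target_hz)


def pitch_feedback(sung_hz: float, target_hz: float,
                   tolerance_cents: float = 20.0) -> str:
    """Turn a pitch deviation into a short message for the user."""
    c = cents_off(sung_hz, target_hz)
    if abs(c) <= tolerance_cents:  # hypothetical tolerance
        return "on pitch"
    direction = "sharp" if c > 0 else "flat"
    return f"{abs(c):.0f} cents {direction}"
```

For example, singing 880 Hz against a 440 Hz target is exactly one octave, or 1200 cents, sharp. The returned string could be shown on the display or read out as voice guidance.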

[0235] In addition, the system monitors the user's heart rate and respiratory data in real time and suggests relaxation techniques and warm-up exercises as needed. This information is acquired via sensors, analyzed on a server, and then transmitted to the terminal.
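A minimal sketch of this monitoring rule is shown below. The limits for heart rate and breathing rate and the suggestion texts are hypothetical placeholders; the disclosure does not state how the need for relaxation is decided.

```python
def suggest_relaxation(heart_rate_bpm: float, breaths_per_min: float,
                       hr_limit: float = 100.0,
                       breath_limit: float = 20.0) -> list[str]:
    """Return relaxation suggestions when physiological readings are elevated."""
    tips = []
    if heart_rate_bpm > hr_limit:        # assumed limit
        tips.append("Pause and breathe slowly for one minute")
    if breaths_per_min > breath_limit:   # assumed limit
        tips.append("Practice diaphragmatic breathing before the next phrase")
    return tips
```

An empty list means no suggestion is needed, so the terminal can stay silent unless a reading crosses its limit.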

[0236] The system also accumulates the user's practice history and analyzes long-term growth trends. Based on this analysis, the server generates a dashboard that visualizes the user's progress. This dashboard is presented to the user via their device, letting them see their own progress and helping them stay motivated.

[0237] For example, in the case of a user practicing opera, the system creates a practice plan focused on a specific part of an aria based on the results of the previous practice session. If there are any problems with pitch or rhythm during practice, feedback is provided instantly, allowing for immediate correction. Through this process, the user can continue practicing while checking their own progress.

[0238] The following describes the processing flow.

[0239] Step 1:

[0240] The user speaks into the device and selects the song or scale they want to practice. The device captures this voice and collects the audio data in real time.

[0241] Step 2:

[0242] The terminal sends the collected audio data to the server. The server inputs the received audio data into the AI voice analysis engine.

[0243] Step 3:

[0244] The server uses an AI voice analysis engine to analyze the voice data. This analysis evaluates vocal characteristics such as voice quality, pitch, volume, and rhythm.

[0245] Step 4:

[0246] The server generates a practice plan optimized for the user based on the analysis results. This practice plan includes specific exercises and areas for improvement.

[0247] Step 5:

[0248] The server sends the generated practice plan to the device. The device receives the practice plan and presents it to the user. This may be displayed visually using icons and text, and voice guidance is also possible.

[0249] Step 6:

[0250] The user follows the instructions on the device to perform exercises, and heart rate and respiratory data are recorded during the process. This data is acquired by sensors and sent to the device.

[0251] Step 7:

[0252] The device sends practice status to the server in real time. The server generates feedback information and points out areas where the user needs to improve.

[0253] Step 8:

[0254] The device provides the user with feedback received from the server. Specifically, it visually or audibly points out issues such as pitch discrepancies or rhythmic inconsistencies.

[0255] Step 9:

[0256] The server accumulates user practice data and analyzes long-term growth trends. Based on this analysis, it generates a dashboard that visualizes progress.

[0257] Step 10:

[0258] Users can check the dashboard through their device to understand their progress and achievements, and use that information to practice towards their next goals.

[0259] (Example 1)

[0260] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0261] Conventional singing practice systems lacked sufficient practice planning and feedback based on individual user characteristics, making them ineffective in supporting growth. Furthermore, real-time evaluation necessary for improving musical expression and optimization of practice based on physiological states were difficult.

[0262] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0263] In this invention, the server includes means for collecting user voice data and transmitting the digitized voice information to the server; means for an AI voice analysis engine to evaluate voice characteristics and musical indicators using the voice information; and means for creating a user-specific practice strategy based on the evaluation results obtained by the analysis means. This enables the user to receive a personalized practice plan and real-time feedback, resulting in effective improvement of singing ability.

[0264] "Data collection means" refers to a device or software that has the function of capturing the user's voice, converting it into a digital format, and transmitting it to a server.

[0265] "Analysis means" refers to a system that uses a voice analysis engine on the server side to evaluate the voice quality and musical indicators of voice data.

[0266] "Generation means" refers to a system that has the function of automatically creating practice strategies tailored to individual users based on analyzed data.

[0267] A "feedback mechanism" is a device or function that provides real-time evaluation and suggestions for improvement of the user's practice and performance based on the generated practice strategy.

[0268] A "monitoring system" is a system that monitors the user's physiological data and recommends appropriate warm-up exercises according to their physical and mental state.

[0269] A "visualization tool" is a system that visually displays a user's practice progress and long-term growth, and has functions to boost motivation.

[0270] In this invention, the user operates a system that provides a voice practice environment. The user begins by collecting voice data using the microphone of a dedicated terminal. This terminal is equipped with high-performance voice recognition technology that converts the collected voice into digital data in real time and transmits it to a server over a network.
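The conversion of captured voice into digital data can be illustrated in software as quantization to 16-bit PCM. In a real terminal this step is performed by hardware; the sketch below assumes samples already normalized to the range -1.0 to 1.0 and a little-endian signed 16-bit format, neither of which is specified in the disclosure.

```python
import struct


def to_pcm16(samples: list[float]) -> bytes:
    """Quantize samples in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    ints = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))       # clip out-of-range input
        ints.append(int(round(clipped * 32767)))
    return struct.pack(f"<{len(ints)}h", *ints)
```

Each sample becomes two bytes, so the returned payload is what a terminal could stream to the server over the network.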

[0271] Upon receiving audio data, the server utilizes an AI voice analysis engine to analyze the speech. This analysis engine evaluates musical indicators such as voice quality, pitch, and rhythm, and based on these, understands the user's vocal characteristics. Based on the analysis results, the server generates a practice strategy optimized for the user's characteristics. This generation process uses voice processing algorithms and generative AI models.
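As one example of how a pitch indicator might be computed from a voice waveform, the sketch below estimates the fundamental frequency by autocorrelation. This is a textbook technique chosen for illustration; the disclosure does not say which algorithm the AI voice analysis engine uses, and the search range of 80 to 1000 Hz is an assumption.

```python
import math


def estimate_pitch_hz(samples: list[float], sample_rate: int,
                      fmin: float = 80.0, fmax: float = 1000.0) -> float:
    """Return the frequency whose lag maximizes the autocorrelation,
    or 0.0 if no positive correlation is found."""
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic 200 Hz tone, 0.1 s at 8 kHz, standing in for a sung note.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(800)]
pitch = estimate_pitch_hz(tone, 8000)
```

This direct form costs time proportional to the number of lags times the frame length, which is acceptable for short frames; production systems typically use faster or more robust estimators.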

[0272] The generated practice strategy is sent from the server to the terminal and presented to the user through the terminal's interface. This strategy includes specific exercises and goals, allowing the user to practice while receiving real-time feedback. This feedback is provided as visual and auditory instructions to help with pitch adjustments and rhythm improvements.

[0273] Furthermore, the device can monitor physiological indicators such as heart rate and respiration using sensors. The server analyzes this data and recommends relaxation techniques and warm-up exercises as needed. This reduces the physical burden on the user and provides a safe and effective training environment.

[0274] Furthermore, the server stores the user's practice history in a database and generates a dashboard that visualizes their progress. Users can check this progress information through their devices, which can help them maintain their motivation.
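Storing practice history and reading it back for a dashboard can be sketched with an SQLite table. The table name, columns, and score fields are hypothetical; the disclosure only states that history is stored in a database and visualized.

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the practice-history store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "user_id TEXT, practiced_on TEXT, pitch_score REAL, rhythm_score REAL)"
    )
    return conn


def record_session(conn, user_id, practiced_on, pitch_score, rhythm_score):
    """Append one practice session (date as an ISO 8601 string)."""
    conn.execute(
        "INSERT INTO sessions VALUES (?, ?, ?, ?)",
        (user_id, practiced_on, pitch_score, rhythm_score),
    )
    conn.commit()


def dashboard_rows(conn, user_id):
    """Rows for a progress chart for one user, oldest session first."""
    cur = conn.execute(
        "SELECT practiced_on, pitch_score, rhythm_score FROM sessions "
        "WHERE user_id = ? ORDER BY practiced_on",
        (user_id,),
    )
    return cur.fetchall()
```

Because ISO 8601 date strings sort chronologically, ordering by `practiced_on` yields the time series a dashboard would plot.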

[0275] For example, a user practicing opera could use a prompt like this: "I want to practice opera arias. For this practice session, I want to improve my pitch and rhythm accuracy, so please provide a practice plan and feedback specifically tailored to that."

[0276] This system comprehensively provides a series of processes, from collecting and analyzing audio data to generating practice strategies and providing feedback, supporting the efficient improvement of users' singing abilities.

[0277] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0278] Step 1:

[0279] The user vocalizes into the microphone of the terminal. The terminal converts this voice into digital voice data and transmits it to the server over the network. The input of this step is the user's raw voice, and the output is the voice data in digital format. Specifically, the A/D converter built into the terminal converts the analog voice into a digital signal.

[0280] Step 2:

[0281] The server applies the received digital voice data to the AI voice analysis engine. The analysis engine analyzes the voice data, extracts musical features such as voice quality, pitch, and rhythm, and evaluates them as numerical data. The input of this step is the voice data in digital format, and the output is the numerical data indicating the user's vocal musical features. Specifically, the feature extraction algorithm performs the operation of calculating the necessary indicators from the voice waveform.

[0282] Step 3:

[0283] Based on the analyzed vocal musical features, the server generates an individual practice strategy for the user. Here, a generation AI model is used to create a practice plan that includes exercises to strengthen the user's weaknesses and areas for improvement. The input of this step is the numerical data of the user's vocal musical features, and the output is a specific practice plan. Specifically, the AI model performs the operation of recommending the optimal practice content based on past data.
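How the numerical features might be handed to a generative AI model can be sketched as a prompt builder. The 0 to 100 score scale, the feature names, and the function `build_plan_prompt` are hypothetical; the call to the model itself is omitted because the source does not specify an API.

```python
def build_plan_prompt(features, goal):
    """Assemble a text prompt from numeric vocal features (assumed 0-100 scores)."""
    weakest = min(features, key=features.get)   # lowest score = main weakness
    lines = [
        "You are a vocal coach. Create a practice plan for the user.",
        "Goal: " + goal,
        "Measured scores (0-100):",
    ]
    for name in sorted(features):
        lines.append("- %s: %d" % (name, features[name]))
    lines.append("Prioritize exercises that strengthen: " + weakest)
    return "\n".join(lines)

# Illustrative values only; not taken from the source.
prompt = build_plan_prompt(
    {"pitch": 62, "rhythm": 81, "voice_quality": 74},
    "Improve pitch and rhythm accuracy in opera arias",
)
```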

[0284] Step 4:

[0285] The generated practice plan is transmitted from the server to the terminal and presented to the user visually and audibly. The input here is the practice plan data, and the output is the content presented to the user on the terminal. Specifically, the user interface receives the plan information and performs the operations of screen display and voice output.

[0286] Step 5:

[0287] When the user begins practicing, the device again collects audio and sends it to the server to generate scientific feedback. The server uses an AI analysis engine to perform real-time evaluations and provide feedback on pitch deviations and rhythmic irregularities. The input for this step is the user's audio data during practice, and the output is real-time feedback to the user. Specifically, the feedback content is calculated and sent to the device immediately.
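Feedback on pitch deviation can be sketched by measuring the sung frequency against a target note in cents (1200 cents per octave). The 20-cent tolerance, the message wording, and the function names are assumptions made for illustration.

```python
import math

def cents_off(sung_hz, target_hz):
    """Signed deviation of the sung pitch from the target, in cents."""
    return 1200.0 * math.log2(sung_hz / target_hz)

def pitch_feedback(sung_hz, target_hz, tolerance_cents=20.0):
    """Turn a pitch deviation into a short feedback message."""
    cents = cents_off(sung_hz, target_hz)
    if cents > tolerance_cents:
        return "The pitch is a little high. Try making it a little lower."
    if cents < -tolerance_cents:
        return "The pitch is a little low. Try making it a little higher."
    return "The pitch is on target."
```

For example, `pitch_feedback(452.0, 440.0)` reports a high pitch, because 452 Hz lies roughly 47 cents above 440 Hz.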

[0288] Step 6:

[0289] The device collects the user's heart rate and respiratory data using sensors and sends it to the server. The server analyzes this physiological data and suggests relaxation and warm-up exercises as needed. The input for this step is physiological data, and the output is suggested exercises. Specifically, the server assesses the stress level and selects relaxation methods.
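The stress assessment in this step could, in its simplest form, be a set of threshold rules. The thresholds, the resting heart rate of 70 bpm, and the suggestion texts below are illustrative assumptions only; they are not clinical values and do not come from the source.

```python
def assess_stress(heart_rate_bpm, breaths_per_min, resting_hr=70.0):
    """Return 'low', 'moderate', or 'high' using simple threshold rules."""
    hr_ratio = heart_rate_bpm / resting_hr
    if hr_ratio >= 1.5 or breaths_per_min >= 25:
        return "high"
    if hr_ratio >= 1.2 or breaths_per_min >= 20:
        return "moderate"
    return "low"

# Hypothetical mapping from stress level to a suggested exercise.
SUGGESTIONS = {
    "high": "Pause and do slow diaphragmatic breathing.",
    "moderate": "Do a gentle lip-trill warm-up before continuing.",
    "low": "Continue with the current exercise.",
}

def suggest_exercise(heart_rate_bpm, breaths_per_min):
    """Select a suggestion based on the assessed stress level."""
    return SUGGESTIONS[assess_stress(heart_rate_bpm, breaths_per_min)]
```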

[0290] Step 7:

[0291] The server continuously records the user's practice history and generates a dashboard that visualizes their progress. The dashboard is displayed on the device, showing the user their progress. The input for this step is the user's practice data, and the output is visualized progress information. Specifically, it extracts information from the database and generates graphs.
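Extracting practice history from a database and turning it into a graph can be sketched with Python's built-in `sqlite3` module. The table layout, the sample rows, and the text-based bar chart are assumptions; the source does not describe the schema or the chart type.

```python
import sqlite3

# In-memory database with made-up sample rows (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE practice (day TEXT, pitch_score REAL)")
conn.executemany("INSERT INTO practice VALUES (?, ?)", [
    ("2024-12-01", 60.0), ("2024-12-01", 64.0),
    ("2024-12-02", 70.0), ("2024-12-03", 78.0),
])

def daily_averages(conn):
    """Return (day, average score) rows ordered by day."""
    cur = conn.execute(
        "SELECT day, AVG(pitch_score) FROM practice GROUP BY day ORDER BY day")
    return cur.fetchall()

def text_chart(rows, width=40):
    """Render one bar per day, scaled so that a score of 100 fills `width`."""
    lines = []
    for day, avg in rows:
        bar = "#" * int(round(avg / 100.0 * width))
        lines.append("%s %s %.1f" % (day, bar, avg))
    return "\n".join(lines)
```

A production dashboard would render graphics on the terminal; the text chart only shows how stored scores become a progress view.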

[0292] (Application Example 1)

[0293] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0294] For users to efficiently improve their singing ability, personalized practice plans and immediate feedback are crucial. However, conventional technology struggles to provide such a high level of personalization, and there is a particular lack of systems to support practice in home environments. Furthermore, the absence of mechanical devices to effectively support users during vocal practice is also a problem.

[0295] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0296] In this invention, the server includes data acquisition means for acquiring user voice data, analysis means for analyzing the voice data and evaluating the user's acoustic and vocal characteristics, and plan generation means for generating a practice plan customized for the user based on the results evaluated by the analysis means. This provides the user with a suitable practice plan and real-time feedback, making it possible to effectively improve singing ability even in a home environment.

[0297] "Data acquisition means" refers to functions or devices for acquiring user voice data.

[0298] "Analysis means" refers to functions or devices that analyze acquired audio data and evaluate the user's acoustic characteristics and vocalization characteristics.

[0299] A "plan generation means" is a function or device that generates a customized practice plan for the user based on the evaluation results obtained by the analysis means.

[0300] A "feedback provision means" refers to a function or device that provides real-time feedback to the user based on the generated practice plan.

[0301] "Monitoring means" refers to functions or devices that monitor a user's practice history and growth trends, and visualize progress data.

[0302] The "household mechanical device operation means" is a function or device that uses household mechanical devices to present instant feedback based on sound and rhythm to users when they perform vocal exercises.

[0303] The present invention is a system that acquires and analyzes users' voice data and generates and provides personalized practice plans in order to effectively support users' vocal exercises. This system includes data acquisition means for collecting users' voice data, and uses a microphone to acquire voice data in real time. The acquired voice data is transmitted to a server and analyzed by an AI voice analysis engine.

[0304] The server evaluates users' acoustic characteristics and vocalization characteristics through the analysis of voice data. Specifically, analysis software developed in Python or C++, or libraries such as TensorFlow and PyTorch are used. Based on the analysis results, a practice plan suitable for the user is generated by the plan generation means.

[0305] The generated practice plan provides real-time feedback to the user through a terminal. The feedback is presented to the user visually or aurally by the household mechanical device operation means. The user's heart rate and breathing data are also collected through sensors and reflected in suggestions such as relaxation techniques.

[0306] As a specific example, consider the case where a child who practices singing at home uses this system. When the child practices singing, the system provides instant feedback such as "The pitch is a little high. Try making it a little lower" when the pitch or rhythm is off.

[0307] An example of a prompt sentence for the generation AI model is "Analyze the user's voice data in real time, evaluate the pitch and vocal quality, and generate an optimal practice plan." As a result, even in a household environment, users can continuously improve their singing ability.

[0308] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0309] Step 1:

[0310] When a user begins singing using the device, the device collects audio data via its microphone. The user's voice waveform is obtained as input and sent to the server.

[0311] Step 2:

[0312] The server passes the received audio data to an AI speech analysis engine, which analyzes the acoustic and vocal characteristics. Specifically, it uses an FFT (Fast Fourier Transform) to analyze the frequency components of the audio signal and extract pitch and rhythm. The output of this analysis is data on the quality, pitch, and rhythm of the user's voice.
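The frequency analysis in this step can be illustrated with a small sketch that finds the strongest frequency component of a signal. For brevity it uses a direct discrete Fourier transform, which produces the same spectrum that an FFT computes more efficiently. The function name `dominant_frequency` is hypothetical, and for a real voice the strongest component is not always the fundamental.

```python
import cmath
import math

def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the largest-magnitude DFT bin."""
    n = len(samples)
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):               # skip DC, stop below Nyquist
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n         # convert bin index to Hz
```

The frequency resolution is the sample rate divided by the number of samples, so longer analysis windows give finer pitch estimates at the cost of latency.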

[0313] Step 3:

[0314] Based on the analysis results, the server generates a customized practice plan for the user using a plan generation mechanism. The input is audio analysis data, and the output is a list of practice items and goals suitable for the user. This process is performed using a generation AI model, and prompts such as "Analyze the user's audio data in real time, evaluate pitch and voice quality, and generate the optimal practice plan" are used to determine practice content that takes the user's progress into consideration.

[0315] Step 4:

[0316] The device receives the generated practice plan and provides real-time presentation and feedback to the user. Specifically, real-time guidance is provided to the user through visual displays and voice messages. Voice data and the practice plan are used as input for feedback, and progress and areas for improvement are communicated to the user as output.

[0317] Step 5:

[0318] The device monitors the user's heart rate and respiratory data in real time and suggests relaxation techniques as needed. Input is data acquired by physiological data sensors, and output provides the user with instructions and suggestions regarding relaxation.

[0319] Step 6:

[0320] The server stores the user's practice history and monitors long-term growth trends. Past practice data is stored on the server as input, and information visualizing the growth trend is generated on a dashboard for the user as output. This allows users to visually understand their own progress.
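One simple way to quantify a long-term growth trend is the least-squares slope of a score over successive practice sessions. The choice of linear regression, the `flat_band` threshold, and the function names are assumptions; the source does not state how trends are computed.

```python
def trend_slope(scores):
    """Least-squares slope of score against session index."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two sessions")
    mean_x = (n - 1) / 2.0
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def describe_trend(scores, flat_band=0.5):
    """Label the trend; slopes within +/- flat_band count as flat."""
    slope = trend_slope(scores)
    if slope > flat_band:
        return "improving"
    if slope < -flat_band:
        return "declining"
    return "flat"
```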

[0321] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0322] This invention enhances the effectiveness of vocal training by adding a function to recognize the user's emotions to a system that provides users with personalized practice plans and real-time feedback during voice training. This aims to improve not only the quality of the user's voice but also their overall singing ability, including emotional expression.

[0323] The system first collects the user's voice and biometric data via the terminal. When the user speaks into the terminal, the voice data is sent to the server. Once the voice data reaches the server, it is input into an AI voice analysis engine, where the voice quality and vocal characteristics are analyzed in detail.

[0324] In addition, this system incorporates an emotion engine that analyzes the user's emotional state from voice data. The emotion engine identifies the user's emotions based on factors such as voice intonation and rhythm, as well as heart rate fluctuations obtained from biometric data.

[0325] Based on the analysis results, the server generates a personalized practice plan tailored to the user's emotions and voice quality. This practice plan includes specific exercises that utilize the user's current emotional state, enabling them to improve their emotional expression.

[0326] The generated practice plan is sent from the server to the device and presented to the user on the device. The device also receives feedback from the server during practice and notifies the user in real time. The feedback includes not only areas for improvement in voice but also evaluations of the emotions being expressed.

[0327] For example, if a user is practicing an opera aria, the system will grasp the emotions embedded in the piece and provide a practice plan based on that. If the user struggles with expressing those emotions at a certain stage, the system can provide precise advice based on the analysis results of its emotion engine. In this way, users can hone not only their musical technique but also their expressive abilities simultaneously.

[0328] The following describes the processing flow.

[0329] Step 1:

[0330] The user vocalizes with emotion toward the device. The device collects biometric data, such as the user's heart rate, along with the voice data.

[0331] Step 2:

[0332] The device sends the collected voice and biometric data to the server. The server inputs the received data into its AI voice analysis engine and begins the analysis.

[0333] Step 3:

[0334] The server performs speech analysis to evaluate vocal characteristics such as voice quality, pitch, and rhythm. Simultaneously, it uses an emotion engine to analyze the user's emotional state from the speech data. This includes changes in intonation and rhythm, as well as fluctuations in biometric data.

[0335] Step 4:

[0336] Based on the analysis results, the server generates a practice plan optimized for the user's emotions and vocal characteristics. This plan includes specific exercises that take the user's emotional state into account, as well as content aimed at improving expressive techniques.

[0337] Step 5:

[0338] The server sends the generated practice plan to the terminal. The terminal receives the practice plan and presents it to the user visually and audibly. The user performs vocal exercises according to the instructed practice plan.

[0339] Step 6:

[0340] While the user is practicing, the device continuously collects voice and biometric data and sends it to the server in real time. The server generates feedback based on this data.
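Continuous collection during practice implies splitting the audio stream into short frames that can be sent and analyzed one at a time. The sketch below assumes fixed-size, overlapping frames; the frame and hop sizes and the name `iter_frames` are hypothetical.

```python
def iter_frames(samples, frame_size, hop_size):
    """Yield successive frames of `frame_size` samples, advancing by `hop_size`."""
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        yield samples[start:start + frame_size]
```

For instance, with a frame size of 4 and a hop of 2, ten samples yield four frames that each overlap the previous one by half.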

[0341] Step 7:

[0342] The server generates feedback, including technical improvements to the user's voice and advice on emotional expression. This feedback includes specific suggestions for improvement and advice on how to express emotions effectively.

[0343] Step 8:

[0344] The device provides users with real-time feedback from the server, allowing them to immediately apply it to their practice. For example, it might suggest specific ways to emphasize areas where emotional expression is lacking.

[0345] Step 9:

[0346] The server accumulates all of the user's practice data over the long term and analyzes changes in emotional state and technical growth trends. This data is provided to the user and can be used to improve their next practice session.

[0347] Based on this feedback and progress data, users can set new goals to improve their expressive abilities and continue practicing.

[0348] (Example 2)

[0349] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0350] Conventional voice training systems primarily focused on voice quality and technical aspects, limiting their ability to improve users' emotional expression. Furthermore, they faced challenges in appropriately evaluating the impact of the user's physical and mental state on the training's effectiveness and incorporating this into the training plan.

[0351] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0352] In this invention, the server includes data collection means for acquiring the user's voice data and biometric data, analysis means for analyzing voice and emotions and evaluating the user's voice quality and emotional state, and generation means for generating a practice plan that utilizes the user's emotions. This makes it possible to improve the user's voice quality and simultaneously enhance their emotional expression.

[0353] "Data collection means" refers to a device or method for acquiring a user's voice data and biometric data.

[0354] "Analysis means" refers to a device or method for evaluating the user's voice quality and vocal characteristics from acquired audio data, and for further analyzing their emotions.

[0355] "Emotional analysis means" refers to a device or method for identifying a user's emotional state based on voice intonation and biometric data.

[0356] "Generation means" refers to an apparatus or method for generating a personalized practice plan for a user based on the analysis results.

[0357] A "feedback device" is a device or method that provides real-time evaluation of voice and emotional expression in accordance with the generated practice plan.

[0358] "Monitoring means" refers to a device or method for tracking a user's practice history and growth trends, and for visualizing their progress.

[0359] This system aims to support users' voice training and improve their overall abilities, including emotional expression. Users input voice data through a device, and this voice data and biometric data are collected. Data collection is performed using a microphone and heart rate sensor built into the device.

[0360] The device sends this data to the server, which first uses an AI voice analysis engine to analyze the quality, tone, and rhythm of the voice. This voice analysis utilizes machine learning algorithms. Furthermore, an emotion analysis engine analyzes the intonation and rhythm of the voice data, together with vital data and the like, to identify the user's emotional state at that moment.

[0361] Based on the analysis results, the server generates a practice plan optimized for the user. This plan includes content that takes advantage of the user's emotional state during practice. For example, if relaxation is needed, vocal exercises that promote relaxation can be incorporated.

[0362] The practice plan is sent from the server to the terminal and presented to the user. The terminal screen displays the practice steps and provides voice guidance. Voice and biometric data during practice are sent back to the server to provide real-time feedback.

[0363] For example, if a user enters a prompt such as, "I want to sing opera arias with more emotion. Please advise me on what kind of practice I should do," the system can instantly generate and present a practice plan that meets that request.

[0364] This system allows users to simultaneously improve their voice skills and emotional expression. Furthermore, by incorporating the latest trends in the music field, it allows for the continuous adoption of new training methods.

[0365] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0366] Step 1:

[0367] The user speaks into the device. The device captures the voice data with its built-in microphone and simultaneously collects biometric data such as heart rate using sensors. Both voice data and biometric data are acquired as input. This provides basic data to understand the user's vocalizations.

[0368] Step 2:

[0369] The terminal transmits the collected voice and biometric data to the server. The data is encrypted using a secure protocol and safely transferred to the server. The input here is the collected voice and biometric data, which is encrypted for transfer, and the output is the same data as received by the server in its original data format.
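The source says only that a "secure protocol" is used. Assuming that protocol is TLS, the client side could be configured with Python's standard `ssl` module as in the following sketch; the function name and the choice of TLS 1.2 as the minimum version are assumptions.

```python
import ssl

def make_client_tls_context():
    """Create a TLS client context that verifies the server certificate."""
    ctx = ssl.create_default_context()            # enables cert and hostname checks
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocol versions
    return ctx
```

A terminal would then wrap its network socket with this context before sending voice and biometric data, so the payload is encrypted in transit and decrypted on arrival.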

[0370] Step 3:

[0371] The server inputs the received audio data into an AI speech analysis engine. This analysis engine uses deep learning algorithms to analyze the quality, tone, and rhythm of the speech in detail. The input is the audio data sent to the server, and the output is the speech characteristic information as a result of the analysis.

[0372] Step 4:

[0373] The server simultaneously analyzes voice and biometric data using an emotion analysis engine to identify the user's emotional state. The input includes voice intonation, pacing, and heart rate changes, while the output provides an evaluation of the user's emotional state.

[0374] Step 5:

[0375] The server integrates the results of voice analysis and emotion analysis to generate a personalized practice plan that reflects the user's emotions. The input is the analysis results, and the output is an individually customized practice plan. This provides the user with practice content optimized for their needs.

[0376] Step 6:

[0377] The server sends the generated practice plan to the terminal, and the terminal presents the plan to the user. At this time, the practice steps are displayed on the terminal as visual and audio guides. The input is the practice plan, and the output is the training content presented to the user.

[0378] Step 7:

[0379] The user practices according to the practice plan provided on the device. The device collects voice and biometric data again and sends it to the server in real time. The server generates feedback based on this data and notifies the user in real time through the device. The input is the user's practice data, and the output is feedback including areas for improvement and evaluation.

[0380] (Application Example 2)

[0381] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0382] Conventional voice training systems focused solely on the technical aspects of voice, failing to adequately improve users' emotional expression. Furthermore, the lack of feedback functions tailored to emotional states made it difficult for users to obtain practice plans that were expressive and emotionally responsive. This left the overall improvement of users' singing abilities as a challenge.

[0383] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0384] In this invention, the server includes data collection means for acquiring user voice data and biometric information, analysis means for analyzing the voice data, evaluating voice quality and vocalization characteristics, and recognizing emotional states, and generation means for generating a personalized practice plan based on the analysis results, including improving emotional expression. This allows the user to receive real-time feedback not only on vocal technique but also on emotional expression, enabling comprehensive improvement of singing ability.

[0385] "Data collection means" refers to a device or method for acquiring a user's voice data and biometric information.

[0386] "Analysis means" refers to a device or method that uses acquired audio data to evaluate the user's voice quality and vocalization characteristics, and further recognizes the user's emotional state from the audio.

[0387] "Generation means" refers to a device or method for creating a personalized practice plan tailored to the user's emotional state based on analysis results, with the aim of improving emotional expression skills.

[0388] A "feedback device" is a device or method that provides the user with real-time suggestions for improvement in voice and emotional expression based on the generated practice plan.

[0389] "Monitoring means" refers to a device or method that tracks a user's practice history and growth trends, visualizes progress data, and evaluates it, including changes in emotional expression.

[0390] The system implementing the present invention is configured to effectively utilize the user's voice data and biometric information to comprehensively improve vocalization and emotional expression.

[0391] First, the user speaks into a device equipped with a microphone and sensors. This device could be a smartphone, tablet, or personal computer. The device collects voice data and biometric information such as heart rate and respiration, and this data is transmitted to a server via the network.

[0392] To process speech in real time, the server uses an AI speech analysis engine to analyze voice quality and vocal characteristics in detail. Furthermore, to recognize emotional states, an emotion engine analyzes the intonation and rhythm of the speech together with the acquired biometric information. These analysis processes are generally performed using GPUs and high-performance computing environments.

[0393] Subsequently, the server's generation mechanism creates an optimal practice plan for the user based on the analysis results. This plan includes practice content aimed at improving expressiveness, taking into account the user's emotional state. The generated practice plan is immediately sent to the terminal and presented to the user. The terminal provides the user with real-time feedback, offering advice on areas for improvement in voice and emotional expression.

[0394] As a concrete example, consider a user practicing opera arias on their days off. The system grasps the emotions embedded in the piece and provides precise advice when the user struggles to express those emotions. In this way, users can simultaneously hone not only their musical technique but also their ability to express emotions richly.

[0395] A concrete example of a prompt for a generative AI model would be: "Please design an application that analyzes voice data and recognizes user emotions in real time. This application will be installed on a consumer robot and will optimize user feedback."

[0396] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0397] Step 1:

[0398] The device acquires voice data and biometric information simultaneously when the user speaks into the microphone. It collects voice signals and data such as heart rate and respiration as input, generating an initial digital voice file and biometric signal data. This information is then transmitted to a server via the network.

[0399] Step 2:

[0400] The server inputs the received audio data into an AI voice analysis engine to analyze voice quality and vocal characteristics. This process involves spectral analysis of the audio waveform to extract features such as pitch, tone, and volume. The analysis output provides detailed data on the technical characteristics of the user's voice.

[0401] Step 3:

[0402] The server uses an emotion engine to identify the user's emotional state from the voice data while simultaneously performing voice analysis. The input includes the intonation and rhythm of the voice data, together with biometric data. The emotion engine analyzes this data to identify emotional categories (e.g., joy, sadness, surprise). The result is output as data indicating the user's emotional state.
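Mapping features to emotional categories can be illustrated with a toy nearest-centroid classifier. This is not the emotion engine described in the source, whose internals are not disclosed. The three features (intonation variability, tempo, and heart rate, each assumed to be normalized to the range 0 to 1) and all centroid values are invented for illustration.

```python
import math

# Hypothetical centroids: (intonation variability, tempo, heart rate), each 0-1.
CENTROIDS = {
    "joy": (0.7, 0.7, 0.6),
    "sadness": (0.2, 0.3, 0.4),
    "surprise": (0.9, 0.5, 0.8),
}

def classify_emotion(features):
    """Return the category whose centroid is closest in Euclidean distance."""
    return min(CENTROIDS, key=lambda label: math.dist(features, CENTROIDS[label]))
```

A trained model would learn such decision boundaries from labeled recordings instead of relying on hand-picked values.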

[0403] Step 4:

[0404] The server generates a personalized practice plan based on voice analysis results and emotional state. This planning takes into account the user's current technical ability and emotional needs, personalizing music selection and practice intensity. The generated practice plan is then sent to the device.

[0405] Step 5:

[0406] The device presents the user with a generated practice plan to apply during practice. It uses audio playback and text instructions to show the user specific practice content. The output of this step is visual or auditory information that guides the user through the next session.

[0407] Step 6:

[0408] The device monitors the user's performance during practice and provides real-time feedback from the server. It notifies the user of areas for improvement based on audio data or practice results, encouraging improvements in musical techniques, including emotional expression.

[0409] Step 7:

[0410] The server continuously records the user's practice history and analyzes growth trends. It uses past practice data as input to evaluate practice frequency and results. Based on this data, it generates progress reports that visualize progress, enabling users to self-evaluate their performance.

[0411] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0412] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0413] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0414] [Third Embodiment]

[0415] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0416] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0417] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0418] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0419] The microphone 238 receives voice from the user 20, thereby accepting instructions from the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0420] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0421] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0422] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0423] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0424] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0425] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0426] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0427] The system according to the present invention provides highly personalized practice plans and feedback to support the user's improvement of singing ability. This enables efficient practice based on the user's vocal characteristics, supporting long-term growth and vocal health management.

[0428] The system first collects the user's voice data via the terminal. When the user speaks into the terminal, the voice data is sent to the server in real time. On the server, the received voice data is analyzed by an AI voice analysis engine, and vocal characteristics such as voice quality, pitch, and rhythm are evaluated in detail. Based on this evaluation result, an optimal practice plan is generated for the user.

[0429] The generated practice plan is sent from the server to the terminal and presented to the user through the terminal. This practice plan includes specific exercises and goals that the user should work on, and the degree of achievement and areas for improvement are shown through real-time feedback. Feedback is provided immediately, for example, when the user sings off-key, and is communicated to the user in visual or audible form.

[0430] In addition, the system monitors the user's heart rate and respiratory data in real time and suggests relaxation techniques and warm-up exercises as needed. This information is acquired via sensors, analyzed on a server, and then transmitted to the terminal.

[0431] The system also accumulates the user's practice history and analyzes long-term growth trends. Based on this analysis, the server generates a dashboard that visualizes the user's progress. This dashboard is presented to the user via their device, letting them see their own progress and helping them stay motivated.

[0432] For example, in the case of a user practicing opera, the system creates a practice plan focused on a specific part of an aria based on the results of the previous practice session. If there are any problems with pitch or rhythm during practice, feedback is provided instantly, allowing for immediate correction. Through this process, the user can continue practicing while checking their own progress.

[0433] The following describes the processing flow.

[0434] Step 1:

[0435] The user speaks into the device and selects the song or scale they want to practice. The device captures this voice and collects the audio data in real time.

[0436] Step 2:

[0437] The terminal sends the collected audio data to the server. The server inputs the received audio data into the AI voice analysis engine.

[0438] Step 3:

[0439] The server uses an AI voice analysis engine to analyze the voice data. This analysis evaluates vocal characteristics such as voice quality, pitch, volume, and rhythm.

[0440] Step 4:

[0441] The server generates a practice plan optimized for the user based on the analysis results. This practice plan includes specific exercises and areas for improvement.

[0442] Step 5:

[0443] The server sends the generated practice plan to the device. The device receives the practice plan and presents it to the user. This may be displayed visually using icons and text, and voice guidance is also possible.

[0444] Step 6:

[0445] The user follows the instructions on the device to perform exercises, while heart rate and respiratory data are recorded during the process. This data is acquired by sensors and sent to the device.

[0446] Step 7:

[0447] The device sends practice status to the server in real time. The server generates feedback information and points out areas where the user needs to improve.

[0448] Step 8:

[0449] The device provides the user with feedback received from the server. Specifically, it visually or audibly points out issues such as pitch discrepancies or rhythmic inconsistencies.

[0450] Step 9:

[0451] The server accumulates user practice data and analyzes long-term growth trends. Based on this analysis, it generates a dashboard that visualizes progress.

[0452] Step 10:

[0453] Users can check the dashboard through their device to understand their progress and achievements, and use that information to practice towards their next goals.
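
The rhythm check mentioned in Step 8 can be illustrated with a minimal sketch. It assumes a fixed tempo and that note onset times have already been detected. The function name `rhythm_deviations` and the 50 ms tolerance are hypothetical choices for illustration and do not appear in this disclosure.

```python
def rhythm_deviations(onsets_s, tempo_bpm, tolerance_ms=50.0):
    """Compare sung note onsets with the nearest beat of a fixed-tempo grid.

    onsets_s: onset times in seconds (assumed to be detected upstream).
    tolerance_ms: hypothetical threshold, not taken from the disclosure.
    """
    beat_s = 60.0 / tempo_bpm
    report = []
    for onset in onsets_s:
        nearest_beat = round(onset / beat_s)
        deviation_ms = (onset - nearest_beat * beat_s) * 1000.0
        if deviation_ms > tolerance_ms:
            verdict = "late"
        elif deviation_ms < -tolerance_ms:
            verdict = "early"
        else:
            verdict = "on time"
        report.append((nearest_beat, round(deviation_ms, 1), verdict))
    return report


# At 120 BPM one beat lasts 0.5 s.
report = rhythm_deviations([0.02, 0.58, 1.0, 1.42], tempo_bpm=120)
verdicts = [entry[2] for entry in report]
```

Each tuple holds the beat index, the signed deviation in milliseconds, and a verdict that the terminal could present visually or audibly.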

[0454] (Example 1)

[0455] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0456] Conventional singing practice systems lacked sufficient practice planning and feedback based on individual user characteristics, making them ineffective in supporting growth. Furthermore, real-time evaluation necessary for improving musical expression and optimization of practice based on physiological states were difficult.

[0457] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0458] In this invention, the server includes means for collecting user voice data and transmitting the digitized voice information to the server; means for an AI voice analysis engine to evaluate voice characteristics and musical indicators using the voice information; and means for creating a user-specific practice strategy based on the evaluation results obtained by the analysis means. This enables the user to receive a personalized practice plan and real-time feedback, resulting in effective improvement of singing ability.

[0459] "Data collection means" refers to a device or software that has the function of capturing the user's voice, converting it into a digital format, and transmitting it to a server.

[0460] "Analysis means" refers to a system that uses a voice analysis engine on the server side to evaluate the voice quality and musical indicators of voice data.

[0461] "Generation means" refers to a system that has the function of automatically creating practice strategies tailored to individual users based on analyzed data.

[0462] A "feedback mechanism" is a device or function that provides real-time evaluation and suggestions for improvement of the user's practice and performance based on the generated practice strategy.

[0463] A "monitoring system" is a system that monitors the user's physiological data and recommends appropriate warm-up exercises according to their physical and mental state.

[0464] A "visualization tool" is a system that visually displays a user's practice progress and long-term growth, and has functions to boost motivation.

[0465] In this invention, the user operates a system that provides a voice practice environment. The user begins by collecting voice data using the microphone of a dedicated terminal. This terminal is equipped with high-performance voice recognition technology that converts the collected voice into digital data in real time and transmits it to a server over a network.

[0466] Upon receiving audio data, the server utilizes an AI voice analysis engine to analyze the speech. This analysis engine evaluates musical indicators such as voice quality, pitch, and rhythm, and based on these, understands the user's vocal characteristics. Based on the analysis results, the server generates a practice strategy optimized for the user's characteristics. This generation process uses voice processing algorithms and generative AI models.

[0467] The generated practice strategy is sent from the server to the terminal and presented to the user through the terminal's interface. This strategy includes specific exercises and goals, allowing the user to practice while receiving real-time feedback. This feedback is provided as visual and auditory instructions to help with pitch adjustments and rhythm improvements.

[0468] Furthermore, the device can monitor physiological indicators such as heart rate and respiration using sensors. The server analyzes this data and recommends relaxation techniques and warm-up exercises as needed. This reduces the physical burden on the user and provides a safe and effective training environment.
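
As a rough illustration of how physiological indicators might be mapped to a recommendation, the following sketch uses simple threshold rules. The `Physiology` type, the `suggest_warmup` function, the resting heart rate, and every threshold are assumptions made for this example. The disclosure does not specify how the server makes this decision.

```python
from dataclasses import dataclass


@dataclass
class Physiology:
    heart_rate_bpm: float
    breaths_per_min: float


def suggest_warmup(sample, resting_hr_bpm=65.0):
    """Pick a recommendation from heart rate and respiration.

    All thresholds below are hypothetical placeholders.
    """
    hr_ratio = sample.heart_rate_bpm / resting_hr_bpm
    if hr_ratio >= 1.4 or sample.breaths_per_min >= 22:
        return "relaxation: slow diaphragmatic breathing before singing"
    if hr_ratio >= 1.2 or sample.breaths_per_min >= 18:
        return "gentle warm-up: lip trills and humming"
    return "standard warm-up: scales at a comfortable volume"


suggestion = suggest_warmup(Physiology(heart_rate_bpm=100, breaths_per_min=16))
```

A deployed system would more plausibly calibrate such thresholds per user from the accumulated practice history.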

[0469] Furthermore, the server stores the user's practice history in a database and generates a dashboard that visualizes their progress. Users can check this progress information through their devices, which can help them maintain their motivation.
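
The storage and aggregation behind such a dashboard can be sketched as follows, assuming a single table of per-session scores. The table name `practice`, its columns, and the 0 to 100 score scale are hypothetical. An in-memory SQLite database stands in for the server's database.

```python
import sqlite3


def weekly_progress(rows, user_id):
    """Average per-week scores for one user from (user, week, pitch, rhythm) rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE practice ("
            "user_id TEXT, week INTEGER, pitch_score REAL, rhythm_score REAL)"
        )
        conn.executemany("INSERT INTO practice VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT week, AVG(pitch_score), AVG(rhythm_score), COUNT(*) "
            "FROM practice WHERE user_id = ? GROUP BY week ORDER BY week",
            (user_id,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


def render_bars(summary):
    """Render the pitch score of each week as a text bar (one '#' per 10 points)."""
    lines = []
    for week, pitch, _rhythm, _sessions in summary:
        lines.append(f"week {week}: pitch {'#' * int(pitch // 10)} {pitch:.1f}")
    return "\n".join(lines)


rows = [("u1", 1, 60, 70), ("u1", 1, 70, 80), ("u1", 2, 80, 85), ("u2", 1, 50, 50)]
summary = weekly_progress(rows, "u1")
chart = render_bars(summary)
```

The text bars only stand in for the graphs; the terminal would render the same summary graphically.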

[0470] For example, a user practicing opera could use a prompt like this: "I want to practice opera arias. For this practice session, I want to improve my pitch and rhythm accuracy, so please provide a practice plan and feedback specifically tailored to that."

[0471] This system comprehensively provides a series of processes, from collecting and analyzing audio data to generating practice strategies and providing feedback, supporting the efficient improvement of users' singing abilities.

[0472] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0473] Step 1:

[0474] The user speaks using the device's microphone. The device converts this voice into digital audio data and sends it to the server over the network. The input for this step is the user's raw voice, and the output is digital audio data. Specifically, the device's built-in A / D converter converts the analog audio into a digital signal.

[0475] Step 2:

[0476] The server processes the received digital audio data through an AI voice analysis engine. The analysis engine analyzes the audio data, extracting musical characteristics such as voice quality, pitch, and rhythm, and evaluating them as numerical data. The input for this step is digital audio data, and the output is numerical data representing the user's vocal characteristics. Specifically, a feature extraction algorithm calculates the necessary indicators from the voice waveform.

[0477] Step 3:

[0478] The server generates a personalized practice strategy for each user based on the analyzed vocal characteristics. Here, a generative AI model is used to create a practice plan that includes exercises designed to strengthen the user's weaknesses and areas for improvement. The input for this step is numerical data on the user's vocal characteristics, and the output is a specific practice plan. Specifically, the AI model recommends the most suitable practice content based on past data.

[0479] Step 4:

[0480] The generated practice plan is sent from the server to the terminal and presented to the user visually and audibly. The input here is the practice plan data, and the output is the content presented to the user on the terminal. Specifically, the user interface receives the plan information and performs actions such as screen display and audio output.

[0481] Step 5:

[0482] When the user begins practicing, the device again collects audio and sends it to the server to generate scientific feedback. The server uses an AI analysis engine to perform real-time evaluations and provide feedback on pitch deviations and rhythmic irregularities. The input for this step is the user's audio data during practice, and the output is real-time feedback to the user. Specifically, the feedback content is calculated and sent to the device immediately.

[0483] Step 6:

[0484] The device collects the user's heart rate and respiratory data using sensors and sends it to the server. The server analyzes this physiological data and suggests relaxation and warm-up exercises as needed. The input for this step is physiological data, and the output is suggested exercises. Specifically, the server assesses the stress level and selects relaxation methods.

[0485] Step 7:

[0486] The server continuously records the user's practice history and generates a dashboard that visualizes their progress. The dashboard is displayed on the device, showing the user their progress. The input for this step is the user's practice data, and the output is visualized progress information. Specifically, it extracts information from the database and generates graphs.
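
Steps 1 and 2 of the flow above can be illustrated with a small sketch of A/D quantization followed by one numerical indicator, the signal level. It assumes 16-bit signed PCM and analog samples normalized to the range -1.0 to 1.0. The function names are hypothetical, and the actual converter and feature extraction algorithm are not specified in this disclosure.

```python
import math

FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def quantize_16bit(analog_samples):
    """Map samples in [-1.0, 1.0] to signed 16-bit integers, clipping overloads."""
    pcm = []
    for x in analog_samples:
        clipped = max(-1.0, min(1.0, x))
        pcm.append(int(round(clipped * FULL_SCALE)))
    return pcm


def rms_dbfs(pcm):
    """Root-mean-square level in dB relative to full scale (0 dBFS is the maximum)."""
    mean_square = sum(s * s for s in pcm) / len(pcm)
    rms = math.sqrt(mean_square) / FULL_SCALE
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


# One second of a full-scale 440 Hz sine sampled at 8 kHz.
sample_rate = 8000
analog = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(sample_rate)]
level = rms_dbfs(quantize_16bit(analog))
```

A full-scale sine has an RMS level of about -3 dBFS, which gives a quick sanity check for the volume indicator.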

[0487] (Application Example 1)

[0488] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0489] For users to efficiently improve their singing ability, personalized practice plans and immediate feedback are crucial. However, conventional technology struggles to provide such a high level of personalization, and there is a particular lack of systems to support practice in home environments. Furthermore, the absence of mechanical devices to effectively support users during vocal practice is also a problem.

[0490] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0491] In this invention, the server includes data acquisition means for acquiring user voice data, analysis means for analyzing the voice data and evaluating the user's acoustic and vocal characteristics, and plan generation means for generating a practice plan customized for the user based on the results evaluated by the analysis means. This provides the user with a suitable practice plan and real-time feedback, making it possible to effectively improve singing ability even in a home environment.

[0492] "Data acquisition means" refers to functions or devices for acquiring user voice data.

[0493] "Analysis means" refers to functions or devices that analyze acquired audio data and evaluate the user's acoustic characteristics and vocalization characteristics.

[0494] A "plan generation means" is a function or device that generates a customized practice plan for the user based on the evaluation results obtained by the analysis means.

[0495] A "feedback provision means" refers to a function or device that provides real-time feedback to the user based on the generated practice plan.

[0496] "Monitoring means" refers to functions or devices that monitor a user's practice history and growth trends, and visualize progress data.

[0497] "Household mechanical device operating means" refers to a function or device that uses household mechanical devices to provide users with immediate feedback based on sound and rhythm when they practice vocal exercises.

[0498] This invention relates to a system that acquires and analyzes user voice data to effectively support user voice practice and generates and provides a personalized practice plan. The system includes data acquisition means for collecting user voice data, which is acquired in real time using a microphone. The acquired voice data is transmitted to a server and analyzed by an AI voice analysis engine.

[0499] The server evaluates the user's acoustic and vocal characteristics through the analysis of audio data. Specifically, it uses analysis software developed in Python or C++, as well as libraries such as TensorFlow and PyTorch. Based on the analysis results, a practice plan suitable for the user is generated by a plan generation mechanism.

[0500] Real-time feedback based on the generated practice plan is provided to the user via the device. This feedback is presented visually or audibly through the user's home-use device control system. The user's heart rate and respiratory data are also collected via sensors and used when suggesting relaxation techniques and the like.

[0501] As a concrete example, consider a child using this system to practice singing at home. When the child practices singing, the system provides immediate feedback such as, "The pitch is a little high, try lowering it a bit," if the pitch or rhythm is off.
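
A feedback message of this kind can be derived from the interval between the sung pitch and the target pitch, measured in cents (1200 cents per octave). The following is a minimal sketch. The function names and the 20-cent tolerance are assumptions for illustration, and the sung frequency is assumed to come from the analysis engine.

```python
import math


def cents_off(sung_hz, target_hz):
    """Signed interval from target to sung pitch in cents."""
    return 1200.0 * math.log2(sung_hz / target_hz)


def pitch_feedback(sung_hz, target_hz, tolerance_cents=20.0):
    """Turn a pitch deviation into a short message (tolerance is hypothetical)."""
    cents = cents_off(sung_hz, target_hz)
    if cents > tolerance_cents:
        return f"The pitch is a little high ({cents:+.0f} cents), try lowering it a bit."
    if cents < -tolerance_cents:
        return f"The pitch is a little low ({cents:+.0f} cents), try raising it a bit."
    return "The pitch is on target."


message = pitch_feedback(sung_hz=452.0, target_hz=440.0)
```

Working in cents rather than hertz keeps the tolerance musically uniform across low and high notes.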

[0502] An example of a prompt to the generative AI model is, "Analyze the user's voice data in real time, evaluate pitch and voice quality, and generate an optimal practice plan." This allows users to continuously improve their singing ability even in a home environment.

[0503] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0504] Step 1:

[0505] When a user begins singing using the device, the device collects audio data via its microphone. The user's voice waveform is obtained as input and sent to the server.

[0506] Step 2:

[0507] The server passes the received audio data to an AI speech analysis engine, which analyzes the acoustic and vocal characteristics. Specifically, it uses FFT (Fast Fourier Transform) to analyze the frequency components of the audio signal and extract pitch and rhythm. The output of this analysis is data on the quality, pitch, and rhythm of the user's voice.

[0508] Step 3:

[0509] Based on the analysis results, the server generates a customized practice plan for the user using a plan generation mechanism. The input is audio analysis data, and the output is a list of practice items and goals suitable for the user. This process is performed using a generative AI model, and prompts such as "Analyze the user's audio data in real time, evaluate pitch and voice quality, and generate the optimal practice plan" are used to determine practice content that takes the user's progress into consideration.

[0510] Step 4:

[0511] The device receives the generated practice plan and provides real-time presentation and feedback to the user. Specifically, real-time guidance is provided to the user through visual displays and voice messages. Voice data and the practice plan are used as input for feedback, and progress and areas for improvement are communicated to the user as output.

[0512] Step 5:

[0513] The device monitors the user's heart rate and respiratory data in real time and suggests relaxation techniques as needed. Input is data acquired by physiological data sensors, and output provides the user with instructions and suggestions regarding relaxation.

[0514] Step 6:

[0515] The server stores the user's practice history and monitors long-term growth trends. Past practice data is stored on the server as input, and information visualizing the growth trend is generated on a dashboard for the user as output. This allows users to visually understand their own progress.
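
The FFT-based analysis named in Step 2 can be sketched in pure Python as a radix-2 transform followed by a search for the strongest frequency bin. This is an illustrative simplification, not the engine described here. The function names are hypothetical, the frame length is assumed to be a power of two, and for a real singing voice the strongest bin may be a harmonic rather than the fundamental.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest bin of a Hann-windowed frame."""
    n = len(samples)
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(samples)
    ]
    spectrum = fft(windowed)
    magnitudes = [abs(c) for c in spectrum[1 : n // 2]]  # skip the DC bin
    peak_bin = magnitudes.index(max(magnitudes)) + 1
    return peak_bin * sample_rate / n


sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(2048)]
estimate = dominant_frequency(frame, sample_rate)
```

The estimate is accurate only to within one bin, here `sample_rate / 2048` or about 3.9 Hz. A finer pitch reading would need interpolation or a longer frame.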

[0516] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0517] This invention enhances the effectiveness of vocal training by adding a function to recognize the user's emotions to a system that provides users with personalized practice plans and real-time feedback during voice training. This aims to improve not only the quality of the user's voice but also their overall singing ability, including emotional expression.

[0518] The system first collects the user's voice and biometric data via the terminal. When the user speaks into the terminal, the voice data is sent to the server. Once the voice data reaches the server, it is input into an AI voice analysis engine, where the voice quality and vocal characteristics are analyzed in detail.

[0519] In addition, this system incorporates an emotion engine that analyzes the user's emotional state from voice data. The emotion engine identifies the user's emotions based on factors such as voice intonation and rhythm, as well as heart rate fluctuations obtained from biometric data.

[0520] Based on the analysis results, the server generates a personalized practice plan tailored to the user's emotions and voice quality. This practice plan includes specific exercises that utilize the user's current emotional state, enabling them to improve their emotional expression.

[0521] The generated practice plan is sent from the server to the device and presented to the user on the device. The device also receives feedback from the server during practice and notifies the user in real time. The feedback includes not only areas for improvement in voice but also evaluations of the emotions being expressed.

[0522] For example, if a user is practicing an opera aria, the system will grasp the emotions embedded in the piece and provide a practice plan based on that. If the user struggles with expressing those emotions at a certain stage, the system can provide precise advice based on the analysis results of its emotion engine. In this way, users can hone not only their musical technique but also their expressive abilities simultaneously.

[0523] The following describes the processing flow.

[0524] Step 1:

[0525] The user sings or speaks into the device with emotional expression. The device collects biometric data, such as the user's heart rate, along with the voice data.

[0526] Step 2:

[0527] The device sends the collected voice and biometric data to the server. The server inputs the received data into its AI voice analysis engine and begins the analysis.

[0528] Step 3:

[0529] The server performs speech analysis to evaluate vocal characteristics such as voice quality, pitch, and rhythm. Simultaneously, it uses an emotion engine to analyze the user's emotional state from the speech data. This includes changes in intonation and rhythm, as well as fluctuations in biometric data.

[0530] Step 4:

[0531] Based on the analysis results, the server generates a practice plan optimized for the user's emotions and vocal characteristics. This plan includes specific exercises that take the user's emotional state into account, as well as content aimed at improving expressive techniques.

[0532] Step 5:

[0533] The server sends the generated practice plan to the terminal. The terminal receives the practice plan and presents it to the user visually and audibly. The user performs vocal exercises according to the instructed practice plan.

[0534] Step 6:

[0535] While the user is practicing, the device continuously collects voice and biometric data and sends it to the server in real time. The server generates feedback based on this data.

[0536] Step 7:

[0537] The server generates feedback, including technical improvements to the user's voice and advice on emotional expression. This feedback includes specific suggestions for improvement and advice on how to express emotions effectively.

[0538] Step 8:

[0539] The device provides users with real-time feedback from the server, allowing them to immediately apply it to their practice. For example, it might suggest specific ways to emphasize areas where emotional expression is lacking.

[0540] Step 9:

[0541] The server accumulates all of the user's practice data over the long term and analyzes changes in emotional state and technical growth trends. This data is provided to the user and can be used to improve their next practice session.

[0542] Based on this feedback and progress data, users can set new goals to improve their expressive abilities and continue practicing.
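
The long-term trend analysis in Step 9 can be sketched as a least-squares slope over per-session scores. The function names, the score scale, and the `flat_band` threshold are assumptions for illustration. The disclosure does not state which trend measure the server uses.

```python
def trend_slope(scores):
    """Least-squares slope of score against session index (points per session)."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two sessions are needed")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def describe_trend(scores, flat_band=0.5):
    """Label a score series; flat_band is a hypothetical dead zone around zero."""
    slope = trend_slope(scores)
    if slope > flat_band:
        return "improving"
    if slope < -flat_band:
        return "declining"
    return "steady"


expression_scores = [60, 62, 64, 66]
label = describe_trend(expression_scores)
```

The same function can be applied separately to technical scores and to emotional-expression scores, so that the two trends are reported side by side.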

[0543] (Example 2)

[0544] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0545] Conventional voice training systems primarily focused on voice quality and technical aspects, limiting their ability to improve users' emotional expression. Furthermore, they faced challenges in appropriately evaluating the impact of the user's physical and mental state on the training's effectiveness and incorporating this into the training plan.

[0546] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0547] In this invention, the server includes data collection means for acquiring the user's voice data and biometric data, analysis means for analyzing voice and emotions and evaluating the user's voice quality and emotional state, and generation means for generating a practice plan that utilizes the user's emotions. This makes it possible to improve the user's voice quality and simultaneously enhance their emotional expression.

[0548] "Data collection means" refers to a device or method for acquiring a user's voice data and biometric data.

[0549] "Analysis means" refers to processes and technologies for evaluating the user's voice quality and vocal characteristics from acquired audio data, and further analyzing their emotions.

[0550] "Emotional analysis means" refers to a device or method for identifying a user's emotional state based on voice intonation and biometric data.

[0551] "Generation means" refers to an apparatus or method for generating a personalized practice plan for a user based on the analysis results.

[0552] A "feedback device" is a device or method that provides real-time evaluation of voice and emotional expression in accordance with the generated practice plan.

[0553] "Monitoring means" refers to a device or method for tracking a user's practice history and growth trends, and for visualizing their progress.

[0554] This system aims to support users' voice training and improve their overall abilities, including emotional expression. Users input voice data through a device, and this voice data and biometric data are collected. Data collection is performed using a microphone and heart rate sensor built into the device.

[0555] The device sends this data to the server, which first uses an AI voice analysis engine to analyze the quality, tone, and rhythm of the voice. This voice analysis utilizes machine learning algorithms. Furthermore, an emotion analysis engine analyzes the intonation and rhythm of the voice data, together with vital data and the like, to identify the user's emotional state at that moment.

[0556] Based on the analysis results, the server generates a practice plan optimized for the user. This plan includes content that takes advantage of the user's emotional state during practice. For example, if relaxation is needed, vocal exercises that promote relaxation can be incorporated.
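
A rule-based stand-in for this plan generation is sketched below. In the system described here the plan is produced by a generative AI model, so the lookup table, the `"tense"` label, the score threshold, and the function name `build_practice_plan` are all hypothetical and serve only to show how emotional state and vocal scores could be combined.

```python
# Hypothetical mapping from a weak vocal feature to an exercise.
EXERCISES = {
    "pitch": "slow scale matching against a reference tone",
    "rhythm": "clapping and singing with a metronome",
    "voice_quality": "sustained vowels with steady breath support",
}


def build_practice_plan(feature_scores, emotion, weak_threshold=70.0):
    """Order exercises by weakest feature, with a relaxation item first if tense."""
    plan = []
    if emotion == "tense":
        plan.append("relaxed humming on a comfortable pitch")
    weakest_first = sorted(feature_scores.items(), key=lambda item: item[1])
    for name, score in weakest_first:
        if score < weak_threshold:
            plan.append(EXERCISES.get(name, f"focused drill for {name}"))
    if not plan:
        plan.append("repertoire run-through with expressive phrasing")
    return plan


plan = build_practice_plan(
    {"pitch": 55.0, "rhythm": 80.0, "voice_quality": 65.0}, emotion="tense"
)
```

Placing the relaxation item first reflects the example in the text, where vocal exercises that promote relaxation are incorporated when relaxation is needed.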

[0557] The practice plan is sent from the server to the terminal and presented to the user. The terminal screen displays the practice steps and provides voice guidance. Voice and biometric data during practice are sent back to the server to provide real-time feedback.

[0558] For example, if a user enters a prompt such as, "I want to sing opera arias with more emotion. Please advise me on what kind of practice I should do," the system can instantly generate and present a practice plan that meets that request.

[0559] This system allows users to simultaneously improve their voice skills and emotional expression. Furthermore, by incorporating the latest trends in the music field, it allows for the continuous adoption of new training methods.

[0560] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0561] Step 1:

[0562] The user speaks into the device. The device captures the voice data with its built-in microphone and simultaneously collects biometric data such as heart rate using sensors. Both voice data and biometric data are acquired as input. This provides basic data to understand the user's vocalizations.

[0563] Step 2:

[0564] The terminal transmits the collected voice and biometric data to the server. The data is encrypted using a secure protocol and safely transferred to the server. The input here is the voice and biometric data, which is encrypted for transfer, and the output is the same data delivered to the server in its original data format.

[0565] Step 3:

[0566] The server inputs the received audio data into an AI speech analysis engine. This analysis engine uses deep learning algorithms to analyze the quality, tone, and rhythm of the speech in detail. The input is the audio data sent to the server, and the output is the speech characteristic information as a result of the analysis.

[0567] Step 4:

[0568] The server simultaneously analyzes voice and biometric data using an emotion analysis engine to identify the user's emotional state. The input includes voice intonation, pacing, and heart rate changes, while the output provides an evaluation of the user's emotional state.

[0569] Step 5:

[0570] The server integrates the results of voice analysis and emotion analysis to generate a personalized practice plan that reflects the user's emotions. The input is the analysis results, and the output is an individually customized practice plan. This provides the user with practice content optimized for their needs.

[0571] Step 6:

[0572] The server sends the generated practice plan to the terminal, and the terminal presents the plan to the user. At this time, the practice steps are displayed on the terminal as visual and audio guides. The input is the practice plan, and the output is the training content presented to the user.

[0573] Step 7:

[0574] The user practices according to the practice plan provided on the device. The device collects voice and biometric data again and sends it to the server in real time. The server generates feedback based on this data and notifies the user in real time through the device. The input is the user's practice data, and the output is feedback including areas for improvement and evaluation.
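
The secure transfer in Step 2 is not tied to a particular protocol in this disclosure. As one possible reading, the following sketch assumes TLS and packages voice and biometric data as JSON. The field names and function names are hypothetical, and no network connection is made.

```python
import base64
import json
import ssl


def build_payload(user_id, pcm_bytes, heart_rate_bpm):
    """Serialize voice and biometric data into a JSON message (hypothetical schema)."""
    message = {
        "user_id": user_id,
        "audio_b64": base64.b64encode(pcm_bytes).decode("ascii"),
        "heart_rate_bpm": heart_rate_bpm,
    }
    return json.dumps(message).encode("utf-8")


def make_tls_context():
    """Client-side TLS context with certificate verification; TLS is an assumption."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


payload = build_payload("u1", b"\x00\x01\x02\x03", 72)
context = make_tls_context()
decoded = json.loads(payload.decode("utf-8"))
```

In use, the terminal would wrap its socket with `context.wrap_socket(sock, server_hostname=...)` before sending `payload`, so that the data is encrypted in transit and recovered in its original format on the server.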

[0575] (Application Example 2)

[0576] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0577] Conventional voice training systems focused solely on the technical aspects of voice, failing to adequately improve users' emotional expression. Furthermore, the lack of feedback functions tailored to emotional states made it difficult for users to obtain practice plans that were expressive and emotionally responsive. This left the overall improvement of users' singing abilities as a challenge.

[0578] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0579] In this invention, the server includes data collection means for acquiring user voice data and biometric information, analysis means for analyzing the voice data, evaluating voice quality and vocalization characteristics, and recognizing emotional states, and generation means for generating a personalized practice plan based on the analysis results, including improving emotional expression. This allows the user to receive real-time feedback not only on vocal technique but also on emotional expression, enabling comprehensive improvement of singing ability.

[0580] "Data collection means" refers to a device or method for acquiring a user's voice data and biometric information.

[0581] "Analysis means" refers to a device or method that uses acquired audio data to evaluate the user's voice quality and vocalization characteristics, and further recognizes the user's emotional state from the audio.

[0582] "Generation means" refers to a device or method for creating a personalized practice plan tailored to the user's emotional state based on analysis results, with the aim of improving emotional expression skills.

[0583] A "feedback device" is a device or method that provides the user with real-time suggestions for improvement in voice and emotional expression based on the generated practice plan.

[0584] "Monitoring means" refers to a device or method that tracks a user's practice history and growth trends, visualizes progress data, and evaluates it, including changes in emotional expression.

[0585] The system implementing the present invention is configured to effectively utilize the user's voice data and biometric information to comprehensively improve vocalization and emotional expression.

[0586] First, the user speaks into a device equipped with a microphone and sensors. This device could be a smartphone, tablet, or personal computer. The device collects voice data and biometric information such as heart rate and respiration, and this data is transmitted to a server via the network.

[0587] To process speech in real time, the server uses an AI speech analysis engine to analyze voice quality and vocal characteristics in detail. Furthermore, to recognize emotional states, an emotion engine analyzes the intonation and rhythm of the speech together with the acquired biometric information. These analysis processes are generally performed using GPUs and high-performance computing environments.

[0588] Subsequently, the server's generation mechanism creates an optimal practice plan for the user based on the analysis results. This plan includes practice content aimed at improving expressiveness, taking into account the user's emotional state. The generated practice plan is immediately sent to the terminal and presented to the user. The terminal provides the user with real-time feedback, offering advice on areas for improvement in voice and emotional expression.

[0589] As a concrete example, consider a user practicing opera arias on their days off. The system grasps the emotions embedded in the piece and provides precise advice when the user struggles to express those emotions. In this way, users can simultaneously hone not only their musical technique but also their ability to express emotions richly.

[0590] A concrete example of a prompt for a generative AI model would be: "Please design an application that analyzes voice data and recognizes user emotions in real time. This application will be installed on a consumer robot and will optimize user feedback."

[0591] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0592] Step 1:

[0593] The device acquires voice data and biometric information simultaneously when the user speaks into the microphone. It collects voice signals and data such as heart rate and respiration as input, generating an initial digital voice file and biometric signal data. This information is then transmitted to a server via the network.

[0594] Step 2:

[0595] The server inputs the received audio data into an AI voice analysis engine to analyze voice quality and vocal characteristics. This process involves spectral analysis of the audio waveform to extract features such as pitch, tone, and volume. The analysis output provides detailed data on the technical characteristics of the user's voice.

[0596] Step 3:

[0597] The server uses an emotion engine to identify the user's emotional state from the voice data while simultaneously performing voice analysis. The input includes intonation, rhythm, and biometric data from the voice data. The emotion engine analyzes this data to identify emotional categories (e.g., joy, sadness, surprise). The result is output as data indicating the user's emotional state.

[0598] Step 4:

[0599] The server generates a personalized practice plan based on voice analysis results and emotional state. This planning takes into account the user's current technical ability and emotional needs, personalizing music selection and practice intensity. The generated practice plan is then sent to the device.

[0600] Step 5:

[0601] The device presents the user with a generated practice plan to apply during practice. It uses audio playback and text instructions to show the user specific practice content. The output of this step is visual or auditory information that guides the user through the next session.

[0602] Step 6:

[0603] The device monitors the user's performance during practice and provides real-time feedback from the server. It notifies the user of areas for improvement based on audio data or practice results, encouraging improvements in musical techniques, including emotional expression.

[0604] Step 7:

[0605] The server continuously records the user's practice history and analyzes growth trends. It uses past practice data as input to evaluate practice frequency and results. Based on this data, it generates progress reports that visualize progress, enabling users to self-evaluate their performance.
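By way of illustration only, the feature extraction mentioned in Step 2 above (pitch and volume) can be sketched as follows. This is a minimal sketch under stated assumptions, not the analysis engine of the disclosure: the function names `sine_wave`, `rms_volume`, and `estimate_pitch_hz`, the 80 to 1000 Hz search range, and the use of time-domain autocorrelation in place of the spectral analysis named in Step 2 are all assumptions introduced for this example.

```python
import math


def sine_wave(freq_hz, sample_rate, n_samples, amplitude=1.0):
    """Synthetic test tone standing in for microphone input (hypothetical helper)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def rms_volume(samples):
    """Root-mean-square level of a block of samples, used as a simple volume proxy."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency from the autocorrelation peak.

    Returns None when no positive correlation is found (for example, silence).
    The search range fmin..fmax is an assumed range for a singing voice.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


if __name__ == "__main__":
    tone = sine_wave(200.0, 8000, 800, amplitude=0.5)
    print(rms_volume(tone), estimate_pitch_hz(tone, 8000))
```

The pitch resolution of this approach is limited by the integer lag, so a practical engine would likely interpolate around the peak or use a dedicated pitch tracker.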

[0606] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0607] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0608] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0609] [Fourth Embodiment]

[0610] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0611] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0612] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0613] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0614] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0615] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0616] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0617] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0618] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0619] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0620] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0621] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0622] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0623] The system according to the present invention provides highly personalized practice plans and feedback to support the user's improvement of singing ability. This enables efficient practice based on the user's vocal characteristics, supporting long-term growth and vocal health management.

[0624] The system first collects the user's voice data via the terminal. When the user speaks into the terminal, the voice data is sent to the server in real time. On the server, the received voice data is analyzed by an AI voice analysis engine, and vocal characteristics such as voice quality, pitch, and rhythm are evaluated in detail. Based on this evaluation result, an optimal practice plan is generated for the user.

[0625] The generated practice plan is sent from the server to the terminal and presented to the user through the terminal. This practice plan includes specific exercises and goals that the user should work on, and the degree of achievement and areas for improvement are shown through real-time feedback. Feedback is provided immediately, for example, when the user sings off-key, and is communicated to the user in visual or audible form.

[0626] In addition, the system monitors the user's heart rate and respiratory data in real time and suggests relaxation techniques and warm-up exercises as needed. This information is acquired via sensors, analyzed on a server, and then transmitted to the terminal.
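As a non-limiting illustration of the preceding paragraph, the decision to suggest relaxation techniques or warm-up exercises can be sketched as a simple threshold rule. The function name `suggest_exercises`, the `warmed_up` flag, and the threshold values (100 beats per minute, 20 breaths per minute) are hypothetical and are not taken from the disclosure.

```python
def suggest_exercises(heart_rate_bpm, breaths_per_min, warmed_up,
                      hr_limit=100.0, breath_limit=20.0):
    """Return a list of suggestion labels based on physiological readings.

    hr_limit and breath_limit are assumed thresholds chosen for illustration;
    a real system would calibrate them per user.
    """
    suggestions = []
    # Elevated heart rate or breathing rate triggers a relaxation suggestion.
    if heart_rate_bpm > hr_limit or breaths_per_min > breath_limit:
        suggestions.append("relaxation")
    # A session that has not been preceded by a warm-up gets a warm-up suggestion.
    if not warmed_up:
        suggestions.append("warm-up")
    return suggestions


if __name__ == "__main__":
    print(suggest_exercises(110, 16, warmed_up=True))
```

Keeping the rule as a pure function of the sensor readings makes it straightforward to run on the server and send only the resulting labels to the terminal.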

[0627] The system also accumulates the user's practice history and analyzes long-term growth trends. Based on this analysis, the server generates a dashboard that visualizes the user's progress. This dashboard is presented to the user via their device, giving them a concrete sense of their own progress and helping to keep them motivated.

[0628] For example, in the case of a user practicing opera, the system creates a practice plan focused on a specific part of an aria based on the results of the previous practice session. If there are any problems with pitch or rhythm during practice, feedback is provided instantly, allowing for immediate correction. Through this process, the user can continue practicing while checking their own progress.

[0629] The following describes the processing flow.

[0630] Step 1:

[0631] The user speaks into the device and selects the song or scale they want to practice. The device captures this voice and collects the audio data in real time.

[0632] Step 2:

[0633] The terminal sends the collected audio data to the server. The server inputs the received audio data into the AI voice analysis engine.

[0634] Step 3:

[0635] The server uses an AI voice analysis engine to analyze the voice data. This analysis evaluates vocal characteristics such as voice quality, pitch, volume, and rhythm.

[0636] Step 4:

[0637] The server generates a practice plan optimized for the user based on the analysis results. This practice plan includes specific exercises and areas for improvement.

[0638] Step 5:

[0639] The server sends the generated practice plan to the device. The device receives the practice plan and presents it to the user. This may be displayed visually using icons and text, and voice guidance is also possible.

[0640] Step 6:

[0641] The user follows the instructions on the device to perform the exercises. During practice, sensors record heart rate and respiratory data and send this data to the device.

[0642] Step 7:

[0643] The device sends practice status to the server in real time. The server generates feedback information and points out areas where the user needs to improve.

[0644] Step 8:

[0645] The device provides the user with feedback received from the server. Specifically, it visually or audibly points out issues such as pitch discrepancies or rhythmic inconsistencies.

[0646] Step 9:

[0647] The server accumulates user practice data and analyzes long-term growth trends. Based on this analysis, it generates a dashboard that visualizes progress.

[0648] Step 10:

[0649] Users can check the dashboard through their device to understand their progress and achievements, and use that information to practice towards their next goals.
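For illustration only, the long-term growth trend analyzed in Step 9 above could be summarized by fitting a straight line to per-session scores. The sketch below assumes that each practice session has already been reduced to a single numeric score; the function names `trend_slope` and `summarize_progress` and the dictionary keys are hypothetical.

```python
def trend_slope(scores):
    """Least-squares slope of score against session index (0, 1, 2, ...)."""
    n = len(scores)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def summarize_progress(scores):
    """Collect the values a progress dashboard might display (assumed fields)."""
    return {
        "sessions": len(scores),
        "latest": scores[-1] if scores else None,
        "best": max(scores) if scores else None,
        "slope_per_session": trend_slope(scores),
    }


if __name__ == "__main__":
    print(summarize_progress([60, 62, 64, 66]))
```

A positive slope indicates improvement across sessions; the dashboard could plot the raw scores alongside the fitted line.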

[0650] (Example 1)

[0651] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0652] Conventional singing practice systems lacked sufficient practice planning and feedback based on individual user characteristics, making them ineffective in supporting growth. Furthermore, real-time evaluation necessary for improving musical expression and optimization of practice based on physiological states were difficult.

[0653] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0654] In this invention, the server includes means for collecting user voice data and transmitting the digitized voice information to the server; means for an AI voice analysis engine to evaluate voice characteristics and musical indicators using the voice information; and means for creating a user-specific practice strategy based on the evaluation results obtained by the analysis means. This enables the user to receive a personalized practice plan and real-time feedback, resulting in effective improvement of singing ability.

[0655] "Data collection means" refers to a device or software that has the function of capturing the user's voice, converting it into a digital format, and transmitting it to a server.

[0656] "Analysis means" refers to a system that uses a voice analysis engine on the server side to evaluate the voice quality and musical indicators of voice data.

[0657] "Generation means" refers to a system that has the function of automatically creating practice strategies tailored to individual users based on analyzed data.

[0658] A "feedback mechanism" is a device or function that provides real-time evaluation and suggestions for improvement of the user's practice and performance based on the generated practice strategy.

[0659] A "monitoring system" is a system that monitors the user's physiological data and recommends appropriate warm-up exercises according to their physical and mental state.

[0660] A "visualization tool" is a system that visually displays a user's practice progress and long-term growth, and has functions to boost motivation.

[0661] In this invention, the user operates a system that provides a voice practice environment. The user begins by collecting voice data using the microphone of a dedicated terminal. This terminal is equipped with high-performance voice recognition technology that converts the collected voice into digital data in real time and transmits it to a server over a network.

[0662] Upon receiving audio data, the server utilizes an AI voice analysis engine to analyze the speech. This analysis engine evaluates musical indicators such as voice quality, pitch, and rhythm, and based on these, understands the user's vocal characteristics. Based on the analysis results, the server generates a practice strategy optimized for the user's characteristics. This generation process uses voice processing algorithms and generative AI models.

[0663] The generated practice strategy is sent from the server to the terminal and presented to the user through the terminal's interface. This strategy includes specific exercises and goals, allowing the user to practice while receiving real-time feedback. This feedback is provided as visual and auditory instructions to help with pitch adjustments and rhythm improvements.

[0664] Furthermore, the device can monitor physiological indicators such as heart rate and respiration using sensors. The server analyzes this data and recommends relaxation techniques and warm-up exercises as needed. This reduces the physical burden on the user and provides a safe and effective training environment.

[0665] Furthermore, the server stores the user's practice history in a database and generates a dashboard that visualizes their progress. Users can check this progress information through their devices, which can help them maintain their motivation.

[0666] For example, a user practicing opera could use a prompt like this: "I want to practice opera arias. For this practice session, I want to improve my pitch and rhythm accuracy, so please provide a practice plan and feedback specifically tailored to that."

[0667] This system comprehensively provides a series of processes, from collecting and analyzing audio data to generating practice strategies and providing feedback, supporting the efficient improvement of users' singing abilities.

[0668] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0669] Step 1:

[0670] The user speaks using the device's microphone. The device converts this voice into digital audio data and sends it to the server over the network. The input for this step is the user's raw voice, and the output is digital audio data. Specifically, the device's built-in A / D converter converts the analog audio into a digital signal.

[0671] Step 2:

[0672] The server processes the received digital audio data through an AI voice analysis engine. The analysis engine analyzes the audio data, extracting musical characteristics such as voice quality, pitch, and rhythm, and evaluating them as numerical data. The input for this step is digital audio data, and the output is numerical data representing the user's vocal characteristics. Specifically, a feature extraction algorithm calculates the necessary indicators from the voice waveform.

[0673] Step 3:

[0674] The server generates a personalized practice strategy for each user based on the analyzed vocal characteristics. Here, a generative AI model is used to create a practice plan that includes exercises designed to strengthen the user's weaknesses and areas for improvement. The input for this step is numerical data on the user's vocal characteristics, and the output is a specific practice plan. Specifically, the AI model recommends the most suitable practice content based on past data.

[0675] Step 4:

[0676] The generated practice plan is sent from the server to the terminal and presented to the user visually and audibly. The input here is the practice plan data, and the output is the content presented to the user on the terminal. Specifically, the user interface receives the plan information and performs actions such as screen display and audio output.

[0677] Step 5:

[0678] When the user begins practicing, the device again collects audio and sends it to the server to generate scientific feedback. The server uses an AI analysis engine to perform real-time evaluations and provide feedback on pitch deviations and rhythmic irregularities. The input for this step is the user's audio data during practice, and the output is real-time feedback to the user. Specifically, the feedback content is calculated and sent to the device immediately.

[0679] Step 6:

[0680] The device collects the user's heart rate and respiratory data using sensors and sends it to the server. The server analyzes this physiological data and suggests relaxation and warm-up exercises as needed. The input for this step is physiological data, and the output is suggested exercises. Specifically, the server assesses the stress level and selects relaxation methods.

[0681] Step 7:

[0682] The server continuously records the user's practice history and generates a dashboard that visualizes their progress. The dashboard is displayed on the device, showing the user their progress. The input for this step is the user's practice data, and the output is visualized progress information. Specifically, it extracts information from the database and generates graphs.
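As one possible illustration of the pitch feedback described in Step 5 above, a pitch deviation can be expressed in cents (hundredths of a semitone) relative to a target note and mapped to a short message. This is a sketch under assumptions: the function names `cents_off` and `pitch_feedback`, the 25-cent tolerance, and the message wording are hypothetical, and the sung frequency is assumed to have been estimated already by the analysis engine.

```python
import math


def cents_off(sung_hz, target_hz):
    """Signed deviation of the sung pitch from the target, in cents."""
    return 1200.0 * math.log2(sung_hz / target_hz)


def pitch_feedback(sung_hz, target_hz, tolerance_cents=25.0):
    """Map a pitch deviation to a feedback message (assumed tolerance)."""
    deviation = cents_off(sung_hz, target_hz)
    if deviation > tolerance_cents:
        return "The pitch is a little high, try lowering it a bit."
    if deviation < -tolerance_cents:
        return "The pitch is a little low, try raising it a bit."
    return "Pitch is on target."


if __name__ == "__main__":
    print(pitch_feedback(450.0, 440.0))
```

Working in cents rather than hertz keeps the tolerance musically uniform across low and high notes.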

[0683] (Application Example 1)

[0684] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0685] For users to efficiently improve their singing ability, personalized practice plans and immediate feedback are crucial. However, conventional technology struggles to provide such a high level of personalization, and there is a particular lack of systems to support practice in home environments. Furthermore, the absence of mechanical devices to effectively support users during vocal practice is also a problem.

[0686] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0687] In this invention, the server includes data acquisition means for acquiring user voice data, analysis means for analyzing the voice data and evaluating the user's acoustic and vocal characteristics, and plan generation means for generating a practice plan customized for the user based on the results evaluated by the analysis means. This provides the user with a suitable practice plan and real-time feedback, making it possible to effectively improve singing ability even in a home environment.

[0688] "Data acquisition means" refers to functions or devices for acquiring user voice data.

[0689] "Analysis means" refers to functions or devices that analyze acquired audio data and evaluate the user's acoustic characteristics and vocalization characteristics.

[0690] A "plan generation means" is a function or device that generates a customized practice plan for the user based on the evaluation results obtained by the analysis means.

[0691] A "feedback provision means" refers to a function or device that provides real-time feedback to the user based on the generated practice plan.

[0692] "Monitoring means" refers to functions or devices that monitor a user's practice history and growth trends, and visualize progress data.

[0693] "Household mechanical device operating means" refers to a function or device that uses household mechanical devices to provide users with immediate feedback based on sound and rhythm when they practice vocal exercises.

[0694] This invention relates to a system that acquires and analyzes user voice data to effectively support user voice practice and generates and provides a personalized practice plan. The system includes data acquisition means for collecting user voice data, which is acquired in real time using a microphone. The acquired voice data is transmitted to a server and analyzed by an AI voice analysis engine.

[0695] The server evaluates the user's acoustic and vocal characteristics through the analysis of audio data. Specifically, it uses analysis software developed in Python or C++, as well as libraries such as TensorFlow and PyTorch. Based on the analysis results, a practice plan suitable for the user is generated by a plan generation mechanism.

[0696] Based on the generated practice plan, real-time feedback is provided to the user via the device. This feedback is presented visually or audibly through the user's home-use device control system. The user's heart rate and respiratory data are also collected via sensors and used when suggesting relaxation techniques and the like.

[0697] As a concrete example, consider a child using this system to practice singing at home. When the child practices singing, the system provides immediate feedback such as, "The pitch is a little high, try lowering it a bit," if the pitch or rhythm is off.

[0698] An example of a prompt to the generative AI model is, "Analyze the user's voice data in real time, evaluate pitch and voice quality, and generate an optimal practice plan." This allows users to continuously improve their singing ability even in a home environment.
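For illustration only, the example prompt in this paragraph could be combined with numerical analysis results before being passed to a generative AI model. The sketch below only assembles the prompt text and does not call any model; the function name `build_plan_prompt`, its parameters, and the layout of the appended findings are assumptions introduced for this example.

```python
BASE_PROMPT = ("Analyze the user's voice data in real time, evaluate pitch and "
               "voice quality, and generate an optimal practice plan.")


def build_plan_prompt(pitch_error_cents, rhythm_error_ms, voice_quality_note):
    """Append measured values to the base instruction (hypothetical format)."""
    findings = (
        f"Measured pitch error: {pitch_error_cents:+.0f} cents. "
        f"Measured rhythm error: {rhythm_error_ms:+.0f} ms. "
        f"Voice quality note: {voice_quality_note}"
    )
    return BASE_PROMPT + "\n" + findings


if __name__ == "__main__":
    print(build_plan_prompt(38.9, -12.0, "slightly breathy in the upper range"))
```

Separating the fixed instruction from the measured values makes it easier to log what the model was asked for each practice session.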

[0699] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0700] Step 1:

[0701] When a user begins singing using the device, the device collects audio data via its microphone. The user's voice waveform is obtained as input and sent to the server.

[0702] Step 2:

[0703] The server passes the received audio data to an AI speech analysis engine, which analyzes the acoustic and vocal characteristics. Specifically, it uses FFT (Fast Fourier Transform) to analyze the frequency components of the audio signal and extract pitch and rhythm. The output of this analysis is data on the quality, pitch, and rhythm of the user's voice.

[0704] Step 3:

[0705] Based on the analysis results, the server generates a customized practice plan for the user using a plan generation mechanism. The input is audio analysis data, and the output is a list of practice items and goals suitable for the user. This process is performed using a generative AI model, and prompts such as "Analyze the user's audio data in real time, evaluate pitch and voice quality, and generate the optimal practice plan" are used to determine practice content that takes the user's progress into consideration.

[0706] Step 4:

[0707] The device receives the generated practice plan and provides real-time presentation and feedback to the user. Specifically, real-time guidance is provided to the user through visual displays and voice messages. Voice data and the practice plan are used as input for feedback, and progress and areas for improvement are communicated to the user as output.

[0708] Step 5:

[0709] The device monitors the user's heart rate and respiratory data in real time and suggests relaxation techniques as needed. Input is data acquired by physiological data sensors, and output provides the user with instructions and suggestions regarding relaxation.

[0710] Step 6:

[0711] The server stores the user's practice history and monitors long-term growth trends. Past practice data is stored on the server as input, and information visualizing the growth trend is generated on a dashboard for the user as output. This allows users to visually understand their own progress.
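As a non-limiting illustration of the FFT-based frequency analysis mentioned in Step 2 above, the following sketch implements a textbook recursive radix-2 FFT and reports the strongest frequency component of a block of samples. The function names `sine_wave`, `fft`, and `dominant_frequency_hz` are hypothetical, the block length is assumed to be a power of two, and the strongest component is only a rough stand-in for sung pitch, since a real voice contains harmonics that may exceed the fundamental.

```python
import cmath
import math


def sine_wave(freq_hz, sample_rate, n_samples, amplitude=1.0):
    """Synthetic test tone standing in for microphone input (hypothetical helper)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency_hz(samples, sample_rate):
    """Frequency of the largest-magnitude bin below the Nyquist frequency."""
    spectrum = fft(samples)
    half = len(samples) // 2
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(samples)


if __name__ == "__main__":
    tone = sine_wave(440.0, 8192, 1024)
    print(dominant_frequency_hz(tone, 8192))
```

The frequency resolution equals the sample rate divided by the block length, so longer blocks give finer pitch estimates at the cost of added latency in real-time feedback.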

[0712] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0713] This invention enhances the effectiveness of vocal training by adding a function to recognize the user's emotions to a system that provides users with personalized practice plans and real-time feedback during voice training. This aims to improve not only the quality of the user's voice but also their overall singing ability, including emotional expression.

[0714] The system first collects the user's voice and biometric data via the terminal. When the user speaks into the terminal, the voice data is sent to the server. Once the voice data reaches the server, it is input into an AI voice analysis engine, where the voice quality and vocal characteristics are analyzed in detail.

[0715] In addition, this system incorporates an emotion engine that analyzes the user's emotional state from voice data. The emotion engine identifies the user's emotions based on factors such as voice intonation and rhythm, as well as heart rate fluctuations obtained from biometric data.
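For illustration only, one commonly used measure of heart rate fluctuation is RMSSD, the root mean square of successive differences between beat-to-beat intervals. The disclosure does not specify which measure the emotion engine uses, so the choice of RMSSD, the function name `rmssd_ms`, and the assumption that the sensor supplies beat-to-beat intervals in milliseconds are all assumptions made for this sketch.

```python
import math


def rmssd_ms(rr_intervals_ms):
    """Root mean square of successive differences of beat-to-beat intervals.

    Returns 0.0 when fewer than two intervals are available.
    """
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    if not diffs:
        return 0.0
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


if __name__ == "__main__":
    print(rmssd_ms([800, 810, 790, 805]))
```

A value like this could be passed to the emotion engine as one numeric feature alongside the intonation and rhythm features extracted from the voice data.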

[0716] Based on the analysis results, the server generates a personalized practice plan tailored to the user's emotions and voice quality. This practice plan includes specific exercises that utilize the user's current emotional state, enabling them to improve their emotional expression.

[0717] The generated practice plan is sent from the server to the device and presented to the user on the device. The device also receives feedback from the server during practice and notifies the user in real time. The feedback includes not only areas for improvement in voice but also evaluations of the emotions being expressed.

[0718] For example, if a user is practicing an opera aria, the system will grasp the emotions embedded in the piece and provide a practice plan based on that. If the user struggles with expressing those emotions at a certain stage, the system can provide precise advice based on the analysis results of its emotion engine. In this way, users can hone not only their musical technique but also their expressive abilities simultaneously.

[0719] The following describes the processing flow.

[0720] Step 1:

[0721] The user vocalizes toward the device with emotional expression. The device collects biometric data, such as the user's heart rate, along with the voice data.

[0722] Step 2:

[0723] The device sends the collected voice and biometric data to the server. The server inputs the received data into its AI voice analysis engine and begins the analysis.

[0724] Step 3:

[0725] The server performs speech analysis to evaluate vocal characteristics such as voice quality, pitch, and rhythm. Simultaneously, it uses an emotion engine to analyze the user's emotional state from the speech data. This includes changes in intonation and rhythm, as well as fluctuations in biometric data.

[0726] Step 4:

[0727] Based on the analysis results, the server generates a practice plan optimized for the user's emotions and vocal characteristics. This plan includes specific exercises that take the user's emotional state into account, as well as content aimed at improving expressive techniques.

[0728] Step 5:

[0729] The server sends the generated practice plan to the terminal. The terminal receives the practice plan and presents it to the user visually and audibly. The user performs vocal exercises according to the instructed practice plan.

[0730] Step 6:

[0731] While the user is practicing, the device continuously collects voice and biometric data and sends it to the server in real time. The server generates feedback based on this data.

[0732] Step 7:

[0733] The server generates feedback, including technical improvements to the user's voice and advice on emotional expression. This feedback includes specific suggestions for improvement and advice on how to express emotions effectively.

[0734] Step 8:

[0735] The device provides users with real-time feedback from the server, allowing them to immediately apply it to their practice. For example, it might suggest specific ways to emphasize areas where emotional expression is lacking.

[0736] Step 9:

[0737] The server accumulates all of the user's practice data over the long term and analyzes changes in emotional state and technical growth trends. This data is provided to the user and can be used to improve their next practice session.

[0738] Based on this feedback and progress data, users can set new goals to improve their expressive abilities and continue practicing.
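The growth-trend analysis in Step 9 could be pictured as a least-squares slope over per-session scores. The following is a minimal sketch under stated assumptions: the `SessionRecord` fields and the 0-100 scoring scale are hypothetical and do not appear in the disclosure, which does not specify how trends are computed.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SessionRecord:
    """One practice session; the score fields are hypothetical summaries."""
    session_index: int
    technique: float   # assumed 0-100 technical score
    expression: float  # assumed 0-100 emotional-expression score


def slope(xs: list[float], ys: list[float]) -> float:
    """Ordinary least-squares slope of ys against xs (0.0 if undefined)."""
    if len(xs) < 2:
        return 0.0
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def growth_trends(history: list[SessionRecord]) -> dict[str, float]:
    """Score change per session for each tracked dimension."""
    xs = [float(r.session_index) for r in history]
    return {
        "technique": slope(xs, [r.technique for r in history]),
        "expression": slope(xs, [r.expression for r in history]),
    }


# Illustrative data: technique improves steadily, expression stays flat.
history = [
    SessionRecord(1, 60.0, 50.0),
    SessionRecord(2, 62.0, 50.0),
    SessionRecord(3, 64.0, 50.0),
]
trends = growth_trends(history)
```

A positive slope suggests improvement across sessions, and a slope near zero could prompt the user to set a new goal for that dimension.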

[0739] (Example 2)

[0740] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0741] Conventional voice training systems primarily focused on voice quality and technical aspects, limiting their ability to improve users' emotional expression. Furthermore, they faced challenges in appropriately evaluating the impact of the user's physical and mental state on the training's effectiveness and incorporating this into the training plan.

[0742] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0743] In this invention, the server includes data collection means for acquiring the user's voice data and biometric data, analysis means for analyzing voice and emotions and evaluating the user's voice quality and emotional state, and generation means for generating a practice plan that utilizes the user's emotions. This makes it possible to improve the user's voice quality and simultaneously enhance their emotional expression.

[0744] "Data collection means" refers to a device or method for acquiring a user's voice data and biometric data.

[0745] "Analysis means" refers to a device or method for evaluating the user's voice quality and vocal characteristics from the acquired voice data, and for further analyzing the user's emotions.

[0746] "Emotion analysis means" refers to a device or method for identifying a user's emotional state based on voice intonation and biometric data.

[0747] "Generation means" refers to an apparatus or method for generating a personalized practice plan for a user based on the analysis results.

[0748] "Feedback means" refers to a device or method that provides real-time evaluation of voice and emotional expression in accordance with the generated practice plan.

[0749] "Monitoring means" refers to a device or method for tracking a user's practice history and growth trends, and for visualizing their progress.

[0750] This system aims to support users' voice training and improve their overall abilities, including emotional expression. Users input voice data through a device, and this voice data and biometric data are collected. Data collection is performed using a microphone and heart rate sensor built into the device.

[0751] The device sends this data to the server, which first uses an AI voice analysis engine to analyze the quality, tone, and rhythm of the voice. This voice analysis utilizes machine learning algorithms. Furthermore, an emotion analysis engine analyzes intonation, rhythm, vital data, etc., from the voice data to identify the user's emotional state at that moment.

[0752] Based on the analysis results, the server generates a practice plan optimized for the user. This plan includes content that takes advantage of the user's emotional state during practice. For example, if relaxation is needed, vocal exercises that promote relaxation can be incorporated.
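The plan generation described here could be pictured as a small rule set. This is only an illustrative sketch: the disclosure does not state how the plan is derived, and the emotion label `"tense"`, the 0.8 thresholds, and the exercise names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisResult:
    """Hypothetical summary of the voice and emotion analysis."""
    pitch_accuracy: float   # assumed range 0.0-1.0
    rhythm_accuracy: float  # assumed range 0.0-1.0
    emotion: str            # assumed label such as "tense" or "calm"


@dataclass
class PracticePlan:
    exercises: list[str] = field(default_factory=list)


def generate_plan(result: AnalysisResult) -> PracticePlan:
    """Pick exercises from the analysis; every rule below is an assumption."""
    plan = PracticePlan()
    if result.emotion == "tense":
        # Relaxation comes first when the user appears to need it.
        plan.exercises.append("slow breathing with lip trills")
    if result.pitch_accuracy < 0.8:
        plan.exercises.append("scale practice against a reference tone")
    if result.rhythm_accuracy < 0.8:
        plan.exercises.append("metronome clapping and counting")
    if not plan.exercises:
        plan.exercises.append("expressive phrasing on a familiar song")
    return plan


plan = generate_plan(AnalysisResult(pitch_accuracy=0.9, rhythm_accuracy=0.7, emotion="tense"))
```

In practice the rules might be replaced by a learned model or a generative AI prompt; the sketch only shows how emotional state and vocal characteristics can both feed one plan.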

[0753] The practice plan is sent from the server to the terminal and presented to the user. The terminal screen displays the practice steps and provides voice guidance. Voice and biometric data during practice are sent back to the server to provide real-time feedback.

[0754] For example, if a user enters a prompt such as, "I want to sing opera arias with more emotion. Please advise me on what kind of practice I should do," the system can instantly generate and present a practice plan that meets that request.
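A user request like the one above might be combined with the analysis results before it is passed to a generative model. The sketch below assumes a hypothetical `build_plan_prompt` helper and an invented prompt layout; the disclosure does not define the prompt format.

```python
def build_plan_prompt(user_request: str, voice_summary: dict[str, float], emotion: str) -> str:
    """Assemble a text prompt; the wording and layout are illustrative only."""
    lines = [
        "You are a voice training assistant.",
        f"User request: {user_request}",
        f"Detected emotional state: {emotion}",
        "Voice analysis summary:",
    ]
    # Sort metric names so the prompt is stable across runs.
    for name, value in sorted(voice_summary.items()):
        lines.append(f"- {name}: {value:.2f}")
    lines.append("Propose a step-by-step practice plan.")
    return "\n".join(lines)


prompt = build_plan_prompt(
    "I want to sing opera arias with more emotion.",
    {"pitch_accuracy": 0.82, "rhythm_accuracy": 0.9},
    "tense",
)
```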

[0755] This system allows users to simultaneously improve their voice skills and emotional expression. Furthermore, by incorporating the latest trends in the music field, it allows for the continuous adoption of new training methods.

[0756] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0757] Step 1:

[0758] The user speaks into the device. The device captures the voice data with its built-in microphone and simultaneously collects biometric data such as heart rate using sensors. Both voice data and biometric data are acquired as input. This provides basic data to understand the user's vocalizations.

[0759] Step 2:

[0760] The terminal transmits the collected voice and biometric data to the server. The data is encrypted and transferred safely over a secure protocol. The input here is the voice and biometric data to be encrypted, and the output is that data delivered to the server in its original data format.

[0761] Step 3:

[0762] The server inputs the received audio data into an AI speech analysis engine. This analysis engine uses deep learning algorithms to analyze the quality, tone, and rhythm of the speech in detail. The input is the audio data sent to the server, and the output is the speech characteristic information as a result of the analysis.

[0763] Step 4:

[0764] The server simultaneously analyzes voice and biometric data using an emotion analysis engine to identify the user's emotional state. The input includes voice intonation, pacing, and heart rate changes, while the output provides an evaluation of the user's emotional state.

[0765] Step 5:

[0766] The server integrates the results of voice analysis and emotion analysis to generate a personalized practice plan that reflects the user's emotions. The input is the analysis results, and the output is an individually customized practice plan. This provides the user with practice content optimized for their needs.

[0767] Step 6:

[0768] The server sends the generated practice plan to the terminal, and the terminal presents the plan to the user. At this time, the practice steps are displayed on the terminal as visual and audio guides. The input is the practice plan, and the output is the training content presented to the user.

[0769] Step 7:

[0770] The user practices according to the practice plan provided on the device. The device collects voice and biometric data again and sends it to the server in real time. The server generates feedback based on this data and notifies the user in real time through the device. The input is the user's practice data, and the output is feedback including areas for improvement and evaluation.
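The real-time feedback in Step 7 could be sketched as a stream that turns each incoming frame into a short message. The sketch assumes that a per-frame sung pitch and a target pitch from the practice plan are already available, and the 30-cent tolerance is a hypothetical value; only the cents formula (1200 times the base-2 logarithm of the frequency ratio) is standard.

```python
import math
from typing import Iterable, Iterator


def cents_off(sung_hz: float, target_hz: float) -> float:
    """Signed pitch deviation in cents (100 cents = one semitone)."""
    return 1200.0 * math.log2(sung_hz / target_hz)


def feedback_stream(
    frames: Iterable[tuple[float, float]],
    tolerance_cents: float = 30.0,  # assumed tolerance, not from the source
) -> Iterator[str]:
    """Yield one feedback message per (sung_hz, target_hz) frame as it arrives."""
    for sung_hz, target_hz in frames:
        deviation = cents_off(sung_hz, target_hz)
        if deviation > tolerance_cents:
            yield f"sharp by {deviation:.0f} cents"
        elif deviation < -tolerance_cents:
            yield f"flat by {-deviation:.0f} cents"
        else:
            yield "on pitch"


# Three illustrative frames: in tune, one octave high, one octave low.
messages = list(feedback_stream([(440.0, 440.0), (880.0, 440.0), (220.0, 440.0)]))
```

Because the function is a generator, messages can be delivered to the user while later frames are still being recorded.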

[0771] (Application Example 2)

[0772] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0773] Conventional voice training systems focused solely on the technical aspects of voice, failing to adequately improve users' emotional expression. Furthermore, the lack of feedback functions tailored to emotional states made it difficult for users to obtain practice plans that were expressive and emotionally responsive. This left the overall improvement of users' singing abilities as a challenge.

[0774] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0775] In this invention, the server includes data collection means for acquiring user voice data and biometric information, analysis means for analyzing the voice data, evaluating voice quality and vocalization characteristics, and recognizing emotional states, and generation means for generating, based on the analysis results, a personalized practice plan that includes improvement of emotional expression. This allows the user to receive real-time feedback not only on vocal technique but also on emotional expression, enabling comprehensive improvement of singing ability.

[0776] "Data collection means" refers to a device or method for acquiring a user's voice data and biometric information.

[0777] "Analysis means" refers to a device or method that uses acquired audio data to evaluate the user's voice quality and vocalization characteristics, and further recognizes the user's emotional state from the audio.

[0778] "Generation means" refers to a device or method for creating a personalized practice plan tailored to the user's emotional state based on analysis results, with the aim of improving emotional expression skills.

[0779] "Feedback means" refers to a device or method that provides the user with real-time suggestions for improvement in voice and emotional expression based on the generated practice plan.

[0780] "Monitoring means" refers to a device or method that tracks a user's practice history and growth trends, visualizes progress data, and evaluates it, including changes in emotional expression.

[0781] The system implementing the present invention is configured to effectively utilize the user's voice data and biometric information to comprehensively improve vocalization and emotional expression.

[0782] First, the user speaks into a device equipped with a microphone and sensors. This device could be a smartphone, tablet, or personal computer. The device collects voice data and biometric information such as heart rate and respiration, and this data is transmitted to a server via the network.
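The data sent from the device to the server could be packaged as a single message that carries both the audio and the biometric readings. The following sketch assumes a JSON message with base64-encoded audio; the `PracticeSample` fields are hypothetical, and transport security is outside the scope of the example.

```python
import base64
import json
from dataclasses import asdict, dataclass


@dataclass
class PracticeSample:
    """One upload from the device; all field names are hypothetical."""
    user_id: str
    sample_rate_hz: int
    audio_pcm: bytes            # raw audio bytes captured by the microphone
    heart_rate_bpm: float
    respiration_rate_bpm: float


def encode_sample(sample: PracticeSample) -> str:
    """Serialize to JSON, base64-encoding the binary audio field."""
    payload = asdict(sample)
    payload["audio_pcm"] = base64.b64encode(sample.audio_pcm).decode("ascii")
    return json.dumps(payload)


def decode_sample(message: str) -> PracticeSample:
    """Inverse of encode_sample, as the server side might run it."""
    payload = json.loads(message)
    payload["audio_pcm"] = base64.b64decode(payload["audio_pcm"])
    return PracticeSample(**payload)


sample = PracticeSample("user-1", 16000, b"\x00\x01\xff", 72.0, 14.5)
restored = decode_sample(encode_sample(sample))
```

Keeping voice and biometric readings in one message lets the server align them in time when it runs the voice and emotion analysis.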

[0783] To process speech in real time, the server uses an AI speech analysis engine to analyze voice quality and vocal characteristics in detail. Furthermore, to recognize emotional states, an emotion engine analyzes the intonation and rhythm of the speech together with the acquired biometric information. These analysis processes are generally performed using GPUs and high-performance computing environments.

[0784] Subsequently, the server's generation mechanism creates an optimal practice plan for the user based on the analysis results. This plan includes practice content aimed at improving expressiveness, taking into account the user's emotional state. The generated practice plan is immediately sent to the terminal and presented to the user. The terminal provides the user with real-time feedback, offering advice on areas for improvement in voice and emotional expression.

[0785] As a concrete example, consider a user practicing opera arias on their days off. The system grasps the emotions embedded in the piece and provides precise advice when the user struggles to express those emotions. In this way, users can simultaneously hone not only their musical technique but also their ability to express emotions richly.

[0786] A concrete example of a prompt for a generative AI model would be: "Please design an application that analyzes voice data and recognizes user emotions in real time. This application will be installed on a consumer robot and will optimize user feedback."

[0787] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0788] Step 1:

[0789] The device acquires voice data and biometric information simultaneously when the user speaks into the microphone. It collects voice signals and data such as heart rate and respiration as input, generating an initial digital voice file and biometric signal data. This information is then transmitted to a server via the network.

[0790] Step 2:

[0791] The server inputs the received audio data into an AI voice analysis engine to analyze voice quality and vocal characteristics. This process involves spectral analysis of the audio waveform to extract features such as pitch, tone, and volume. The analysis output provides detailed data on the technical characteristics of the user's voice.

[0792] Step 3:

[0793] The server uses an emotion engine to identify the user's emotional state from the voice data while simultaneously performing voice analysis. The input includes intonation, rhythm, and biometric data from the voice data. The emotion engine analyzes this data to identify emotional categories (e.g., joy, sadness, surprise). The result is output as data indicating the user's emotional state.

[0794] Step 4:

[0795] The server generates a personalized practice plan based on voice analysis results and emotional state. This planning takes into account the user's current technical ability and emotional needs, personalizing music selection and practice intensity. The generated practice plan is then sent to the device.

[0796] Step 5:

[0797] The device presents the user with a generated practice plan to apply during practice. It uses audio playback and text instructions to show the user specific practice content. The output of this step is visual or auditory information that guides the user through the next session.

[0798] Step 6:

[0799] The device monitors the user's performance during practice and provides real-time feedback from the server. It notifies the user of areas for improvement based on audio data or practice results, encouraging improvements in musical techniques, including emotional expression.

[0800] Step 7:

[0801] The server continuously records the user's practice history and analyzes growth trends. It uses past practice data as input to evaluate practice frequency and results. Based on this data, it generates progress reports that visualize progress, enabling users to self-evaluate their performance.
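The feature extraction in Step 2 could be illustrated with two simple measurements: volume and pitch. Note that this sketch swaps in time-domain autocorrelation for the spectral analysis the text mentions, because it fits in a few lines of standard-library Python; the function names and the 80-1000 Hz search range are assumptions, and real recordings would likely need windowing and a more robust estimator.

```python
import math


def rms_volume(samples: list[float]) -> float:
    """Root-mean-square amplitude, a simple proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples: list[float], sample_rate: int,
                      fmin: float = 80.0, fmax: float = 1000.0) -> float:
    """Pick the lag with the largest autocorrelation inside [fmin, fmax]."""
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(len(samples) - 1, int(sample_rate / fmin))
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # Return 0.0 when no positive correlation was found (e.g. silence).
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic 200 Hz tone, 0.25 s at 8 kHz, standing in for recorded voice.
RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(2000)]
features = {
    "pitch_hz": estimate_pitch_hz(tone, RATE),
    "volume_rms": rms_volume(tone),
}
```

Features of this kind could then be stored per session, so that the progress reports in Step 7 have something concrete to compare over time.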

[0802] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0803] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions is input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
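The input and output relationship described for the data generation model 58 could be expressed as a small interface. The names `InferenceRequest`, `DataGenerationModel`, and `EchoModel` are hypothetical; the stand-in model below calls no real service and only illustrates that a prompt and inference data go in and a result comes out.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class InferenceRequest:
    """A prompt plus optional inference data; field names are assumptions."""
    prompt: str
    text: str = ""
    audio: bytes = b""
    image: bytes = b""


class DataGenerationModel(Protocol):
    """Anything that can run inference on a request and return text."""
    def infer(self, request: InferenceRequest) -> str: ...


class EchoModel:
    """Stand-in model that only reports what it received."""
    def infer(self, request: InferenceRequest) -> str:
        return (f"[{request.prompt}] text={len(request.text)} chars, "
                f"audio={len(request.audio)} bytes")


def run_inference(model: DataGenerationModel, request: InferenceRequest) -> str:
    """Pass the request to whichever model implementation is supplied."""
    return model.infer(request)


result = run_inference(EchoModel(), InferenceRequest("Summarize", text="la la", audio=b"\x00\x01"))
```

Writing the caller against the interface rather than a specific service would let the model be hosted on the server, on the robot, or on an external device.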

[0804] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0805] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0806] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0807] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0808] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the closer an emotion lies to the outside of the Emotion Map 400, the more visible it becomes (that is, the more it is expressed in actions).

[0809] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
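The layout of the emotion map described above (left and right halves, upper and lower sides, inner and outer rings) could be captured with plain coordinates. This is an illustrative sketch only: the coordinate convention, the `outer_threshold` value, and the labels returned are assumptions, and no actual emotion positions from Figure 9 are reproduced.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MapPoint:
    """A position on the map, with the center at (0, 0); convention assumed."""
    x: float  # negative = left half, positive = right half
    y: float  # positive = upper side, negative = lower side


def describe(point: MapPoint, outer_threshold: float = 0.5) -> dict[str, str]:
    """Classify a position by half, side, and ring of the emotion map."""
    if point.x > 0:
        origin = "situation"   # right half: situational judgment dominates
    elif point.x < 0:
        origin = "response"    # left half: sensation dominates
    else:
        origin = "both"        # vertical axis: both sources contribute
    if point.y > 0:
        valence = "pleasure"
    elif point.y < 0:
        valence = "displeasure"
    else:
        valence = "neutral"
    radius = math.hypot(point.x, point.y)
    layer = "action" if radius >= outer_threshold else "inner state"
    return {"origin": origin, "valence": valence, "layer": layer}


info = describe(MapPoint(0.8, -0.3))
```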

[0810] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0811] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
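Two parts of this description can be sketched without a neural network: picking the user's emotion from the emotion values, and measuring how similar the values of neighboring emotions are. The emotion values and the neighbor pairs below are invented for illustration, and the squared-difference penalty is one possible way to express "close emotions have similar values," not the training objective stated in the disclosure.

```python
def dominant_emotion(emotion_values: dict[str, float]) -> tuple[str, float]:
    """Return the emotion with the highest value and that value."""
    if not emotion_values:
        raise ValueError("emotion_values must not be empty")
    name = max(emotion_values, key=lambda k: emotion_values[k])
    return name, emotion_values[name]


def neighbor_penalty(emotion_values: dict[str, float],
                     neighbors: list[tuple[str, str]]) -> float:
    """Sum of squared value differences between emotions that sit close together."""
    return sum((emotion_values[a] - emotion_values[b]) ** 2 for a, b in neighbors)


# Hypothetical network output and hypothetical adjacency on the map.
values = {"reassured": 0.8, "calm": 0.7, "confident": 0.75, "anxious": 0.1}
neighbors = [("reassured", "calm"), ("calm", "confident")]

top = dominant_emotion(values)
penalty = neighbor_penalty(values, neighbors)
```

A small penalty indicates that neighboring emotions received similar values, which matches the behavior described for Figure 10.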

[0812] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0813] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0814] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0815] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0816] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0817] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0818] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0819] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0820] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0821] The descriptions and illustrations presented above are detailed explanations of the parts related to the technology of this disclosure and are merely examples of that technology. For example, the above descriptions of structure, function, operation, and effect are descriptions of examples of the structure, function, operation, and effect of the parts related to the technology of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements added, or elements replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technology of this disclosure. Furthermore, to avoid confusion and facilitate understanding of the parts related to the technology of this disclosure, explanations of common technical knowledge that need no special explanation to enable implementation of the technology have been omitted from the descriptions and illustrations presented above.

[0822] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0823] The following is further disclosed regarding the embodiments described above.

[0824] (Claim 1)

[0825] A data collection means for acquiring user voice data,

[0826] An analysis means for analyzing the voice data and evaluating the user's voice quality and vocalization characteristics,

[0827] A generation means for generating a personalized practice plan for the user based on the analysis results obtained by the analysis means,

[0828] A feedback means for providing real-time feedback according to the practice plan generated by the generation means, and

[0829] A monitoring means for monitoring the user's practice history and growth trends and visualizing progress data,

[0830] A system comprising the above means.

[0831] (Claim 2)

[0832] The system according to claim 1, wherein the feedback means monitors the user's heart rate and respiratory data and suggests relaxation techniques.

[0833] (Claim 3)

[0834] The system according to claim 1, wherein the generation means analyzes the latest trends in the music industry and takes them into consideration when updating the user's practice plan.

[0835] "Example 1"

[0836] (Claim 1)

[0837] A data collection means for collecting user voice data and transmitting the digitized voice information to a server,

[0838] An analysis means in which an AI voice analysis engine uses the voice information to evaluate voice characteristics and musical indicators,

[0839] A generation means for creating an individual practice strategy for the user based on the evaluation results obtained by the analysis means,

[0840] A feedback means for providing real-time evaluation of the practice content and suggestions for modification according to the practice strategy created by the generation means,

[0841] A monitoring means for evaluating the user's physical and mental state using physiological data and recommending appropriate warm-up exercises before practice, and

[0842] A visualization means for recording the user's practice progress and visually presenting long-term progress,

[0843] A system comprising the above means.

[0844] (Claim 2)

[0845] The system according to claim 1, wherein the feedback means provides detailed correction instructions in real time based on the user's musical expressiveness.

[0846] (Claim 3)

[0847] The system according to claim 1, wherein the generation means analyzes the technical requirements of different music genres and reflects them in the user's practice plan.

[0848] "Application Example 1"

[0849] (Claim 1)

[0850] A data acquisition means for acquiring user voice data,

[0851] An analysis means for analyzing the voice data and evaluating the user's acoustic characteristics and vocalization characteristics,

[0852] A plan generation means for generating a practice plan customized for the user based on the results evaluated by the analysis means,

[0853] A feedback provision means for providing real-time feedback based on the practice plan generated by the plan generation means,

[0854] A monitoring means for monitoring the user's practice history and growth trends and visualizing progress data, and

[0855] A home-use device operating means for providing the user with immediate feedback based on sound and rhythm in a scenario where the user practices vocal exercises using a home-use device,

[0856] A system comprising the above means.

[0857] (Claim 2)

[0858] The system according to claim 1, wherein the feedback provision means monitors the user's physiological data and proposes a relaxation technique.

[0859] (Claim 3)

[0860] The system according to claim 1, wherein the plan generation means analyzes the latest trends in the music field and takes them into consideration when updating the user's practice plan.

[0861] "Example 2 of combining an emotion engine"

[0862] (Claim 1)

[0863] A data collection means for acquiring user voice data and biometric data,

[0864] An analysis means for analyzing the voice data and evaluating the user's voice quality and vocalization characteristics,

[0865] An emotion analysis means for performing emotion analysis, in addition to the analysis by the analysis means, to identify the user's emotional state,

[0866] A generation means for generating, based on the results of the analysis and the emotion analysis, a personalized practice plan for the user that incorporates emotions,

[0867] A feedback means for providing real-time feedback and evaluating voice and emotional expression according to the practice plan generated by the generation means, and

[0868] A monitoring means for monitoring the user's practice history and growth trends and visualizing progress data,

[0869] A system comprising the above means.

[0870] (Claim 2)

[0871] The system according to claim 1, wherein the feedback means monitors the user's heart rate and respiratory data and suggests relaxation techniques.

[0872] (Claim 3)

[0873] The system according to claim 1, wherein the generation means analyzes the latest trends in the music field and takes them into consideration when updating the user's practice plan.

[0874] "Application example 2 when combining with an emotional engine"

[0875] (Claim 1)

[0876] A data collection means for acquiring user voice data and biometric information,

[0877] An analysis means for analyzing the voice data, evaluating the user's voice quality and vocalization characteristics, and recognizing the emotional state from the voice,

[0878] A generation means for generating, based on the analysis results obtained by the analysis means, a personalized practice plan that corresponds to the user's emotional state and includes improvement of expressive ability,

[0879] A feedback means for providing real-time suggestions for improving voice and emotional expression according to the practice plan generated by the generation means, and

[0880] A monitoring means for monitoring the user's practice history and growth trends, visualizing progress data, and evaluating changes in emotional expression,

[0881] A system comprising the above means.

[0882] (Claim 2)

[0883] The system according to claim 1, wherein the feedback means monitors the user's heart rate and respiratory data, suggests relaxation techniques, and provides feedback based on the emotional state.

[0884] (Claim 3)

[0885] The system according to claim 1, wherein the generation means analyzes the latest trends in the music industry and takes them into consideration when updating the user's practice plan, and also takes into consideration trends that reflect emotional expression.

[Explanation of Symbols]

[0886] 10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: a data acquisition means for acquiring user voice data; an analysis means for analyzing the voice data and evaluating the user's acoustic characteristics and vocalization characteristics; a plan generation means for generating a practice plan customized for the user based on the results evaluated by the analysis means; a feedback provision means for providing real-time feedback based on the practice plan generated by the plan generation means; a monitoring means for monitoring the user's practice history and growth trends and visualizing progress data; and a home-use device operating means for providing the user with immediate feedback based on sound and rhythm in a scenario where the user practices vocal exercises using a home-use device.

2. The system according to claim 1, wherein the feedback provision means monitors the user's physiological data and proposes a relaxation technique.

3. The system according to claim 1, wherein the plan generation means analyzes the latest trends in the music field and takes them into consideration when updating the user's practice plan.