system
A voice analysis system evaluates singing ability and emotional state to generate personalized practice plans, addressing the limitations of conventional voice training by providing efficient, at-home solutions for improving singing skills.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-09
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of the chatbot's character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance as a response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] A user who wants to improve their singing skills has limited opportunities to receive appropriate guidance. Conventional voice training relies mainly on face-to-face lessons, which cost and time constraints make difficult for individual users to access. In addition, because few options provide an individual practice plan optimized for the user's characteristics, efficient methods for improving singing ability have been lacking.
Means for Solving the Problems
[0005] This invention provides a system that acquires audio data from a user and evaluates singing ability by analyzing the audio data. Based on the evaluation results, this system generates a practice plan tailored to the user and provides the user with this practice plan. This allows users to receive individually optimized voice training at home at a low cost, enabling them to efficiently improve their singing ability.
[0006] "Audio data" refers to the recorded audio provided by the user to the system, and is digital data containing basic information for evaluating singing ability.
[0007] "Receiving means" refers to technical means for acquiring audio data from a user and converting or storing it in a format that can be processed within the system.
[0008] "Analysis" or "evaluation means" refers to a technology that evaluates the content of received audio data according to indicators such as pitch, rhythm, and voice quality, and quantifies or classifies singing ability.
[0009] "Generation means" refers to technical means for designing practice plans optimized for the user's characteristics based on data obtained from evaluation means.
[0010] "Providing means" refers to an interface or protocol that communicates the generated practice plan to the user and enables the user to access and carry it out.
[0011] "System" refers to a series of processes or devices including receiving means, evaluation means, generation means, and providing means, and describes a mechanism in which these work together to support the improvement of the user's singing ability.
Brief Description of the Drawings
[0012]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.
Modes for Carrying Out the Invention
[0013] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.
[0014] First, the terms used in the following description will be explained.
[0015] In the following embodiments, a processor with a reference number (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.
[0016] In the following embodiments, a RAM (Random Access Memory) with a reference number is a memory in which information is temporarily stored and is used as a work memory by the processor.
[0017] In the following embodiments, a storage with a reference number is one or more non-volatile storage devices that store various programs and various parameters, etc. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), or magnetic tapes, etc.
[0018] In the following embodiments, a communication I / F (Interface) with a reference number is an interface including a communication processor and an antenna, etc. The communication I / F controls communication between multiple computers. Examples of communication standards applied to the communication I / F include wireless communication standards including 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), or Bluetooth (registered trademark), etc.
[0019] In the following embodiments, "A and / or B" is synonymous with "at least one of A and B." That is, "A and / or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and / or B" applies when expressing three or more things linked by "and / or."
[0020] [First Embodiment]
[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0029] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with the specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0033] This invention is a voice analysis system designed to help users improve their singing ability. The system acquires voice data and analyzes pitch, rhythm, and voice quality to provide the user with an optimal practice plan.
[0034] First, the user records their singing voice using a dedicated application. This application runs on smartphones and PCs and lets users input audio data through its user interface. Once recording is complete, the terminal sends the audio data to the server via the internet.
[0035] The server analyzes the received audio data based on an AI model. The AI model evaluates various metrics in the audio and generates analysis results for pitch, rhythm, and voice quality. These results are used to quantify the user's singing ability.
[0036] Next, the server generates an individualized practice plan based on the analysis results and a comparison with the user's past data. This practice plan includes specific exercises and training content designed to strengthen the user's weaknesses and further develop their strengths. The generated plan is sent from the server to the terminal and provided to the user.
[0037] The terminal visually displays the received practice plan to the user within the application. The user can review this feedback and improve their singing technique by incorporating the suggested practice into their daily routine. This allows users to efficiently improve their singing ability without being limited by time or location.
[0038] For example, if a user wants to improve their pitch accuracy in karaoke, they can use this system to analyze their singing voice and receive a pitch-focused practice plan. This allows the user to learn a specific pitch practice approach and focus on their own challenges.
[0039] The following describes the processing flow.
[0040] Step 1:
[0041] The user launches a dedicated application, selects a song they want to sing, and presses the record button on the terminal to record their singing voice.
[0042] Step 2:
[0043] Once recording is complete, the terminal temporarily stores the recorded audio data in digital format. This data is saved with the precision and sample rate required for later analysis.
[0044] Step 3:
[0045] The terminal generates a communication request and sends the stored audio data to the server via the internet.
[0046] Step 4:
[0047] The server inputs the received audio data into an AI model for analysis. The AI model analyzes the audio data and extracts singing indicators such as pitch, rhythm, and vocal quality.
[0048] Step 5:
[0049] The server quantifies the user's singing ability based on the analysis results of the AI model. The resulting numerical score objectively evaluates the user's performance.
[0050] Step 6:
[0051] The server compares the evaluation results with past data and generates a practice plan tailored to the user's characteristics. This plan includes specific practice exercises and recommended songs.
[0052] Step 7:
[0053] The server sends the generated practice plan and evaluation results to the terminal.
[0054] Step 8:
[0055] The terminal displays the received practice plan and evaluation results in the user interface, allowing the user to review the content. The user then uses this information to begin training with the aim of improving their singing ability.
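As a non-limiting illustration, Steps 4 to 6 above might be sketched as follows. The disclosure does not specify a scoring formula, so the 0.0 to 1.0 indicator scale, the unweighted mean, the `Evaluation` class, the `generate_plan` function, and the `EXERCISES` catalogue are all assumptions introduced only for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evaluation:
    """Indicators produced by the analysis (assumed scale: 0.0 to 1.0, higher is better)."""
    pitch: float
    rhythm: float
    voice_quality: float

    def overall(self) -> float:
        # Assumed scoring rule: unweighted mean of the three indicators, scaled to 0-100.
        return round(100 * (self.pitch + self.rhythm + self.voice_quality) / 3, 1)


# Hypothetical exercise catalogue keyed by indicator name.
EXERCISES = {
    "pitch": "Sing scales against a reference tone",
    "rhythm": "Clap and sing along with a metronome",
    "voice_quality": "Breath-support and resonance drills",
}


def generate_plan(current: Evaluation, previous: Optional[Evaluation] = None) -> dict:
    """Pick the weakest indicator as the practice focus and compare with past data."""
    scores = {
        "pitch": current.pitch,
        "rhythm": current.rhythm,
        "voice_quality": current.voice_quality,
    }
    weakest = min(scores, key=scores.get)
    plan = {
        "score": current.overall(),
        "focus": weakest,
        "exercise": EXERCISES[weakest],
    }
    if previous is not None:
        # Change in the overall score relative to the previous evaluation.
        plan["change"] = round(current.overall() - previous.overall(), 1)
    return plan


plan = generate_plan(Evaluation(0.6, 0.9, 0.8), previous=Evaluation(0.5, 0.9, 0.8))
```

In this sketch the plan focuses on the single weakest indicator; an actual implementation could weight indicators differently or recommend songs as described in Step 6.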
[0056] (Example 1)
[0057] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0058] The present invention aims to provide a method for users to effectively improve their voice and singing skills. Current technology makes it difficult to generate training plans based on the individual characteristics of each user, and standard plans are often applied. Therefore, there is a challenge in that efficient training that takes into account the weaknesses and strengths of individual users cannot be achieved.
[0059] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0060] In this invention, the server includes a receiving means for acquiring voice information from a user, an evaluation means for analyzing the received voice information and evaluating singing skills, and a generating means for generating a training plan suitable for the user based on the analysis results. This makes it possible to provide a training plan optimized for each individual user.
[0061] "Voice information" refers to data that records the content of a user's speech and represents it in digital format.
[0062] A "user" is an individual who wishes to have their voice analyzed and trained through this system.
[0063] "Receiving means" refers to a device or program that has the function of acquiring voice information transmitted by a user and incorporating it into the system.
[0064] "Analysis" is the process of extracting and evaluating indicators such as pitch, rhythm, and voice quality from received voice information using a specific algorithm.
[0065] "Singing ability" is a general term for the musical expressiveness and technique possessed by the user.
[0066] "Evaluation means" refers to a device or program that has the function of objectively evaluating a user's singing ability based on data obtained through analysis.
[0067] "Generation means" refers to a device or program that has the function of constructing individual training plans based on evaluation results.
[0068] A "training plan" is a learning program that combines specific exercises and practice content with the aim of improving the user's singing skills.
[0069] "Providing means" refers to a device or program that has the function of presenting the generated training plan to the user visually or audibly.
[0070] A "generative artificial intelligence model" is a model built using machine learning algorithms and used for analyzing voice information.
[0071] This invention is a voice analysis system designed to help users improve their singing skills. This system supports the improvement of musical expressiveness by acquiring and analyzing the user's voice information and providing an appropriate training plan. Embodiments of this invention are described below.
[0072] Users record audio information using a dedicated application installed on their smartphone or PC. This application features a user-friendly interface, allowing users to easily record audio and send it to the system with the press of a button. The recorded audio information is transmitted from the terminal to the server via the internet. During transmission, the data is encrypted to ensure communication security.
[0073] The server stores the received audio information and uses a generative AI model to analyze pitch, rhythm, and voice quality. This AI model utilizes speech recognition technology to extract features from the audio data and evaluate singing ability. After analysis, the server generates a training plan tailored to the user's characteristics based on the evaluation results. This training plan includes specific training content to improve musical skills and is optimized considering the user's past data and similar datasets.
[0074] The generated training plan is sent from the server to the terminal and presented to the user visually within the application. Based on this feedback, the user can practice regularly, enabling them to efficiently improve their singing skills.
[0075] For example, if a user wants to improve their pitch accuracy, they can use the system to analyze their voice information and receive a pitch-specific training plan. By practicing according to this plan, the user can focus on specific musical challenges.
[0076] An example of a prompt would be, "Analyze my singing voice and create a training plan that clearly identifies areas for improvement." Through this prompt, the results of the voice analysis performed by the generative AI model can be optimally utilized.
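As a hedged illustration of how such a prompt might be assembled before being passed to a generative AI model, the sketch below appends analysis indicators to the instruction sentence quoted above. The `build_prompt` function, the layout of the prompt, and the indicator names and values are assumptions for illustration; the disclosure does not define a prompt format.

```python
from typing import Mapping

# Instruction sentence quoted from the example prompt in the text.
DEFAULT_INSTRUCTION = (
    "Analyze my singing voice and create a training plan "
    "that clearly identifies areas for improvement."
)


def build_prompt(metrics: Mapping[str, float], instruction: str = DEFAULT_INSTRUCTION) -> str:
    """Combine the instruction with analysis indicators (assumed layout)."""
    lines = [instruction, "", "Analysis results:"]
    # Sort by indicator name so the prompt is deterministic.
    for name, value in sorted(metrics.items()):
        lines.append(f"- {name}: {value:.2f}")
    return "\n".join(lines)


prompt = build_prompt({"rhythm": 0.9, "pitch": 0.62})
```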
[0077] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0078] Step 1:
[0079] The user launches a dedicated application and records their singing using the microphone on their smartphone or PC. Recording begins when the user presses the record button and ends when the user taps the end button. During this step, the user's voice information is converted from an analog signal to a digital signal and saved on the terminal.
[0080] Step 2:
[0081] The terminal sends the recorded audio information file to the server via the internet. During transmission, the audio information is encrypted to ensure it reaches the server securely. The input for this step is a digital audio file, and the output is stored as secure data on the server.
[0082] Step 3:
[0083] The server decodes the received audio information in preparation for analysis and inputs it into a generative AI model. This model analyzes the audio information and generates indices for pitch, rhythm, and voice quality. The input is the decoded digital audio information, and the output is the analyzed singing skill index data.
[0084] Step 4:
[0085] Based on the analysis results, the server generates a training plan tailored to the user. It designs individually customized training plans using evaluation metrics. This process references historical user data and relevant datasets. The input is the analysis metric data, and the output is the generated training plan.
[0086] Step 5:
[0087] The server sends the generated training plan to the terminal. This plan is converted into a format that can be easily displayed on the user page or within the application. The transmitted data is encrypted to ensure the security of the communication. The input is the training plan data, and the output is the training plan stored on the terminal.
[0088] Step 6:
[0089] The terminal visually displays the received training plan to the user through the application interface. The user can perform daily training by reviewing the training content on the screen and following the instructions. The input for this step is the training plan stored on the terminal, and the output is the training that the user actually carries out.
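As a minimal sketch of the digital audio file produced in Step 1 and sent in Step 2, the example below quantizes samples to 16-bit PCM and wraps them in a WAV container using Python's standard `wave` module. The 44,100 Hz sample rate, the 16-bit mono format, the function names, and the synthetic 440 Hz tone standing in for microphone input are assumptions; the disclosure does not specify a file format.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100  # assumed sample rate in Hz


def encode_wav(samples, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Quantize floats in [-1.0, 1.0] to 16-bit PCM and wrap them in a mono WAV file."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)   # mono
        writer.setsampwidth(2)   # 2 bytes = 16 bits per sample
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)
    return buf.getvalue()


def describe_wav(data: bytes) -> dict:
    """Read back the format parameters of an in-memory WAV file."""
    with wave.open(io.BytesIO(data), "rb") as reader:
        return {
            "channels": reader.getnchannels(),
            "sample_rate": reader.getframerate(),
            "bits": 8 * reader.getsampwidth(),
            "seconds": reader.getnframes() / reader.getframerate(),
        }


# One second of a 440 Hz sine tone as a stand-in for recorded singing.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
wav_bytes = encode_wav(tone)
info = describe_wav(wav_bytes)
```

The resulting bytes are what the terminal would encrypt and transmit; the encryption itself is outside the scope of this sketch.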
[0090] (Application Example 1)
[0091] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0092] In music and singing training, there is a challenge in obtaining real-time feedback tailored to individual vocal characteristics and weaknesses. Especially in self-study settings, a lack of specialized knowledge and training is common, hindering effective practice. Therefore, there is a need to provide means to effectively improve singing technique.
[0093] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0094] In this invention, the server includes a receiving means for acquiring voice data from the user, an evaluation means for analyzing the received voice data and evaluating the voice technique, and a generating means for generating a practice plan suitable for the user based on the analysis results. This enables the robot to record the user's vocalizations on the spot and provide evaluations in real time, thereby allowing for effective and continuous improvement of singing technique.
[0095] "Voice data" is a collection of digital signals that are recordings of the user's speech.
[0096] A "user" is an individual who wishes to improve their singing technique using this invention.
[0097] "Receiving means" refers to the process or device used to acquire audio data from the user.
[0098] "Evaluation means" refers to the process of analyzing received audio data and expressing voice skills using numerical values or indicators based on the results.
[0099] "Generation means" refers to the process of creating individual practice plans based on evaluation results.
[0100] A "practice plan" includes guidelines for individually customized exercises and training designed to improve the user's singing technique.
[0101] "Providing means" refers to the processes and devices used to communicate the generated practice plan to the user.
[0102] A "robot" refers to an autonomous machine or device that provides voice recording and feedback in real time.
[0103] "Real-time" refers to a process where data is processed and feedback is provided within a timeframe close to the moment it is acquired.
[0104] The system for realizing this invention requires a series of processes for acquiring, analyzing, evaluating, and providing feedback on audio data.
[0105] First, voice data is acquired from the user through the robot. The robot uses its built-in microphone to record the user's speech. This recorded data is then sent in real time to a locally built-in AI model.
[0106] The server analyzes the audio data received from the robot. This analysis utilizes generative AI models based on libraries such as TensorFlow (registered trademark) and PyTorch. These models evaluate pitch, rhythm, and voice quality from the audio data and quantify the results.
[0107] Based on the evaluation results and a comparison with past training data, the server generates an optimal training plan for the user. This training plan includes specific training instructions to address the user's weaknesses. A database management system (DBMS) is used in this generation process to compare the current data with historical data.
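The comparison of current data with historical data through a DBMS might look like the following sketch, which uses Python's standard `sqlite3` module with an in-memory database. The `evaluations` table, its columns, the sample rows, and the `compare_with_history` function are hypothetical; the disclosure names neither a schema nor a particular DBMS.

```python
import sqlite3

INDICATORS = ("pitch", "rhythm", "voice_quality")

# In-memory database with a hypothetical schema for past evaluations.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evaluations ("
    "user_id TEXT, recorded_at TEXT, pitch REAL, rhythm REAL, voice_quality REAL)"
)
conn.executemany(
    "INSERT INTO evaluations VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", "2024-11-01", 0.60, 0.80, 0.70),
        ("u1", "2024-11-08", 0.70, 0.80, 0.70),
    ],
)


def compare_with_history(db: sqlite3.Connection, user_id: str, current: dict) -> dict:
    """Return current-minus-historical-average for each indicator (empty if no history)."""
    row = db.execute(
        "SELECT AVG(pitch), AVG(rhythm), AVG(voice_quality) "
        "FROM evaluations WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    if row[0] is None:  # AVG over zero rows yields NULL
        return {}
    return {name: round(current[name] - avg, 3) for name, avg in zip(INDICATORS, row)}


delta = compare_with_history(
    conn, "u1", {"pitch": 0.75, "rhythm": 0.70, "voice_quality": 0.70}
)
```

A negative difference marks an indicator that has fallen below the user's historical average and could be prioritized in the training plan.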
[0108] Finally, the terminal provides the user with the generated practice plan. The robot provides real-time voice feedback using its built-in speaker and visually displays instructions and analysis results on its screen.
[0109] For example, if a user wants to improve the accuracy of a particular pitch, the robot can provide real-time feedback based on the recorded data, offering specific advice such as, "Let's aim for a slightly higher pitch." In this process, the prompt message would include instructions such as, "Analyze the following audio data and evaluate the pitch. Based on the analysis results, generate feedback and return specific advice regarding the pitch."
[0110] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0111] Step 1:
[0112] The user records their voice using the robot's built-in microphone. The input is the user's voice, and the output is saved as audio data. The robot converts the audio data into a digital format and prepares it for analysis.
[0113] Step 2:
[0114] The terminal sends the recorded audio data to the server for analysis by a generative AI model. The input is audio data, and the AI model evaluates pitch, rhythm, and voice quality. The output is the analyzed data, with each metric represented numerically. Specifically, frameworks such as TensorFlow and PyTorch are used to extract features from the audio data.
[0115] Step 3:
[0116] The server generates an optimal training plan for the user based on the analysis results. The input consists of the analyzed data and past training data, and the output is a customized training plan. Data processing involves comparison with similar data sets, and individual training instructions are incorporated.
[0117] Step 4:
[0118] The terminal provides the user with the generated practice plan and offers audio and visual feedback. The input is the practice plan, and the output is notification information for the user. Specifically, it provides audio feedback through its built-in speaker and displays the practice content on the screen.
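As one possible sketch of the pitch feedback described in this application example, the deviation of a sung note from its target can be expressed in cents, where 100 cents equal one semitone and the deviation is 1200 × log2(sung frequency / target frequency). The 20-cent tolerance, the function names, and all advice strings other than "Let's aim for a slightly higher pitch." are assumptions for illustration.

```python
import math


def cents_off(sung_hz: float, target_hz: float) -> float:
    """Signed pitch deviation in cents (negative means the sung note is flat)."""
    return 1200 * math.log2(sung_hz / target_hz)


def pitch_advice(sung_hz: float, target_hz: float, tolerance_cents: float = 20.0) -> str:
    """Map the deviation to a short spoken-style message (assumed tolerance)."""
    deviation = cents_off(sung_hz, target_hz)
    if deviation < -tolerance_cents:
        return "Let's aim for a slightly higher pitch."
    if deviation > tolerance_cents:
        return "Let's aim for a slightly lower pitch."
    return "Your pitch is on target."


advice = pitch_advice(430.0, 440.0)  # about 40 cents flat of A4
```

How the sung frequency is estimated from the audio data is left to the analysis model and is not covered by this sketch.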
[0119] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0120] This invention provides a technology that combines an emotion engine with a voice analysis system that supports the improvement of users' singing abilities. The system acquires the user's singing voice and generates evaluations and practice plans based on that voice data. At the same time, it identifies the user's emotional state, provides feedback, and adjusts the plan accordingly.
[0121] Users record their singing voices using a dedicated application. This recording data is temporarily stored on the terminal and then sent to the server. The server analyzes the received data using an AI model to evaluate singing ability. It quantifies pitch, rhythm, voice quality, etc., and generates a numerical score.
[0122] Furthermore, the server analyzes changes in the user's voice tone, pitch, and speed, allowing the emotion engine to identify the user's emotional state. Based on this emotional data, it understands the user's mood while singing. This information is useful in generating practice plans, making it possible to suggest practice content tailored to specific emotional states.
[0123] Taking into account the emotional state identified by the emotion engine, the server generates a personalized practice plan. For example, if the user is determined to be stressed, it can provide vocal exercises to promote relaxation or training using familiar music. It can also provide positive feedback to boost the user's motivation.
[0124] Finally, the server sends the generated practice plan and emotion-based feedback to the terminal. The terminal displays these to the user in the application, providing guidance for the user's actual training. In this way, the user can train in a way that is optimized for them, receiving not only support for improving their singing ability but also emotional support.
[0125] For example, if a user uses this system to calm their nerves before a presentation, the emotion engine can detect their anxiety and suggest practice methods to promote relaxation. This allows the user to effectively improve both their mental state and their skills.
[0126] The following describes the processing flow.
[0127] Step 1:
[0128] The user launches a dedicated application and sings a selected song using the recording function. After finishing singing, the user presses the stop button, and the singing voice is saved on the terminal as digital audio data.
[0129] Step 2:
[0130] The terminal prepares to send the stored audio data to the server and transmits the data to the server via a communication protocol.
[0131] Step 3:
[0132] The server begins processing the received audio data. Using an AI model, it extracts indicators of pitch, rhythm, and vocal quality, and generates a numerical score related to singing ability.
[0133] Step 4:
[0134] The server further uses an emotion engine to identify the user's emotional state based on the tone, pitch, and speed of the voice extracted from the audio data. This emotional information is used as supplementary data for generating practice plans.
[0135] Step 5:
[0136] The server generates a personalized practice plan, taking into account the analysis results and the user's emotional information. This plan is optimized to improve specific singing techniques and is tailored to the user's mental state.
[0137] Step 6:
[0138] The generated practice plan and emotion-based feedback are sent from the server to the terminal. The terminal displays this information within the user interface, allowing the user to review the content.
[0139] Step 7:
[0140] Users review the displayed feedback and practice plan, and then follow the instructions to train. By repeating this process, users can improve their singing technique and emotional control.
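Step 5 above, in which the practice plan is adjusted to the identified emotional state, might be sketched as follows. The `PracticePlan` class, the state labels `"stressed"` and `"anxious"`, and the wording of the added exercises and feedback are assumptions; the disclosure gives relaxation exercises, familiar music, and positive feedback only as examples.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PracticePlan:
    exercises: List[str]
    feedback: str = ""


# Hypothetical labels that the emotion engine might output for a tense user.
TENSE_STATES = {"stressed", "anxious"}


def adjust_for_emotion(plan: PracticePlan, emotional_state: str) -> PracticePlan:
    """Return a copy of the plan adapted to the user's emotional state."""
    exercises = list(plan.exercises)  # copy so the original plan is left unchanged
    feedback = plan.feedback
    if emotional_state in TENSE_STATES:
        # Start with relaxation and end with familiar music, as in the example above.
        exercises.insert(0, "Relaxation vocal warm-up")
        exercises.append("Sing along to a familiar song")
        feedback = "Nice work today. Take it easy and enjoy the session."
    return PracticePlan(exercises, feedback)


base = PracticePlan(["Pitch drills on the chorus"])
adjusted = adjust_for_emotion(base, "stressed")
```

Returning a new plan rather than modifying the original keeps the technique-focused plan available in case the emotional state is later re-evaluated.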
[0141] (Example 2)
[0142] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0143] Conventional voice analysis systems can evaluate a user's singing ability, but they have difficulty providing training plans that take into account the user's emotional state. As a result, they have been unable to provide optimal training for the user, and the user's singing ability could not be improved effectively.
[0144] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0145] In this invention, the server includes a receiving means for acquiring voice information from the user, an evaluation means for analyzing the received voice information and evaluating singing ability, and an identification means for analyzing the voice information and identifying the user's emotional state. This makes it possible to provide a more personalized training plan that takes into account the user's emotional state in addition to their singing ability.
[0146] "Voice information" refers to data obtained by recording the user's voice and expressing it in a format that the system can analyze.
[0147] A "user" is an individual who uses a voice analysis system with the aim of improving their own singing ability.
[0148] "Receiving means" refers to a method or device for the system to acquire voice information transmitted by a user.
[0149] "Analysis" refers to the analytical process performed by the system to evaluate and identify singing ability and emotional state based on acquired audio information.
[0150] "Evaluation means" refers to a method or apparatus for analyzing audio information and expressing singing ability as a numerical value or indicator.
[0151] "Generation means" refers to a method or apparatus for creating a training plan suitable for the user based on the analysis results.
[0152] "Means of delivery" refers to a method or device for presenting the generated training plan to the user and using it as guidance for the training.
[0153] "Identification means" refers to a method or device for identifying a user's emotional state based on voice information.
[0154] "Adjustment means" refers to a method or apparatus for optimizing a training plan based on identified emotional states.
[0155] "Emotional state" refers to the psychological or emotional condition determined based on the user's vocal characteristics.
[0156] This invention relates to a voice analysis system that analyzes a user's singing ability and emotional state and provides an individualized training plan. The system operates primarily with the involvement of a server, a terminal, and the user.
[0157] First, users install a dedicated application on their device and use the recording function to collect their singing voice. The recorded audio data is temporarily stored on the device and then sent to a server via a communication network. At this time, the device appropriately encodes the audio information and sends the data to the server using a secure protocol.
[0158] The server first inputs the received audio information into a generative AI model. The model analyzes technical elements such as pitch, rhythm, and tone quality, and quantifies and evaluates the user's singing ability. This evaluation helps identify areas where the user should improve. Following the evaluation, an emotion engine analyzes the tone, pitch, and speed of the audio information to identify the user's current emotional state.
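The disclosure does not state how pitch is extracted from the audio information. As a minimal sketch under that caveat, a classical autocorrelation estimate for a single audio frame is shown below. It stands in for whatever the AI model does internally and is not the method of this disclosure. The function name `estimate_pitch_hz` and the 80 to 1000 Hz search range are assumptions.

```python
import math

def estimate_pitch_hz(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency of one frame by autocorrelation.

    Returns None when the frame carries no usable periodicity (e.g. silence).
    """
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    # The lag with the strongest self-similarity is taken as the pitch period.
    return sample_rate / best_lag if best_lag else None

# Usage: a synthetic 200 Hz tone sampled at 8 kHz.
frame = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(800)]
pitch = estimate_pitch_hz(frame, 8000)
```

Because the correlation is not normalized, the estimate is biased toward shorter lags; a production analyzer would likely use a more robust method.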
[0159] Next, the server generates an optimized training plan for each user based on the analysis results and emotional state. For example, if tension is detected, it provides a plan that includes exercises to promote relaxation. This training plan helps users improve their singing ability efficiently and continuously.
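The adjustment described above, in which detected tension leads to a plan containing relaxation exercises, can be sketched as follows. The exercise texts, the emotion label `"tense"`, the metric names, and the `TrainingPlan` structure are all hypothetical; the disclosure leaves the concrete plan contents to the generative AI model.

```python
from dataclasses import dataclass, field

# Hypothetical exercise catalogue; the disclosure lists no concrete exercises.
SKILL_EXERCISES = {
    "pitch": "Sing slow scales against a reference tone",
    "rhythm": "Clap and sing along with a metronome",
    "tone": "Sustain long vowels with steady breath support",
}
RELAXATION_EXERCISE = "Breathing and humming warm-up to release tension"

@dataclass
class TrainingPlan:
    focus: str
    exercises: list = field(default_factory=list)

def generate_plan(scores, emotion):
    """Build a plan from per-metric scores (0-100) and an emotion label."""
    # Target the lowest-scoring metric first.
    focus = min(scores, key=scores.get)
    plan = TrainingPlan(focus=focus, exercises=[SKILL_EXERCISES[focus]])
    # As in the text: when tension is detected, lead with a relaxation exercise.
    if emotion == "tense":
        plan.exercises.insert(0, RELAXATION_EXERCISE)
    return plan
```

For instance, `generate_plan({"pitch": 60, "rhythm": 85, "tone": 75}, "tense")` yields a plan focused on pitch whose first exercise is the relaxation warm-up.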
[0160] Finally, the generated training plan and feedback are sent to the device and made available to the user within the application. The user can then refer to this information to perform appropriate training at home or elsewhere.
[0161] As a concrete example, a prompt message may read as follows.
[0162] "Evaluate this singing voice, analyze the user's emotions based on their tone and pitch, and then propose an optimal voice training plan tailored to those emotions."
[0163] In this way, the system allows users to achieve not only technical singing skills but also emotional growth.
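One possible way to assemble such a prompt from the analysis results is sketched below. The instruction sentence is the one quoted above; the helper name `build_prompt`, the layout of the appended indicator lines, and the 0 to 100 scale are assumptions, since the disclosure does not define how measured data accompanies the prompt.

```python
def build_prompt(scores, emotion=None):
    """Compose a text prompt from analysis results (hypothetical helper)."""
    lines = [
        "Evaluate this singing voice, analyze the user's emotions based on "
        "their tone and pitch, and then propose an optimal voice training "
        "plan tailored to those emotions.",
        "",
        "Measured indicators (0-100):",
    ]
    # Sort the metrics so the prompt text is reproducible between runs.
    for name, value in sorted(scores.items()):
        lines.append(f"- {name}: {value}")
    if emotion is not None:
        lines.append(f"Emotion detected by the emotion engine: {emotion}")
    return "\n".join(lines)
```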
[0164] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0165] Step 1:
[0166] The user records their singing voice using a dedicated application. The recorded audio information is temporarily stored on the device. At this stage, the input is the user's raw singing voice, which is stored on the device as digital audio data. Specifically, the audio is captured through the device's microphone and encoded into WAV or MP3 format data.
[0167] Step 2:
[0168] The device sends the stored audio data to the server. During transmission, the audio data is appropriately encoded and encrypted before being sent to the server. The input for this step is the encoded audio data, and the output is the data as it arrives at the server. Specifically, network protocols (e.g., HTTPS) are used to ensure data security during transmission.
[0169] Step 3:
[0170] The server analyzes the received audio data. Using the received audio data as input, the server's AI model analyzes pitch, rhythm, and sound quality, generating a numerical score. The output provides a detailed numerical evaluation for each metric. Specifically, the AI model uses deep learning techniques to analyze the data and evaluate singing skills.
[0171] Step 4:
[0172] The server uses an emotion engine to identify the user's emotional state from the voice data. The input is changes in tone, pitch, and speed of the voice, and the output is the identified emotional state of the user. Specifically, an emotion analysis algorithm using voice features is employed.
[0173] Step 5:
[0174] The server generates an optimal training plan for the user based on the analysis results and emotional state. The input is the singing skill evaluation and emotional identification results obtained so far, and the output is an individualized training plan. Specifically, the generative AI model combines each piece of data to construct the training content.
[0175] Step 6:
[0176] The server sends the generated training plan and feedback to the terminal. The input here is the generated training plan and feedback, and the output is the same data as received by the terminal. Specifically, the data is encrypted and transmitted using a communication protocol.
[0177] Step 7:
[0178] The application displays the training plan and feedback received by the device to the user. The input is the data received from the server, and the output is visual guidance that the user can view and confirm. Specifically, the application provides buttons and menus, designed according to UI/UX principles, through which the user can display and operate the information.
[0179] Step 8:
[0180] The user actually performs the training. The input is the training plan and feedback provided on the device, and the output is the improved singing skills and emotional state achieved through the training. Specifically, the user follows instructions, practices singing, and records their own progress.
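Step 2 above names HTTPS as an example protocol for the upload. A minimal sketch of how a terminal might prepare such a request with the Python standard library follows. The endpoint URL and the helper name `build_upload_request` are hypothetical, as the disclosure specifies no server address or API. The request is only constructed here, not sent.

```python
import urllib.request

# Hypothetical endpoint; the disclosure names HTTPS but no URL or API.
UPLOAD_URL = "https://voice-analysis.example.com/api/recordings"

def build_upload_request(audio_bytes, content_type="audio/wav", url=UPLOAD_URL):
    """Prepare (but do not send) an HTTPS POST carrying encoded audio."""
    # Refuse plain HTTP so the recording is always encrypted in transit.
    if not url.lower().startswith("https://"):
        raise ValueError("audio must be uploaded over HTTPS")
    return urllib.request.Request(
        url,
        data=audio_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the actual transfer, with TLS providing the encryption during transmission.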
[0181] (Application Example 2)
[0182] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0183] Traditional voice analysis systems focus on evaluating a user's singing ability, but they do not provide feedback or practice plans that take into account the user's emotional state. Therefore, they fail to maximize user psychological satisfaction and practice effectiveness, making it difficult to provide a learning experience optimized for each individual user.
[0184] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0185] In this invention, the server includes an acquisition means for acquiring audio data from the user, an evaluation means for analyzing the acquired audio data and evaluating singing ability, a generation means for generating a practice plan suitable for the user based on the analysis results, an identification means for identifying the user's emotional state and providing feedback according to that state, and a provision means for providing the generated practice plan and feedback to the user. This makes it possible not only to improve the user's singing ability but also to provide an individualized practice experience that takes into account their emotional state.
[0186] "Voice data" refers to information obtained by converting a user's speech into digital signals.
[0187] "Acquisition means" refers to a method or apparatus for collecting voice data from a user.
[0188] "Analysis" is the process of extracting specific indicators and features based on audio data and evaluating them.
[0189] "Evaluation means" refers to a method or apparatus for analyzing acquired audio data to quantify or index singing ability.
[0190] "Generation means" refers to a method or apparatus for creating a practice plan suitable for the user based on the analysis results.
[0191] "Identification means" refers to a method or apparatus for identifying a user's emotional state from audio data.
[0192] "Feedback" is the process of returning information to the user based on the results of voice analysis and their emotional state.
[0193] "Means of delivery" refers to a method or device for presenting the generated practice plan and feedback to the user.
[0194] The system that realizes this invention works by having the user record their singing voice using a dedicated application and send the data to a server. The terminal first uses a microphone to record the voice and temporarily saves the recorded data to local storage. Then, it uploads this to the server. The server uses an AI model, built with a framework such as TensorFlow, to analyze the singing ability from the received voice data. Key indicators such as pitch, rhythm, and vocal characteristics are calculated.
[0195] Furthermore, the server uses an emotion engine to identify the user's emotional state based on the voice data. This emotion engine analyzes changes in voice tone, pitch, and speed to determine the user's psychological state. Emotion analysis is performed using technologies such as Amazon Rekognition.
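The internal logic of the emotion engine is not disclosed. Purely as an illustrative stand-in, the sketch below labels a recording from two of the cues named above, pitch stability and speed, using fixed thresholds. The labels, the threshold values, and the function name `identify_emotion` are assumptions, and the pitch list is assumed to cover a single sustained note; a real emotion engine would typically be a trained model.

```python
from statistics import mean, pstdev

def identify_emotion(pitch_hz, tempo_ratio,
                     pitch_jitter_limit=0.08, rush_limit=1.10):
    """Label a take as 'tense', 'low_energy', or 'calm' with simple thresholds.

    pitch_hz: per-frame pitch estimates in Hz for one sustained note.
    tempo_ratio: sung tempo divided by the song's reference tempo.
    """
    if not pitch_hz:
        raise ValueError("pitch_hz must contain at least one voiced frame")
    # Relative pitch instability: standard deviation as a fraction of the mean.
    jitter = pstdev(pitch_hz) / mean(pitch_hz)
    if jitter > pitch_jitter_limit and tempo_ratio > rush_limit:
        return "tense"       # unstable pitch while rushing ahead of the beat
    if tempo_ratio < 1.0 / rush_limit:
        return "low_energy"  # dragging well behind the reference tempo
    return "calm"
```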
[0196] Based on this data, the server generates a personalized practice plan. This plan takes into account the analyzed singing skills and emotional state, and incorporates specific skill-building training and emotional support.
[0197] Finally, the generated practice plan and feedback are sent to the device. Users can review this on their smartphone or tablet and practice appropriately. For example, if the user is feeling nervous, singing exercises to help them relax will be provided.
[0198] For example, if a user wants to alleviate nervousness before a presentation, the emotion engine will detect their level of tension and suggest relaxing exercises. This allows the user to improve both their mental state and singing skills. An example of a prompt to the generative AI model would be, "Please suggest relaxing singing exercises. The user is currently feeling nervous."
[0199] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0200] Step 1:
[0201] The user records audio using their device. The input is the user's raw singing voice, and the output is digitized audio data. The audio is captured using a microphone and temporarily stored in local storage.
[0202] Step 2:
[0203] The terminal sends audio data to the server. The input is the audio data stored on the terminal, and the output is the audio data transferred to the server. A communication module is used to transmit the data.
[0204] Step 3:
[0205] The server analyzes the received audio data using an AI model. The input is the audio data sent to the server, and the output is evaluation metrics such as pitch, rhythm, and voice quality. TensorFlow and similar tools are used to analyze the data and generate numerical scores.
[0206] Step 4:
[0207] The server analyzes the tone, pitch, and speed of the audio data to identify the user's emotional state. The input is audio data, and the output is the result of the emotional state identification. This is achieved using an emotion engine, such as Amazon Rekognition.
[0208] Step 5:
[0209] The server generates individual practice plans based on vocal ability evaluations and emotional states. The inputs are analyzed evaluation metrics and emotional states, and the output is a customized practice plan. The generation means performs the optimization processing.
[0210] Step 6:
[0211] The server sends the generated practice plan and emotion-based feedback to the terminal. The input is the practice plan and feedback content, and the output is the information presented to the terminal. Data is transmitted via a communication protocol.
[0212] Step 7:
[0213] The user checks feedback from their device and practices the provided exercises. The input is the exercise plan and feedback displayed on the device, and the output is the user's actions. Feedback is displayed and guidance is provided to the user using the device interface.
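The data passed from the server to the terminal in Step 6 could, for example, be serialized as JSON. The following sketch assumes a hypothetical `PracticeResponse` structure with illustrative field names; the disclosure itself does not prescribe a message format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PracticeResponse:
    scores: dict      # e.g. {"pitch": 62, "rhythm": 80} (assumed 0-100 scale)
    emotion: str      # label produced by the emotion engine
    exercises: list   # ordered exercise descriptions
    feedback: str     # emotion-based message shown to the user

def encode_response(response):
    """Server side: serialize the plan and feedback for transmission."""
    return json.dumps(asdict(response), ensure_ascii=False, sort_keys=True)

def decode_response(payload):
    """Terminal side: rebuild the structure before displaying it."""
    return PracticeResponse(**json.loads(payload))
```

Setting `ensure_ascii=False` keeps non-ASCII feedback text, such as Japanese, readable in the payload.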
[0214] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0215] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0216] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0217] [Second Embodiment]
[0218] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0219] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0220] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0221] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0222] The microphone 238 receives instructions from the user 20 by capturing the user's voice. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0223] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0224] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0225] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0226] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0227] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0228] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0229] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0230] This invention is a voice analysis system designed to help users improve their singing ability. The system acquires voice data and analyzes pitch, rhythm, and voice quality to provide the user with an optimal practice plan.
[0231] First, the user records their singing voice using a dedicated application. This application runs on smartphones and PCs, and allows users to input audio data through its user interface. Once recording is complete, the device receives this audio data and sends it to a server via the internet.
[0232] The server analyzes the received audio data based on an AI model. The AI model evaluates various metrics in the audio and generates analysis results for pitch, rhythm, and voice quality. These results are used to quantify the user's singing ability.
[0233] Next, the server generates an individualized practice plan based on the analysis results. Drawing on a comparison with past data, this practice plan includes specific exercises and training content designed to strengthen the user's weaknesses and further develop their strengths. The generated plan is sent from the server to the terminal and provided to the user.
[0234] The device visually displays the received practice plan to the user within the application. The user can review this feedback and improve their singing technique by incorporating the suggested practice into their daily life. This allows users to efficiently improve their singing ability without being limited by time or location.
[0235] For example, if a user wants to improve their pitch accuracy in karaoke, they can use this system to analyze their singing voice and receive a pitch-focused practice plan. This allows the user to learn a specific pitch practice approach and focus on their own challenges.
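Pitch accuracy of this kind is commonly expressed as the deviation, in cents, between the sung frequency and the nearest musical note. The sketch below shows that calculation; equal temperament with A4 at 440 Hz is a widely used convention assumed here, not something the disclosure specifies, and the function name `nearest_note` is hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(freq_hz, a4_hz=440.0):
    """Return (note name, deviation in cents) for a sung frequency."""
    if freq_hz <= 0:
        raise ValueError("freq_hz must be positive")
    # Fractional MIDI note number: 69 is A4, one unit is one semitone.
    midi = 69 + 12 * math.log2(freq_hz / a4_hz)
    nearest = round(midi)
    # One semitone equals 100 cents; positive means the singer is sharp.
    cents = 100 * (midi - nearest)
    name = NOTE_NAMES[nearest % 12] + str(nearest // 12 - 1)
    return name, cents
```

A frequency of 450 Hz, for instance, maps to A4 at roughly 39 cents sharp, which a pitch-focused practice plan could then target.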
[0236] The following describes the processing flow.
[0237] Step 1:
[0238] The user launches a dedicated application and selects a song they want to sing. The user then presses the record button on their device to record their singing voice.
[0239] Step 2:
[0240] Once recording is complete, the device temporarily stores the recorded audio data in digital format. This data is saved at the required accuracy and sample rate for later analysis.
[0241] Step 3:
[0242] The device generates a communication request to send the stored audio data to the server and sends this data to the server via the internet.
[0243] Step 4:
[0244] The server inputs the received audio data into an AI model for analysis. The AI model analyzes the audio data and extracts singing indicators such as pitch, rhythm, and vocal quality.
[0245] Step 5:
[0246] The server quantifies the user's singing ability based on the analysis results of the AI model. The resulting numerical score objectively evaluates the user's performance.
[0247] Step 6:
[0248] The server compares the evaluation results with past data to generate a practice plan tailored to the user's characteristics. This plan includes specific practice exercises and recommended songs.
[0249] Step 7:
[0250] The server sends the generated practice plan and evaluation results to the terminal.
[0251] Step 8:
[0252] The device displays the received practice plan and evaluation results in the user interface, allowing the user to review the content. The user then uses this information to begin training and aim to improve their singing ability.
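Step 2 of the flow above stores the recording in digital format at a required accuracy and sample rate. As a minimal sketch, the following encodes samples as a mono 16-bit PCM WAV file in memory using the Python standard library. The choice of 16-bit mono and the default rate of 44100 Hz are assumptions; the disclosure does not fix these values.

```python
import io
import struct
import wave

def encode_wav(samples, sample_rate=44100):
    """Encode float samples in [-1.0, 1.0] as mono 16-bit PCM WAV bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit accuracy
        wav.setframerate(sample_rate)  # samples per second
        # Clip to the valid range, then scale to signed 16-bit integers.
        pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()
```

The returned bytes can be written to local storage or handed directly to the upload step.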
[0253] (Example 1)
[0254] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0255] The present invention aims to provide a method for users to effectively improve their voice and singing skills. Current technology makes it difficult to generate training plans based on the individual characteristics of each user, and standard plans are often applied. Therefore, there is a challenge in that efficient training that takes into account the weaknesses and strengths of individual users cannot be achieved.
[0256] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0257] In this invention, the server includes a receiving means for acquiring voice information from a user, an evaluation means for analyzing the received voice information and evaluating singing skills, and a generating means for generating a training plan suitable for the user based on the analysis results. This makes it possible to provide a training plan optimized for each individual user.
[0258] "Voice information" refers to data that records the content of a user's speech and represents it in digital format.
[0259] A "user" is an individual who wishes to have their voice analyzed and trained through this system.
[0260] "Receiving means" refers to a device or program that has the function of acquiring voice information transmitted by a user and incorporating it into the system.
[0261] "Analysis" is the process of extracting and evaluating indicators such as pitch, rhythm, and sound quality from received audio information using a specific algorithm.
[0262] "Singing ability" is a general term for the musical expressiveness and technique possessed by the user.
[0263] "Evaluation means" refers to a device or program that has the function of objectively evaluating a user's singing ability based on data obtained through analysis.
[0264] "Generation means" refers to a device or program that has the function of constructing individual training plans based on evaluation results.
[0265] A "training plan" is a learning program that combines specific exercises and practice content with the aim of improving the user's singing skills.
[0266] "Means of delivery" refers to a device or program that has the function of presenting the generated training plan to the user visually or audibly.
[0267] A "generative artificial intelligence model" is a model built using machine learning algorithms and used for analyzing speech information.
[0268] This invention is a voice analysis system designed to help users improve their singing skills. This system supports the improvement of musical expressiveness by acquiring and analyzing the user's voice information and providing an appropriate training plan. Embodiments of this invention are described below.
[0269] Users record audio information using a dedicated application installed on their smartphone or PC. This application features a user-friendly interface, allowing users to easily record audio and send it to the system with the press of a button. The recorded audio information is transmitted from the device to the server via the internet. During transmission, the data is encrypted to ensure communication security.
[0270] The server stores the received audio information and uses a generative AI model to analyze pitch, rhythm, and tone quality. This AI model utilizes speech recognition technology to extract features from the audio data and evaluate singing ability. After analysis, the server generates a training plan tailored to the user's characteristics based on the evaluation results. This training plan includes specific training content to improve musical skills and is optimized considering the user's past data and similar datasets.
[0271] The generated training plan is sent from the server to the terminal and presented to the user visually within the application. Based on this feedback, the user can practice regularly, enabling them to efficiently improve their singing skills.
[0272] For example, if a user wants to improve their pitch accuracy, they can use the system to analyze their voice information and receive a pitch-specific training plan. By practicing according to this plan, the user can focus on specific musical challenges.
[0273] An example of a prompt would be, "Analyze my singing voice and create a training plan that clearly identifies areas for improvement." Through this prompt, the results of the speech analysis performed by the generative AI model can be optimally utilized.
[0274] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0275] Step 1:
[0276] The user launches a dedicated application and records their singing using the microphone on their smartphone or PC. Recording begins when the user presses the record button and ends when the user taps the end button. During this step, the user's voice information is converted from an analog signal to a digital signal and saved to the device.
[0277] Step 2:
[0278] The terminal sends the recorded voice information file to the server via the Internet. When sending, the voice information is securely delivered to the server through an encryption process. The input of this step is a digital voice file, and the output is stored as secure data on the server.
[0279] Step 3:
[0280] The server decodes the received voice information in preparation for analysis and inputs it into the generative AI model. This model analyzes the voice information and generates indicators for judging pitch, rhythm, and sound quality. The input is the decoded digital voice information, and the output is the analyzed indicator data of singing skills.
[0281] Step 4:
[0282] Based on the analysis results, the server generates a training plan suitable for the user. An individually customized training plan is designed using evaluation indicators. In this process, past user data and related datasets are used as references. The input is the analysis indicator data, and the output is the generated training plan.
[0283] Step 5:
[0284] The server sends the generated training plan to the terminal. This plan is converted into a format that can be easily displayed on the user page or application. The transmitted data is encrypted to ensure the security of communication. The input is the training plan data, and the output is the training plan stored on the terminal.
[0285] Step 6:
[0286] The terminal visually displays the received training plan to the user through the application interface. The user can check the training content on the screen and carry out daily training by following the instructions. The input for this step is the training plan stored in the terminal, and the output is the user's own practice actions.
[0287] (Application Example 1)
[0288] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".
[0289] In music and singing training, there is a problem that it is difficult to obtain real-time feedback according to the characteristics and weaknesses of individual voices. Especially in the context of self-training, there is often a lack of specialized knowledge and training, and effective practice cannot be carried out. Therefore, it is necessary to provide means for effectively improving singing techniques.
[0290] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0291] In this invention, the server includes a receiving means for obtaining voice data from the user, an evaluation means for analyzing the received voice data to evaluate the voice technique, and a generating means for generating a practice plan suitable for the user based on the analysis result. As a result, the robot records the user's voice and provides evaluations in real time, making it possible to effectively and continuously improve singing techniques.
[0292] "Voice data" is a collection of digital signals obtained by recording the user's voice.
[0293] "User" refers to an individual who wants to improve their singing technique using this invention.
[0294] "Receiving means" refers to the process or device for obtaining voice data from the user.
[0295] "Evaluation means" refers to the process of analyzing received audio data and expressing voice skills using numerical values or indicators based on the results.
[0296] "Generating means" refers to the process of creating individual practice plans based on evaluation results.
[0297] A "practice plan" includes guidelines for individually customized exercises and training designed to improve the user's singing technique.
[0298] "Means of delivery" refers to the processes and devices used to communicate the generated practice plan to the user.
[0299] A "robot" refers to an autonomous machine or device that provides voice recording and feedback in real time.
[0300] "Real-time" refers to a process where data is processed and feedback is provided within a timeframe close to the moment it is acquired.
[0301] The system for realizing this invention requires a series of processes for acquiring, analyzing, evaluating, and providing feedback on audio data.
[0302] First, voice data is acquired from the user through the robot. The robot uses its built-in microphone to record the user's speech. This recorded data is then sent in real time to a locally built-in AI model.
[0303] The server analyzes the audio data received from the robot. This analysis utilizes generative AI models based on libraries such as TensorFlow and PyTorch. These models evaluate pitch, rhythm, and voice quality from the audio data and quantify the results.
[0304] Based on the evaluation results, the server generates an optimal practice plan for the user. This practice plan includes specific training instructions to reinforce the user's weaknesses compared to past practice data. In this generation process, a database management system (DBMS) is used for collation with past data.
[0305] Finally, the terminal provides the generated practice plan to the user. The robot provides real-time audio feedback using the built-in speaker and visually displays instructions and analysis results on the display.
[0306] As a specific example, when a user wants to improve the accuracy of a specific pitch, the robot can provide real-time feedback with specific advice such as "try aiming for a slightly higher pitch" based on the recorded data. In this process, instruction sentences such as "Please analyze the following audio data, evaluate the pitch, generate feedback content based on the analysis results, and return specific advice regarding the pitch." are used in the prompt text.
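The pitch advice in the example above can be illustrated with a minimal, non-limiting sketch. It assumes that the sung and target fundamental frequencies have already been estimated from the recording. The function names, the 20-cent tolerance, and the wording of the "lower pitch" and "on target" messages are illustrative assumptions and are not specified in this disclosure.

```python
import math


def cents_off(sung_hz: float, target_hz: float) -> float:
    """Signed deviation of the sung pitch from the target in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(sung_hz / target_hz)


def pitch_advice(sung_hz: float, target_hz: float, tolerance_cents: float = 20.0) -> str:
    """Turn a pitch deviation into short advice; the tolerance is an assumed value."""
    deviation = cents_off(sung_hz, target_hz)
    if abs(deviation) <= tolerance_cents:
        return "Your pitch is on target."
    if deviation < 0:
        # The user sang flat relative to the target note.
        return "Try aiming for a slightly higher pitch."
    return "Try aiming for a slightly lower pitch."


# Example: the target note is A4 (440 Hz) and the user sang about 432 Hz.
print(pitch_advice(432.0, 440.0))
```

Expressing the deviation in cents rather than hertz keeps the tolerance musically uniform across low and high notes.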
[0307] The flow of the specific process in Application Example 1 will be described using FIG. 12.
[0308] Step 1:
[0309] The user records audio using the microphone built into the robot. The input is the user's voice, and the output is saved as audio data. The robot converts the audio data into digital format and prepares it for analysis.
[0310] Step 2:
[0311] The device sends the recorded audio data to a server for analysis by a generative AI model. The input is audio data, and the AI model evaluates pitch, rhythm, and voice quality. The output is the analyzed data, with each metric represented numerically. Specifically, frameworks such as TensorFlow and PyTorch are used to extract features from the audio data.
[0312] Step 3:
[0313] The server generates an optimal training plan for the user based on the analysis results. The input consists of the analyzed data and past training data, and the output is a customized training plan. Data processing involves comparison with similar data sets, and individual training instructions are incorporated.
[0314] Step 4:
[0315] The device provides the user with a generated practice plan and offers audio and visual feedback. The input is the practice plan, and the output is notification information for the user. Specifically, it provides audio feedback through its built-in speaker and displays the practice content on the screen.
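The comparison with past training data in Step 3 could, for example, be sketched as follows. The sketch uses an in-memory SQLite database as a stand-in for the DBMS. The table layout, the 0 to 100 score scale, and the names open_store, save_session, change_vs_history, and focus_metric are hypothetical and are not part of this disclosure.

```python
import sqlite3

METRICS = ("pitch", "rhythm", "voice_quality")


def open_store() -> sqlite3.Connection:
    """Create an in-memory store of past sessions (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sessions (user_id TEXT, pitch REAL, rhythm REAL, voice_quality REAL)"
    )
    return conn


def save_session(conn: sqlite3.Connection, user_id: str, scores: dict) -> None:
    """Store one session's scores for later comparison."""
    conn.execute(
        "INSERT INTO sessions VALUES (?, ?, ?, ?)",
        (user_id, scores["pitch"], scores["rhythm"], scores["voice_quality"]),
    )


def change_vs_history(conn: sqlite3.Connection, user_id: str, latest: dict) -> dict:
    """Return latest score minus the user's historical average, per metric."""
    row = conn.execute(
        "SELECT AVG(pitch), AVG(rhythm), AVG(voice_quality) "
        "FROM sessions WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    if row[0] is None:  # AVG over zero rows yields NULL: no history yet
        return {m: 0.0 for m in METRICS}
    return {m: latest[m] - avg for m, avg in zip(METRICS, row)}


def focus_metric(conn: sqlite3.Connection, user_id: str, latest: dict) -> str:
    """Pick the metric that dropped most; ties go to the lowest latest score."""
    deltas = change_vs_history(conn, user_id, latest)
    return min(METRICS, key=lambda m: (deltas[m], latest[m]))


conn = open_store()
save_session(conn, "user-1", {"pitch": 70, "rhythm": 80, "voice_quality": 75})
save_session(conn, "user-1", {"pitch": 74, "rhythm": 84, "voice_quality": 77})
print(focus_metric(conn, "user-1", {"pitch": 75, "rhythm": 78, "voice_quality": 76}))
```

The metric returned by focus_metric could then be used to select the training instructions that are incorporated into the plan.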
[0316] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0317] This invention provides a technology that combines an emotion engine with a voice analysis system that supports the improvement of users' singing abilities. The system acquires the user's singing voice, generates evaluations and practice plans based on that voice data, and at the same time identifies the user's emotional state and provides feedback and adjusts the plan accordingly.
[0318] Users record their singing voices using a dedicated application. This recording data is temporarily stored on the device and then sent to a server. The server analyzes the received data using an AI model to evaluate singing ability. It quantifies pitch, rhythm, voice quality, etc., and generates a numerical score.
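As one hedged illustration of how a rhythm indicator might be turned into a numerical score, the following sketch compares note onset times with a fixed-tempo beat grid. It assumes that the onset times have already been detected from the recording and that the song has a constant tempo starting at time zero. The function name and the scoring formula are assumptions for illustration and do not represent the AI model described in this disclosure.

```python
def rhythm_score(onsets_s: list[float], tempo_bpm: float) -> float:
    """Return a 0-100 timing score from note onset times given in seconds.

    Each onset is compared with the nearest beat of a fixed-tempo grid that
    starts at time zero. An average error of half a beat (the largest
    possible distance to the nearest beat) scores 0.
    """
    if not onsets_s:
        raise ValueError("at least one onset is required")
    beat_s = 60.0 / tempo_bpm
    errors = [abs(t - round(t / beat_s) * beat_s) for t in onsets_s]
    mean_error = sum(errors) / len(errors)
    return max(0.0, 100.0 * (1.0 - mean_error / (beat_s / 2.0)))


# Example: four notes, each about 50 ms late, at 120 beats per minute.
print(round(rhythm_score([0.05, 0.55, 1.05, 1.55], tempo_bpm=120), 1))
```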
[0319] Furthermore, the server analyzes changes in the user's voice tone, pitch, and speed, allowing the emotion engine to identify the user's emotional state. Based on this emotional data, it understands the user's mood while singing. This information is useful in generating practice plans, making it possible to suggest practice content tailored to specific emotional states.
[0320] Taking into account the emotional state identified by the emotion engine, the server generates a personalized practice plan. For example, if the user is determined to be stressed, it can provide vocal exercises to promote relaxation or training using familiar music. It can also provide positive feedback to boost the user's motivation.
[0321] Finally, the server sends the generated practice plan and emotion-based feedback to the device. The device displays these to the user on the application, providing guidance for the user to actually train. In this way, the user can train in a way that is optimized for them, receiving support not only for improving their singing ability but also emotionally.
[0322] For example, if a user uses this system to calm their nerves before a presentation, the emotion engine can detect their anxiety and suggest practice methods to promote relaxation. This allows the user to effectively improve both their mental state and their skills.
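The adjustment of a practice plan according to the identified emotional state can be sketched, in a non-limiting way, as a simple rule table. The PracticePlan structure, the emotion labels, and the specific exercises and messages below are hypothetical. In the system described here the plan would be produced by the server, possibly with a generative AI model, rather than by fixed rules.

```python
from dataclasses import dataclass, field


@dataclass
class PracticePlan:
    exercises: list[str]
    feedback: list[str] = field(default_factory=list)


def adjust_for_emotion(plan: PracticePlan, emotion: str) -> PracticePlan:
    """Return a copy of the plan adapted to the identified emotional state."""
    exercises = list(plan.exercises)
    feedback = list(plan.feedback)
    if emotion in {"stressed", "anxious"}:
        # Start with relaxation and end with familiar material.
        exercises.insert(0, "slow breathing and gentle humming warm-up")
        exercises.append("sing through a familiar song at a comfortable pitch")
        feedback.append("Take your time. Steady progress counts.")
    elif emotion == "discouraged":
        # Add positive feedback to support motivation.
        feedback.append("Your recent scores show real improvement. Keep going.")
    return PracticePlan(exercises, feedback)


base = PracticePlan(["pitch-matching drill on sustained notes"])
for line in adjust_for_emotion(base, "stressed").exercises:
    print(line)
```

Returning a new plan instead of modifying the original keeps the technique-focused plan available if the emotional state is later re-evaluated.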
[0323] The following describes the processing flow.
[0324] Step 1:
[0325] The user launches a dedicated application and sings a selected song using the recording function. After finishing singing, the user presses the stop button, saving the singing voice as digital audio data to the device.
[0326] Step 2:
[0327] The terminal prepares to send the stored audio data to the server and transmits the data to the server via a communication protocol.
[0328] Step 3:
[0329] The server begins processing the received audio data. Using an AI model, it extracts indicators of pitch, rhythm, and vocal quality, and generates a numerical score related to singing ability.
[0330] Step 4:
[0331] The server further uses an emotion engine to identify the user's emotional state based on the tone, pitch, and speed of the voice extracted from the audio data. This emotional information is used as supplementary data for generating practice plans.
[0332] Step 5:
[0333] The server generates a personalized practice plan, taking into account the analysis results and the user's emotional information. This plan is optimized to improve specific singing techniques and is tailored to the user's mental state.
[0334] Step 6:
[0335] The generated practice plan and emotion-based feedback are sent from the server to the device. The device displays this information within the user interface, allowing the user to review the content.
[0336] Step 7:
[0337] Users review the displayed feedback and practice plan, and then follow the instructions to train. By repeating this process, users can improve their singing technique and emotional control.
[0338] (Example 2)
[0339] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0340] Conventional voice analysis systems can evaluate a user's singing ability, but they have difficulty providing training plans that take the user's emotional state into account. As a result, they cannot provide training that is optimal for the user, and the user's singing ability does not improve effectively.
[0341] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0342] In this invention, the server includes a receiving means for acquiring voice information from the user, an evaluation means for analyzing the received voice information and evaluating singing ability, and an identification means for analyzing the voice information and identifying the user's emotional state. This makes it possible to provide a more personalized training plan that takes into account the user's emotional state in addition to their singing ability.
[0343] "Voice information" refers to data obtained by recording the user's voice and expressing it in a format that the system can analyze.
[0344] A "user" is an individual who uses a voice analysis system with the aim of improving their own singing ability.
[0345] "Receiving means" refers to a method or device for the system to acquire voice information transmitted by a user.
[0346] "Analysis" refers to the analytical process performed by the system to evaluate and identify singing ability and emotional state based on acquired audio information.
[0347] "Evaluation means" refers to a method or apparatus for analyzing audio information and expressing singing ability as a numerical value or indicator.
[0348] "Generation means" refers to a method or apparatus for creating a training plan suitable for the user based on the analysis results.
[0349] "Means of delivery" refers to a method or device for presenting the generated training plan to the user and using it as guidance for the training.
[0350] "Identification means" refers to a method or device for identifying a user's emotional state based on voice information.
[0351] "Adjustment means" refers to a method or apparatus for optimizing a training plan based on identified emotional states.
[0352] "Emotional state" refers to the psychological or emotional condition determined based on the user's vocal characteristics.
[0353] This invention relates to a voice analysis system that analyzes a user's singing ability and emotional state and provides an individualized training plan. The system operates primarily with the involvement of a server, a terminal, and the user.
[0354] First, users install a dedicated application on their device and use the recording function to collect their singing voice. The recorded audio data is temporarily stored on the device and then sent to a server via a communication network. At this time, the device appropriately encodes the audio information and sends the data to the server using a secure protocol.
[0355] The server first inputs the received audio information into a generative AI model. The model analyzes technical elements such as pitch, rhythm, and tone quality, and quantifies and evaluates the user's singing ability. This evaluation helps identify areas where the user should improve. Following the evaluation, the tone, pitch, and speed of the audio information are analyzed, and an emotion engine is used to identify the user's current emotional state.
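To make the idea of identifying an emotional state from pitch and speed more concrete, the following is a toy, rule-based sketch. It is only a stand-in for the emotion engine and is not the emotion identification model 59. It assumes two precomputed inputs: per-note pitch deviations from the target melody in cents, and the ratio of the sung tempo to the reference tempo. The labels and thresholds are arbitrary assumptions.

```python
from statistics import mean, pstdev


def estimate_emotion(pitch_dev_cents: list[float], tempo_ratio: float) -> str:
    """Guess a coarse emotional state from pitch stability and tempo.

    pitch_dev_cents: signed deviation of each sung note from its target.
    tempo_ratio: sung tempo divided by reference tempo (1.0 = on tempo).
    """
    instability = pstdev(pitch_dev_cents)  # spread of the pitch errors
    bias = mean(pitch_dev_cents)           # overall sharp (+) or flat (-)
    if instability > 30.0 and tempo_ratio > 1.05:
        return "tense"       # unsteady pitch while rushing
    if bias < -20.0 and tempo_ratio < 0.95:
        return "low_energy"  # consistently flat while dragging
    return "neutral"


print(estimate_emotion([40.0, -35.0, 50.0, -45.0], tempo_ratio=1.10))
```

A learned model would normally replace these hand-written thresholds; the sketch only shows the shape of the input and output.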
[0356] Next, the server generates an optimized training plan for each user based on the analysis results and emotional state. For example, if tension is detected, it provides a plan that includes exercises to promote relaxation. This training plan helps users improve their singing ability efficiently and continuously.
[0357] Finally, the generated training plan and feedback are sent to the device and made available to the user within the application. The user can then refer to this information to perform appropriate training at home or elsewhere.
[0358] As a concrete example, the following prompt may be used.
[0359] "Evaluate this singing voice, analyze the user's emotions based on their tone and pitch, and then propose an optimal voice training plan tailored to those emotions."
[0360] In this way, the system allows users to achieve not only technical singing skills but also emotional growth.
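A prompt such as the one quoted above could be assembled programmatically. The following minimal sketch prepends the quoted instruction to optional context. The build_prompt function, the song title, and the idea of attaching the previous session's scores are assumptions added for illustration. How the audio itself is passed to the generative AI model is outside the scope of this sketch.

```python
INSTRUCTION = (
    "Evaluate this singing voice, analyze the user's emotions based on their "
    "tone and pitch, and then propose an optimal voice training plan tailored "
    "to those emotions."
)


def build_prompt(song_title: str, previous_scores: dict[str, float]) -> str:
    """Combine the instruction sentence with optional session context."""
    lines = [INSTRUCTION, "", f"Song: {song_title}"]
    if previous_scores:
        lines.append("Scores from the previous session:")
        lines.extend(
            f"- {name}: {value:.1f}" for name, value in previous_scores.items()
        )
    return "\n".join(lines)


print(build_prompt("Example Song", {"pitch": 72, "rhythm": 85.5}))
```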
[0361] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0362] Step 1:
[0363] The user records their singing voice using a dedicated application. The recorded audio information is temporarily stored on the device. At this stage, the input is the user's raw singing voice, which is stored on the device as digital audio data. Specifically, the audio is captured through the device's microphone and encoded into WAV or MP3 format data.
[0364] Step 2:
[0365] The device sends the stored audio data to the server. During transmission, the audio data is appropriately encoded and encrypted before being sent to the server. The input for this step is the encoded audio data, and the output is the data as it arrives at the server. Specifically, network protocols (e.g., HTTPS) are used to ensure data security during transmission.
[0366] Step 3:
[0367] The server analyzes the received audio data. Using the received audio data as input, the server's AI model analyzes pitch, rhythm, and sound quality, generating a numerical score. The output provides a detailed numerical evaluation for each metric. Specifically, the AI model uses deep learning techniques to analyze the data and evaluate singing skills.
[0368] Step 4:
[0369] The server uses an emotion engine to identify the user's emotional state from the voice data. The input is changes in tone, pitch, and speed of the voice, and the output is the identified emotional state of the user. Specifically, an emotion analysis algorithm using voice features is employed.
[0370] Step 5:
[0371] The server generates an optimal training plan for the user based on the analysis results and emotional state. The input is the singing skill evaluation and emotional identification results obtained so far, and the output is an individualized training plan. Specifically, the generative AI model combines each piece of data to construct the training content.
[0372] Step 6:
[0373] The server sends the generated training plan and feedback to the terminal. The input is the generated training plan and feedback, and the output is the same plan and feedback as received by the terminal. Specifically, the data is encrypted and transmitted using a communication protocol.
[0374] Step 7:
[0375] The application displays the training plan and feedback received by the device to the user. The input is data received from the server, and the output is visual guidance that the user can see and confirm. Specifically, the application provides buttons and menus, designed according to UI/UX principles, with which the user displays and operates on the information.
[0376] Step 8:
[0377] The user actually performs the training. The input is the training plan and feedback provided on the device, and the output is the improved singing skills and emotional state achieved through the training. Specifically, the user follows instructions, practices singing, and records their own progress.
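Steps 1 and 2 of the flow above (encoding the recording as WAV data and preparing it for transmission over HTTPS) can be sketched with the Python standard library as follows. The sketch builds the upload request but deliberately does not send it. The URL, the function names, the 16-bit mono format, and the synthetic test tone that stands in for a microphone recording are all assumptions for illustration.

```python
import io
import math
import struct
import urllib.request
import wave


def encode_wav(samples: list[float], sample_rate: int = 44100) -> bytes:
    """Encode float samples in [-1, 1] as 16-bit mono PCM WAV bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
        )
        wav.writeframes(frames)
    return buffer.getvalue()


def build_upload_request(wav_bytes: bytes, url: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST of the recording; HTTPS is required."""
    if not url.startswith("https://"):
        raise ValueError("refusing to upload over a non-HTTPS URL")
    return urllib.request.Request(
        url, data=wav_bytes, method="POST", headers={"Content-Type": "audio/wav"}
    )


# A 0.1-second 440 Hz tone stands in for the recorded singing voice.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
request = build_upload_request(
    encode_wav(tone, sample_rate=8000),
    "https://voice.example.com/recordings",  # hypothetical endpoint
)
print(request.get_method(), request.full_url, len(request.data), "bytes")
```

Passing the prepared request to urllib.request.urlopen would perform the actual upload, with transport encryption provided by TLS.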
[0378] (Application Example 2)
[0379] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0380] Traditional voice analysis systems focus on evaluating a user's singing ability, but they do not provide feedback or practice plans that take into account the user's emotional state. Therefore, they fail to maximize user psychological satisfaction and practice effectiveness, making it difficult to provide a learning experience optimized for each individual user.
[0381] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0382] In this invention, the server includes an acquisition means for acquiring audio data from the user, an evaluation means for analyzing the acquired audio data and evaluating singing ability, a generation means for generating a practice plan suitable for the user based on the analysis results, an identification means for identifying the user's emotional state and providing feedback according to that state, and a provision means for providing the generated practice plan and feedback to the user. This makes it possible not only to improve the user's singing ability but also to provide an individualized practice experience that takes into account their emotional state.
[0383] "Voice data" refers to information obtained by converting a user's speech into digital signals.
[0384] "Acquisition means" refers to a method or apparatus for collecting voice data from a user.
[0385] "Analysis" is the process of extracting specific indicators and features based on audio data and evaluating them.
[0386] "Evaluation means" refers to a method or apparatus for analyzing acquired audio data to quantify or index singing ability.
[0387] "Generation means" refers to a method or apparatus for creating a practice plan suitable for the user based on the analysis results.
[0388] "Identification means" refers to a method or apparatus for identifying a user's emotional state from audio data.
[0389] "Feedback" is the process of returning information to the user based on the results of voice analysis and their emotional state.
[0390] "Means of delivery" refers to a method or device for presenting the generated practice plan and feedback to the user.
[0391] The system that realizes this invention works by having the user record their singing voice using a dedicated application and send the data to a server. The terminal first uses a microphone to record the voice and temporarily saves the recorded data to local storage. Then, it uploads the data to the server. The server uses an AI model, built with a framework such as TensorFlow, to analyze singing ability from the received voice data. Key indicators such as pitch, rhythm, and vocal characteristics are calculated.
[0392] Furthermore, the server uses an emotion engine to identify the user's emotional state based on the voice data. This emotion engine analyzes changes in voice tone, pitch, and speed to determine the user's psychological state. Emotion analysis is performed using technologies such as Amazon Rekognition.
[0393] Based on this data, the server generates a personalized practice plan. This plan takes into account the analyzed singing skills and emotional state, and incorporates specific skill-building training and emotional support.
[0394] Finally, the generated practice plan and feedback are sent to the device. Users can review this on their smartphone or tablet and practice appropriately. For example, if the user is feeling nervous, singing exercises to help them relax will be provided.
[0395] For example, if a user wants to alleviate nervousness before a presentation, the emotion engine will detect their level of tension and suggest relaxing exercises. This allows the user to improve both their mental state and singing skills. An example of a prompt to the generative AI model would be, "Please suggest relaxing singing exercises. The user is currently feeling nervous."
[0396] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0397] Step 1:
[0398] The user records audio using their device. The input is the user's raw singing voice, and the output is digitized audio data. The audio is captured using a microphone and temporarily stored in local storage.
[0399] Step 2:
[0400] The terminal sends audio data to the server. The input is the audio data stored on the terminal, and the output is the audio data transferred to the server. A communication module is used to transmit the data.
[0401] Step 3:
[0402] The server analyzes the received audio data using an AI model. The input is the audio data sent to the server, and the output is evaluation metrics such as pitch, rhythm, and voice quality. TensorFlow and similar tools are used to analyze the data and generate numerical scores.
[0403] Step 4:
[0404] The server analyzes the tone, pitch, and speed of the audio data to identify the user's emotional state. The input is audio data, and the output is the result of the emotional state identification. This is achieved using an emotion engine, such as Amazon Rekognition.
[0405] Step 5:
[0406] The server generates individual practice plans based on vocal ability evaluations and emotional states. The inputs are analyzed evaluation metrics and emotional states, and the output is a customized practice plan. Optimization processing is performed using the generation means.
[0407] Step 6:
[0408] The server sends the generated practice plan and emotion-based feedback to the terminal. The input is the practice plan and feedback content, and the output is the information presented to the terminal. Data is transmitted via a communication protocol.
[0409] Step 7:
[0410] The user checks feedback from their device and practices the provided exercises. The input is the exercise plan and feedback displayed on the device, and the output is the user's actions. Feedback is displayed and guidance is provided to the user using the device interface.
[0411] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0412] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0413] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0414] [Third Embodiment]
[0415] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0416] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0417] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0418] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0419] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0420] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0421] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0422] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0423] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0424] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0425] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0426] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0427] This invention is a voice analysis system designed to help users improve their singing ability. The system acquires voice data and analyzes pitch, rhythm, and voice quality to provide the user with an optimal practice plan.
[0428] First, the user records their singing voice using a dedicated application. This application runs on smartphones and PCs, and allows users to input audio data through its user interface. Once recording is complete, the device receives this audio data and sends it to a server via the internet.
[0429] The server analyzes the received audio data based on an AI model. The AI model evaluates various metrics in the audio and generates analysis results for pitch, rhythm, and voice quality. These results are used to quantify the user's singing ability.
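One simple way to quantify singing ability from the per-metric results is a weighted average. The sketch below is an assumption offered for illustration: this disclosure does not state how the metrics are combined, and the weights and the 0 to 100 scale are hypothetical.

```python
# Hypothetical weights; this disclosure does not specify how metrics are combined.
WEIGHTS = {"pitch": 0.4, "rhythm": 0.3, "voice_quality": 0.3}


def overall_score(metrics: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-metric scores, each assumed to be on a 0-100 scale."""
    total_weight = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight


print(round(overall_score({"pitch": 80, "rhythm": 70, "voice_quality": 90}), 1))
```

Dividing by the total weight keeps the result on the same 0 to 100 scale even if the weights are later changed and no longer sum to one.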
[0430] Next, the server generates an individualized practice plan based on the analysis results. This practice plan includes specific exercises and training content designed to strengthen the user's weaknesses and further develop their strengths, compared to past data. The generated plan is sent from the server to the terminal and provided to the user.
[0431] The device visually displays the received practice plan to the user within the application. The user can review this feedback and improve their singing technique by incorporating the suggested practice into their daily life. This allows users to efficiently improve their singing ability without being limited by time or location.
[0432] For example, if a user wants to improve their pitch accuracy in karaoke, they can use this system to analyze their singing voice and receive a pitch-focused practice plan. This allows the user to learn a specific pitch practice approach and focus on their own challenges.
[0433] The following describes the processing flow.
[0434] Step 1:
[0435] The user launches a dedicated application, selects a song they want to sing, and begins recording. The user then presses the record button on their device to record their singing voice.
[0436] Step 2:
[0437] Once recording is complete, the device temporarily stores the recorded audio data in digital format. This data is saved at the required accuracy and sample rate for later analysis.
[0438] Step 3:
[0439] The device generates a communication request to send the stored audio data to the server and sends this data to the server via the internet.
[0440] Step 4:
[0441] The server inputs the received audio data into an AI model for analysis. The AI model analyzes the audio data and extracts singing indicators such as pitch, rhythm, and vocal quality.
[0442] Step 5:
[0443] The server quantifies the user's singing ability based on the analysis results of the AI model. The resulting numerical score objectively evaluates the user's performance.
[0444] Step 6:
[0445] Based on the evaluation results, the server compares them with past data to generate a practice plan tailored to the user's characteristics. This plan includes specific practice exercises and recommended songs.
[0446] Step 7:
[0447] The server sends the generated practice plan and evaluation results to the terminal.
[0448] Step 8:
[0449] The device displays the received practice plan and evaluation results in the user interface, allowing the user to review the content. The user then uses this information to begin training and aim to improve their singing ability.
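Step 2 of this flow states that the recording is saved at the accuracy and sample rate required for later analysis. A hedged sketch of how a stored WAV recording might be checked before upload is shown below. The minimum values (16 kHz, 16-bit), the function names, and the use of the WAV format are assumptions. The blank_wav helper only fabricates test input and is not part of the check itself.

```python
import io
import wave


def recording_problems(
    wav_bytes: bytes, min_rate_hz: int = 16000, min_width_bytes: int = 2
) -> list[str]:
    """List reasons a WAV recording may be unsuitable for analysis (empty if none)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()
    problems = []
    if rate < min_rate_hz:
        problems.append(f"sample rate {rate} Hz is below {min_rate_hz} Hz")
    if width < min_width_bytes:
        problems.append(
            f"sample width {width * 8} bits is below {min_width_bytes * 8} bits"
        )
    if frames == 0:
        problems.append("recording contains no audio frames")
    return problems


def blank_wav(rate_hz: int, width_bytes: int, n_frames: int) -> bytes:
    """Build a mono WAV of zero-valued bytes, used here only as test input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(width_bytes)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (width_bytes * n_frames))
    return buffer.getvalue()


for problem in recording_problems(blank_wav(8000, 1, 100)):
    print(problem)
```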
[0450] (Example 1)
[0451] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0452] The present invention aims to provide a method for users to effectively improve their voice and singing skills. Current technology makes it difficult to generate training plans based on the individual characteristics of each user, and standard plans are often applied. Therefore, there is a challenge in that efficient training that takes into account the weaknesses and strengths of individual users cannot be achieved.
[0453] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0454] In this invention, the server includes a receiving means for acquiring voice information from a user, an evaluation means for analyzing the received voice information and evaluating singing skills, and a generating means for generating a training plan suitable for the user based on the analysis results. This makes it possible to provide a training plan optimized for each individual user.
[0455] "Voice information" refers to data that records the content of a user's speech and represents it in digital format.
[0456] A "user" is an individual who wishes to have their voice analyzed and trained through this system.
[0457] "Receiving means" refers to a device or program that has the function of acquiring voice information transmitted by a user and incorporating it into the system.
[0458] "Analysis" is the process of extracting and evaluating indicators such as pitch, rhythm, and sound quality from received audio information using a specific algorithm.
[0459] "Singing ability" is a general term for the musical expressiveness and technique possessed by the user.
[0460] "Evaluation means" refers to a device or program that has the function of objectively evaluating a user's singing ability based on data obtained through analysis.
[0461] "Generation means" refers to a device or program that has the function of constructing individual training plans based on evaluation results.
[0462] A "training plan" is a learning program that combines specific exercises and practice content with the aim of improving the user's singing skills.
[0463] "Means of delivery" refers to a device or program that has the function of presenting the generated training plan to the user visually or audibly.
[0464] A "generative artificial intelligence model" is a model built using machine learning algorithms and used for analyzing speech information.
[0465] This invention is a voice analysis system designed to help users improve their singing skills. This system supports the improvement of musical expressiveness by acquiring and analyzing the user's voice information and providing an appropriate training plan. Embodiments of this invention are described below.
[0466] Users record audio information using a dedicated application installed on their smartphone or PC. This application features a user-friendly interface, allowing users to easily record audio and send it to the system with the press of a button. The recorded audio information is transmitted from the device to the server via the internet. During transmission, the data is encrypted to ensure communication security.
[0467] The server stores the received audio information and uses a generative AI model to analyze pitch, rhythm, and tone quality. This AI model utilizes speech recognition technology to extract features from the audio data and evaluate singing ability. After analysis, the server generates a training plan tailored to the user's characteristics based on the evaluation results. This training plan includes specific training content to improve musical skills and is optimized considering the user's past data and similar datasets.
[0468] The generated training plan is sent from the server to the terminal and presented to the user visually within the application. Based on this feedback, the user can practice regularly, enabling them to efficiently improve their singing skills.
[0469] For example, if a user wants to improve their pitch accuracy, they can use the system to analyze their voice information and receive a pitch-specific training plan. By practicing according to this plan, the user can focus on specific musical challenges.
[0470] An example of a prompt is: "Analyze my singing voice and create a training plan that clearly identifies areas for improvement." With this prompt, the results of the voice analysis performed by the generative AI model can be put to full use.
[0471] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0472] Step 1:
[0473] The user launches a dedicated application and records their singing using the microphone on their smartphone or PC. The process begins by pressing the record button, and ends when the user taps the end button. During this step, the user's voice information is converted from an analog signal to a digital signal and saved to the device.
[0474] Step 2:
[0475] The terminal sends the recorded audio information file to the server via the internet. During transmission, the audio information is encrypted to ensure it reaches the server securely. The input for this step is a digital audio file, and the output is stored as secure data on the server.
[0476] Step 3:
[0477] The server decodes the received audio information in preparation for analysis and inputs it into a generative AI model. This model analyzes the audio information and generates indices for pitch, rhythm, and sound quality. The input is the decoded digital audio information, and the output is the singing skill index data obtained from the analysis.
[0478] Step 4:
[0479] Based on the analysis results, the server generates a training plan tailored to the user. It designs individually customized training plans using evaluation metrics. This process references historical user data and relevant datasets. The input is the analysis metric data, and the output is the generated training plan.
[0480] Step 5:
[0481] The server sends the generated training plan to the terminal. This plan is converted into a format that can be easily displayed on the user page or within the application. The transmitted data is encrypted to ensure the security of the communication. The input is the training plan data, and the output is the training plan stored on the terminal.
[0482] Step 6:
[0483] The device visually displays the received training plan to the user through the application interface. The user can perform daily training by reviewing the training content on the screen and following the instructions. The input for this step is the training plan stored on the device, and the output is the practice the user carries out based on it.
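As a rough illustration of Steps 3 and 4, the following sketch estimates pitch from a digital signal and turns the deviation from a target note into a plan item. It rests on assumptions that are not part of the disclosure: a simple autocorrelation peak search stands in for the generative AI model, and the function names, the 50-cent tolerance, and the synthetic test tone are all hypothetical.

```python
import math

def synth_tone(freq_hz, sample_rate=8000, duration_s=0.1):
    """Generate a sine tone as a stand-in for decoded audio (assumption)."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def estimate_pitch_hz(samples, sample_rate=8000, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def cents_off(sung_hz, target_hz):
    """Signed deviation from the target in cents (100 cents = 1 semitone)."""
    return 1200 * math.log2(sung_hz / target_hz)

def plan_for_pitch(sung_hz, target_hz, tolerance_cents=50):
    """Hypothetical rule: add pitch drills when the deviation exceeds the tolerance."""
    deviation = cents_off(sung_hz, target_hz)
    if abs(deviation) <= tolerance_cents:
        return ["Maintain pitch with regular scale practice"]
    direction = "flat" if deviation < 0 else "sharp"
    return [f"Pitch-matching drills: notes tend to be {direction} by {abs(deviation):.0f} cents"]
```

For example, `plan_for_pitch(estimate_pitch_hz(synth_tone(200.0)), 220.0)` yields a single drill item for a flat note. A production system would analyze real recordings frame by frame and would also cover rhythm and sound quality.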
[0484] (Application Example 1)
[0485] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0486] In music and singing training, there is a challenge in obtaining real-time feedback tailored to individual vocal characteristics and weaknesses. Especially in self-study settings, a lack of specialized knowledge and training is common, hindering effective practice. Therefore, there is a need to provide means to effectively improve singing technique.
[0487] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0488] In this invention, the server includes a receiving means for acquiring voice data from the user, an evaluation means for analyzing the received voice data and evaluating the voice technique, and a generating means for generating a practice plan suitable for the user based on the analysis results. This enables the robot to record the user's vocalizations on the spot and provide evaluations in real time, thereby allowing for effective and continuous improvement of singing technique.
[0489] "Voice data" is a collection of digital signals that are recordings of the user's speech.
[0490] A "user" is an individual who wishes to improve their singing technique using this invention.
[0491] "Receiving means" refers to the process or device used to acquire audio data from the user.
[0492] "Evaluation means" refers to the process of analyzing received audio data and expressing voice technique as numerical values or indicators based on the results.
[0493] "Generating means" refers to the process of creating individual practice plans based on evaluation results.
[0494] A "practice plan" includes guidelines for individually customized exercises and training designed to improve the user's singing technique.
[0495] "Means of delivery" refers to the processes and devices used to communicate the generated practice plan to the user.
[0496] A "robot" refers to an autonomous machine or device that provides voice recording and feedback in real time.
[0497] "Real-time" refers to a process where data is processed and feedback is provided within a timeframe close to the moment it is acquired.
[0498] The system for realizing this invention requires a series of processes for acquiring, analyzing, evaluating, and providing feedback on audio data.
[0499] First, the robot acquires the user's voice data by recording the user's vocalization with its built-in microphone. This recorded data is then sent in real time to a locally built-in AI model.
[0500] The server analyzes the audio data received from the robot. This analysis utilizes generative AI models based on libraries such as TensorFlow and PyTorch. These models evaluate pitch, rhythm, and voice quality from the audio data and quantify the results.
[0501] Based on the evaluation results, the server generates an optimal training plan for the user. This training plan includes specific training instructions that address the user's weaknesses as identified by comparison with past training data. A database management system (DBMS) is used in this generation process to compare the current data with historical data.
[0502] Finally, the terminal provides the user with the generated practice plan. The robot provides real-time voice feedback using its built-in speaker and visually displays instructions and analysis results on its screen.
[0503] For example, if a user wants to improve the accuracy of a particular pitch, the robot can provide real-time feedback based on the recorded data, offering specific advice such as, "Let's aim for a slightly higher pitch." In this process, the prompt message would include instructions such as, "Analyze the following audio data and evaluate the pitch. Based on the analysis results, generate feedback and return specific advice regarding the pitch."
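The kind of per-note advice described above can be sketched as follows. This is only an illustration under stated assumptions: the 20-cent tolerance, the function names, and the two messages other than the quoted "Let's aim for a slightly higher pitch." are hypothetical, and the detected pitch values are assumed to come from an upstream analysis step.

```python
import math

def pitch_advice(sung_hz, target_hz, tolerance_cents=20.0):
    """Map the signed pitch deviation (in cents) to a short spoken hint."""
    cents = 1200.0 * math.log2(sung_hz / target_hz)
    if cents < -tolerance_cents:
        return "Let's aim for a slightly higher pitch."
    if cents > tolerance_cents:
        return "Let's aim for a slightly lower pitch."
    return "Nice, that pitch is on target."

def feedback_stream(frames, target_hz):
    """Yield one (timestamp, advice) pair per incoming (timestamp, pitch) frame."""
    for timestamp_s, sung_hz in frames:
        yield timestamp_s, pitch_advice(sung_hz, target_hz)
```

Because `feedback_stream` is a generator, each hint can be spoken as soon as its frame arrives rather than after the whole recording has been processed.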
[0504] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0505] Step 1:
[0506] The user records their voice using the robot's built-in microphone. The input is the user's voice, and the output is saved as audio data. The robot converts the audio data into a digital format and prepares it for analysis.
[0507] Step 2:
[0508] The device sends the recorded audio data to a server for analysis by a generative AI model. The input is the audio data, and the AI model evaluates pitch, rhythm, and voice quality. The output is the analyzed data, with each metric represented numerically. Specifically, frameworks such as TensorFlow and PyTorch are used to extract features from the audio data.
[0509] Step 3:
[0510] The server generates an optimal training plan for the user based on the analysis results. The input consists of the analyzed data and past training data, and the output is a customized training plan. Data processing involves comparison with similar data sets, and individual training instructions are incorporated.
[0511] Step 4:
[0512] The device provides the user with a generated practice plan and offers audio and visual feedback. The input is the practice plan, and the output is notification information for the user. Specifically, it provides audio feedback through its built-in speaker and displays the practice content on the screen.
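The comparison with past training data in Step 3 could look roughly like the sketch below. It assumes an in-memory SQLite database as the DBMS, a hypothetical `scores` table, and metric scores on an arbitrary numeric scale; none of these details come from the disclosure.

```python
import sqlite3

def open_history():
    """Create an in-memory history store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scores (user_id TEXT, session INTEGER, metric TEXT, value REAL)"
    )
    return conn

def record_session(conn, user_id, session, metrics):
    """Store one row per metric for a finished practice session."""
    conn.executemany(
        "INSERT INTO scores VALUES (?, ?, ?, ?)",
        [(user_id, session, name, value) for name, value in metrics.items()],
    )

def weakest_metric(conn, user_id, current):
    """Return the metric that fell furthest below the user's past average, if any."""
    worst_name, worst_delta = None, 0.0
    for name, value in current.items():
        row = conn.execute(
            "SELECT AVG(value) FROM scores WHERE user_id = ? AND metric = ?",
            (user_id, name),
        ).fetchone()
        if row[0] is None:
            continue  # no history for this metric yet
        delta = value - row[0]
        if delta < worst_delta:
            worst_name, worst_delta = name, delta
    return worst_name
```

The returned metric name can then be used to pick the individual training instructions that go into the plan; `None` means no metric has regressed relative to the stored history.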
[0513] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0514] This invention provides a technology that combines an emotion engine with a voice analysis system that supports the improvement of users' singing abilities. The system acquires the user's singing voice, generates evaluations and practice plans based on that voice data, and at the same time identifies the user's emotional state and provides feedback and adjusts the plan accordingly.
[0515] Users record their singing voices using a dedicated application. This recording data is temporarily stored on the device and then sent to a server. The server analyzes the received data using an AI model to evaluate singing ability. It quantifies pitch, rhythm, voice quality, etc., and generates a numerical score.
[0516] Furthermore, the server analyzes changes in the user's voice tone, pitch, and speed, allowing the emotion engine to identify the user's emotional state. Based on this emotional data, it understands the user's mood while singing. This information is useful in generating practice plans, making it possible to suggest practice content tailored to specific emotional states.
[0517] Taking into account the emotional state identified by the emotion engine, the server generates a personalized practice plan. For example, if the user is determined to be stressed, it can provide vocal exercises to promote relaxation or training using familiar music. It can also provide positive feedback to boost the user's motivation.
[0518] Finally, the server sends the generated practice plan and emotion-based feedback to the device. The device displays these to the user on the application, providing guidance for the user to actually train. In this way, the user can train in a way that is optimized for them, receiving support not only for improving their singing ability but also emotionally.
[0519] For example, if a user uses this system to calm their nerves before a presentation, the emotion engine can detect their anxiety and suggest practice methods that promote relaxation. This allows the user to effectively improve both their mental state and their skills.
[0520] The following describes the processing flow.
[0521] Step 1:
[0522] The user launches a dedicated application and sings a selected song using the recording function. After finishing singing, the user presses the stop button, saving the singing voice as digital audio data to the device.
[0523] Step 2:
[0524] The terminal prepares to send the stored audio data to the server and transmits the data to the server via a communication protocol.
[0525] Step 3:
[0526] The server begins processing the received audio data. Using an AI model, it extracts indicators of pitch, rhythm, and vocal quality, and generates a numerical score related to singing ability.
[0527] Step 4:
[0528] The server further uses an emotion engine to identify the user's emotional state based on the tone, pitch, and speed of the voice extracted from the audio data. This emotional information is used as supplementary data for generating practice plans.
[0529] Step 5:
[0530] The server generates a personalized practice plan, taking into account the analysis results and the user's emotional information. This plan is optimized to improve specific singing techniques and is tailored to the user's mental state.
[0531] Step 6:
[0532] The generated practice plan and emotion-based feedback are sent from the server to the device. The device displays this information within the user interface, allowing the user to review the content.
[0533] Step 7:
[0534] Users review the displayed feedback and practice plan, and then follow the instructions to train. By repeating this process, users can improve their singing technique and emotional control.
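Steps 4 and 5 above can be illustrated with a deliberately simple rule. This is a sketch under assumptions, not the emotion identification model 59: the thresholds, the two labels "tense" and "calm", and the function names are hypothetical, and the pitch track and delivery speed are assumed to have been extracted already.

```python
from statistics import mean, pstdev

def estimate_emotion(pitch_track_hz, syllables_per_second):
    """Label the take 'tense' when pitch is unstable and delivery is fast."""
    # Relative pitch variability: standard deviation divided by mean pitch.
    variability = pstdev(pitch_track_hz) / mean(pitch_track_hz)
    if variability > 0.10 and syllables_per_second > 5.0:
        return "tense"
    return "calm"

def adjust_plan(base_plan, emotion):
    """Prepend relaxation-oriented items when the user appears tense."""
    if emotion == "tense":
        return ["Breathing and relaxation warm-up", "Sing a familiar song"] + base_plan
    return list(base_plan)
```

For example, `adjust_plan(["Pitch drills"], estimate_emotion([200, 260, 180, 250], 6.0))` puts the relaxation items ahead of the technical drill, mirroring the relaxation and familiar-music suggestions described earlier.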
[0535] (Example 2)
[0536] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0537] Conventional voice analysis systems can evaluate a user's singing ability, but they have difficulty providing training plans that take into account the user's emotional state. As a result, they cannot provide training that is optimal for the user, and the user's singing ability cannot be improved effectively.
[0538] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0539] In this invention, the server includes a receiving means for acquiring voice information from the user, an evaluation means for analyzing the received voice information and evaluating singing ability, and an identification means for analyzing the voice information and identifying the user's emotional state. This makes it possible to provide a more personalized training plan that takes into account the user's emotional state in addition to their singing ability.
[0540] "Voice information" refers to data obtained by recording the user's voice and expressing it in a format that the system can analyze.
[0541] A "user" is an individual who uses a voice analysis system with the aim of improving their own singing ability.
[0542] "Receiving means" refers to a method or device for the system to acquire voice information transmitted by a user.
[0543] "Analysis" refers to the analytical process performed by the system to evaluate and identify singing ability and emotional state based on acquired audio information.
[0544] "Evaluation means" refers to a method or apparatus for analyzing audio information and expressing singing ability as a numerical value or indicator.
[0545] "Generation means" refers to a method or apparatus for creating a training plan suitable for the user based on the analysis results.
[0546] "Means of delivery" refers to a method or device for presenting the generated training plan to the user and using it as guidance for the training.
[0547] "Identification means" refers to a method or device for identifying a user's emotional state based on voice information.
[0548] "Adjustment means" refers to a method or apparatus for optimizing a training plan based on identified emotional states.
[0549] "Emotional state" refers to the psychological or emotional condition determined based on the user's vocal characteristics.
[0550] This invention relates to a voice analysis system that analyzes a user's singing ability and emotional state and provides an individualized training plan. The system operates primarily with the involvement of a server, a terminal, and the user.
[0551] First, users install a dedicated application on their device and use the recording function to collect their singing voice. The recorded audio data is temporarily stored on the device and then sent to a server via a communication network. At this time, the device appropriately encodes the audio information and sends the data to the server using a secure protocol.
[0552] The server first inputs the received audio information into a generative AI model. The model analyzes technical elements such as pitch, rhythm, and tone quality, and quantifies and evaluates the user's singing ability. This evaluation helps identify areas where the user should improve. Following the evaluation, the tone, pitch, and speed of the audio information are analyzed, and an emotion engine is used to identify the user's current emotional state.
[0553] Next, the server generates an optimized training plan for each user based on the analysis results and emotional state. For example, if tension is detected, it provides a plan that includes exercises to promote relaxation. This training plan helps users improve their singing ability efficiently and continuously.
[0554] Finally, the generated training plan and feedback are sent to the device and made available to the user within the application. The user can then refer to this information to perform appropriate training at home or elsewhere.
[0555] As a concrete example, let's look at an example of a prompt message.
[0556] "Evaluate this singing voice, analyze the user's emotions based on their tone and pitch, and then propose an optimal voice training plan tailored to those emotions."
[0557] In this way, the system allows users to achieve not only technical singing skills but also emotional growth.
[0558] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0559] Step 1:
[0560] The user records their singing voice using a dedicated application. The recorded audio information is temporarily stored on the device. At this stage, the input is the user's raw singing voice, which is stored on the device as digital audio data. Specifically, the audio is captured through the device's microphone and encoded into WAV or MP3 format data.
[0561] Step 2:
[0562] The device sends the stored audio data to the server. During transmission, the audio data is appropriately encoded and encrypted before being sent to the server. The input for this step is the encoded audio data, and the output is the data as it arrives at the server. Specifically, network protocols (e.g., HTTPS) are used to ensure data security during transmission.
[0563] Step 3:
[0564] The server analyzes the received audio data. Using the received audio data as input, the server's AI model analyzes pitch, rhythm, and sound quality, generating a numerical score. The output provides a detailed numerical evaluation for each metric. Specifically, the AI model uses deep learning techniques to analyze the data and evaluate singing skills.
[0565] Step 4:
[0566] The server uses an emotion engine to identify the user's emotional state from the voice data. The input is changes in tone, pitch, and speed of the voice, and the output is the identified emotional state of the user. Specifically, an emotion analysis algorithm using voice features is employed.
[0567] Step 5:
[0568] The server generates an optimal training plan for the user based on the analysis results and emotional state. The input is the singing skill evaluation and emotional identification results obtained so far, and the output is an individualized training plan. Specifically, the generative AI model combines each piece of data to construct the training content.
[0569] Step 6:
[0570] The server sends the generated training plan and feedback to the terminal. The input here is the generated training plan and feedback, and the output is the same data as received intact by the terminal. Specifically, the data is encrypted and transmitted using a communication protocol.
[0571] Step 7:
[0572] The application displays to the user the training plan and feedback that the device has received. The input is the data received from the server, and the output is visual guidance that the user can view and confirm. Specifically, the application provides buttons and menus, designed with UI / UX in mind, that let the user display and operate on this information.
[0573] Step 8:
[0574] The user actually performs the training. The input is the training plan and feedback provided on the device, and the output is the improved singing skills and emotional state achieved through the training. Specifically, the user follows instructions, practices singing, and records their own progress.
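The device side of Steps 1 and 2 (WAV encoding and preparing an HTTPS upload) might be sketched as below. The endpoint URL, the 16 kHz mono 16-bit format, and the function names are assumptions for illustration; the request is only constructed, not sent.

```python
import io
import struct
import urllib.request
import wave

def encode_wav(samples, sample_rate=16000):
    """Encode float samples in [-1, 1] as 16-bit mono PCM WAV bytes."""
    pcm = struct.pack(
        f"<{len(samples)}h",
        *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

def build_upload_request(wav_bytes, url="https://example.com/api/voice"):
    """Build (but do not send) an HTTPS POST; the URL is a placeholder."""
    if not url.startswith("https://"):
        raise ValueError("audio must be sent over HTTPS")
    return urllib.request.Request(
        url,
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would perform the actual upload; refusing non-HTTPS URLs keeps the transfer encrypted in transit, in line with Step 2.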
[0575] (Application Example 2)
[0576] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0577] Traditional voice analysis systems focus on evaluating a user's singing ability, but they do not provide feedback or practice plans that take into account the user's emotional state. Therefore, they fail to maximize user psychological satisfaction and practice effectiveness, making it difficult to provide a learning experience optimized for each individual user.
[0578] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0579] In this invention, the server includes an acquisition means for acquiring audio data from the user, an evaluation means for analyzing the acquired audio data and evaluating singing ability, a generation means for generating a practice plan suitable for the user based on the analysis results, an identification means for identifying the user's emotional state and providing feedback according to that state, and a provision means for providing the generated practice plan and feedback to the user. This makes it possible not only to improve the user's singing ability but also to provide an individualized practice experience that takes into account their emotional state.
[0580] "Voice data" refers to information obtained by converting a user's speech into digital signals.
[0581] "Acquisition means" refers to a method or apparatus for collecting voice data from a user.
[0582] "Analysis" is the process of extracting specific indicators and features based on audio data and evaluating them.
[0583] "Evaluation means" refers to a method or apparatus for analyzing acquired audio data to quantify or index singing ability.
[0584] "Generation means" refers to a method or apparatus for creating a practice plan suitable for the user based on the analysis results.
[0585] "Identification means" refers to a method or apparatus for identifying a user's emotional state from audio data.
[0586] "Feedback" is the process of returning information to the user based on the results of voice analysis and their emotional state.
[0587] "Provision means" refers to a method or device for presenting the generated practice plan and feedback to the user.
[0588] The system that realizes this invention works by having the user record their singing voice using a dedicated application and send the data to a server. The terminal first uses a microphone to record the voice and temporarily saves the recorded data to local storage. Then, it uploads this data to the server. The server uses an AI model, built with a framework such as TensorFlow, to analyze singing ability from the received voice data. Key indicators such as pitch, rhythm, and vocal characteristics are calculated.
[0589] Furthermore, the server uses an emotion engine to identify the user's emotional state based on the voice data. This emotion engine analyzes changes in voice tone, pitch, and speed to determine the user's psychological state. Emotion analysis is performed using technologies such as Amazon Rekognition.
[0590] Based on this data, the server generates a personalized practice plan. This plan takes into account the analyzed singing skills and emotional state, and incorporates specific skill-building training and emotional support.
[0591] Finally, the generated practice plan and feedback are sent to the device. Users can review this on their smartphone or tablet and practice appropriately. For example, if the user is feeling nervous, singing exercises to help them relax will be provided.
[0592] For example, if a user wants to alleviate nervousness before a presentation, the emotion engine will detect their level of tension and suggest relaxing exercises. This allows the user to improve both their mental state and singing skills. An example of a prompt to the generative AI model would be, "Please suggest relaxing singing exercises. The user is currently feeling nervous."
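A prompt of the kind quoted above could be assembled from the analysis results roughly as follows. This is a hedged sketch: the function name, the set of emotion labels treated as calling for relaxation, and the 0 to 100 score scale are assumptions, not details given in the disclosure.

```python
# Emotion labels that trigger a relaxation-oriented request (assumption).
RELAXATION_STATES = {"nervous", "tense", "stressed"}

def build_prompt(emotion, scores):
    """Compose a text prompt for the generative AI model from emotion and scores."""
    if emotion in RELAXATION_STATES:
        request = "Please suggest relaxing singing exercises."
    else:
        request = "Please suggest singing exercises."
    lines = [request, f"The user is currently feeling {emotion}."]
    for metric, value in sorted(scores.items()):
        lines.append(f"{metric} score: {value:.0f}/100")
    return "\n".join(lines)
```

Calling `build_prompt("nervous", {"pitch": 62, "rhythm": 80})` reproduces the two sentences of the example prompt and appends one line per metric so the model can weigh technique alongside emotional state.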
[0593] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0594] Step 1:
[0595] The user records audio using their device. The input is the user's raw singing voice, and the output is digitized audio data. The audio is captured using a microphone and temporarily stored in local storage.
[0596] Step 2:
[0597] The terminal sends audio data to the server. The input is the audio data stored on the terminal, and the output is the audio data transferred to the server. A communication module is used to transmit the data.
[0598] Step 3:
[0599] The server analyzes the received audio data using an AI model. The input is the audio data sent to the server, and the output is evaluation metrics such as pitch, rhythm, and voice quality. TensorFlow and similar tools are used to analyze the data and generate numerical scores.
[0600] Step 4:
[0601] The server analyzes the tone, pitch, and speed of the audio data to identify the user's emotional state. The input is audio data, and the output is the result of the emotional state identification. This is achieved using an emotion engine, such as Amazon Rekognition.
[0602] Step 5:
[0603] The server generates individual practice plans based on the singing ability evaluation and the emotional state. The inputs are the analyzed evaluation metrics and the emotional state, and the output is a customized practice plan. Optimization processing is performed by the generation means.
[0604] Step 6:
[0605] The server sends the generated practice plan and emotion-based feedback to the terminal. The input is the practice plan and feedback content, and the output is the information presented to the terminal. Data is transmitted via a communication protocol.
[0606] Step 7:
[0607] The user checks feedback from their device and practices the provided exercises. The input is the exercise plan and feedback displayed on the device, and the output is the user's actions. Feedback is displayed and guidance is provided to the user using the device interface.
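One way to picture the data exchanged in Steps 5 to 7 is a small serializable plan structure. The field names, class names, and the choice of JSON are assumptions made for this sketch; the disclosure does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class Exercise:
    title: str
    minutes: int

@dataclass
class PracticePlan:
    user_id: str
    emotional_state: str
    feedback: str
    exercises: list = field(default_factory=list)

def to_payload(plan):
    """Serialize the plan on the server before sending it to the terminal."""
    return json.dumps(asdict(plan), ensure_ascii=False)

def from_payload(payload):
    """Rebuild the plan on the terminal so the interface can display it."""
    raw = json.loads(payload)
    raw["exercises"] = [Exercise(**e) for e in raw["exercises"]]
    return PracticePlan(**raw)
```

A round trip such as `from_payload(to_payload(plan))` returns an equal `PracticePlan`, which is a convenient check that nothing is lost between server and terminal.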
[0608] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0609] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
[0610] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0611] [Fourth Embodiment]
[0612] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0613] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0614] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0615] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0616] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0617] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0618] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0619] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0620] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0621] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0622] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0623] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0624] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0625] This invention is a voice analysis system designed to help users improve their singing ability. The system acquires voice data and analyzes pitch, rhythm, and voice quality to provide the user with an optimal practice plan.
[0626] First, the user records their singing voice using a dedicated application. This application runs on smartphones and PCs, and allows users to input audio data through its user interface. Once recording is complete, the device receives this audio data and sends it to a server via the internet.
[0627] The server analyzes the received audio data based on an AI model. The AI model evaluates various metrics in the audio and generates analysis results for pitch, rhythm, and voice quality. These results are used to quantify the user's singing ability.
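As a rough illustration of how analysis results for pitch and rhythm might be turned into numbers, the following sketch scores pitch deviation in cents and note-onset timing error. It is a hand-written stand-in, not the AI model of this disclosure: the function names, the 100-cent and 0.25-second tolerances, and the 0 to 100 scale are all assumptions, and voice quality is omitted.

```python
import math


def cents_off(sung_hz: float, target_hz: float) -> float:
    """Signed pitch deviation in cents (100 cents = one semitone)."""
    return 1200.0 * math.log2(sung_hz / target_hz)


def pitch_score(sung_hz, target_hz, tolerance_cents=100.0):
    """Map mean absolute deviation to 0-100; tolerance_cents is an assumed cutoff."""
    deviations = [abs(cents_off(s, t)) for s, t in zip(sung_hz, target_hz)]
    mean_dev = sum(deviations) / len(deviations)
    return max(0.0, 100.0 * (1.0 - mean_dev / tolerance_cents))


def rhythm_score(sung_onsets_s, target_onsets_s, tolerance_s=0.25):
    """Map mean absolute onset error (seconds) to 0-100; tolerance_s is assumed."""
    errors = [abs(s - t) for s, t in zip(sung_onsets_s, target_onsets_s)]
    mean_err = sum(errors) / len(errors)
    return max(0.0, 100.0 * (1.0 - mean_err / tolerance_s))
```

Expressing every metric on a common scale is one simple way to make the later comparison with past data straightforward.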
[0628] Next, the server generates an individualized practice plan based on the analysis results. By comparing the results with past data, the plan includes specific exercises and training content designed to address the user's weaknesses and further develop their strengths. The generated plan is sent from the server to the terminal and provided to the user.
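The plan generation described above could be sketched as follows, assuming each metric has already been scored on a common scale. The exercise catalogue, the dictionary keys, and the rule of focusing on the single weakest metric are hypothetical choices for illustration and are not taken from this disclosure.

```python
# Hypothetical exercise catalogue; the entries are placeholders.
EXERCISES = {
    "pitch": "interval matching against a reference tone",
    "rhythm": "clapping and singing along with a metronome",
    "voice_quality": "breath-support and resonance drills",
}


def build_practice_plan(current: dict, past: dict) -> dict:
    """Pick the weakest metric as the focus and report change since the past session."""
    weakest = min(current, key=current.get)
    strongest = max(current, key=current.get)
    # A metric with no past value is reported as unchanged.
    change = {m: current[m] - past.get(m, current[m]) for m in current}
    return {
        "focus": weakest,
        "exercise": EXERCISES[weakest],
        "strength": strongest,
        "change_since_last": change,
    }


plan = build_practice_plan(
    {"pitch": 55, "rhythm": 80, "voice_quality": 70},
    {"pitch": 50, "rhythm": 85, "voice_quality": 70},
)
```

Here the plan would focus on pitch while noting that rhythm, although the strongest metric, has slipped since the previous session.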
[0629] The device visually displays the received practice plan to the user within the application. The user can review this feedback and improve their singing technique by incorporating the suggested practice into their daily life. This allows users to efficiently improve their singing ability without being limited by time or location.
[0630] For example, if a user wants to improve their pitch accuracy in karaoke, they can use this system to analyze their singing voice and receive a pitch-focused practice plan. This allows the user to learn a specific pitch practice approach and focus on their own challenges.
[0631] The following describes the processing flow.
[0632] Step 1:
[0633] The user launches a dedicated application and selects a song to sing. The user then presses the record button on the device to record their singing voice.
[0634] Step 2:
[0635] Once recording is complete, the device temporarily stores the recorded audio data in digital format. The data is saved at the precision and sample rate required for later analysis.
[0636] Step 3:
[0637] The device generates a communication request to send the stored audio data to the server and sends this data to the server via the internet.
[0638] Step 4:
[0639] The server inputs the received audio data into an AI model for analysis. The AI model analyzes the audio data and extracts singing indicators such as pitch, rhythm, and vocal quality.
[0640] Step 5:
[0641] The server quantifies the user's singing ability based on the analysis results of the AI model. The resulting numerical score objectively evaluates the user's performance.
[0642] Step 6:
[0643] The server compares the evaluation results with past data and generates a practice plan tailored to the user's characteristics. This plan includes specific practice exercises and recommended songs.
[0644] Step 7:
[0645] The server sends the generated practice plan and evaluation results to the terminal.
[0646] Step 8:
[0647] The device displays the received practice plan and evaluation results in the user interface, allowing the user to review the content. The user then uses this information to begin training and aim to improve their singing ability.
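The exchange between the terminal and the server in Steps 3 and 7 might be represented with messages such as the following. This is only a sketch: the JSON encoding, the field names, and the base64 transport of the audio are assumptions, since this disclosure does not specify a message format.

```python
import base64
import json
from dataclasses import dataclass, asdict


@dataclass
class AnalysisRequest:    # terminal -> server (Step 3)
    user_id: str
    song_id: str
    sample_rate_hz: int
    audio_b64: str        # recorded audio, base64-encoded for JSON transport


@dataclass
class AnalysisResponse:   # server -> terminal (Step 7)
    scores: dict
    practice_plan: list


def make_request(user_id, song_id, sample_rate_hz, audio: bytes) -> AnalysisRequest:
    encoded = base64.b64encode(audio).decode("ascii")
    return AnalysisRequest(user_id, song_id, sample_rate_hz, encoded)


def encode(message) -> str:
    """Serialize either message type to a JSON string."""
    return json.dumps(asdict(message))


def decode_response(payload: str) -> AnalysisResponse:
    return AnalysisResponse(**json.loads(payload))
```

Keeping the evaluation results and the practice plan in one response mirrors Step 7, in which both are sent to the terminal together.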
[0648] (Example 1)
[0649] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0650] The present invention aims to provide a method for users to effectively improve their voice and singing skills. Current technology makes it difficult to generate training plans based on the individual characteristics of each user, and standard plans are often applied. Therefore, there is a challenge in that efficient training that takes into account the weaknesses and strengths of individual users cannot be achieved.
[0651] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0652] In this invention, the server includes a receiving means for acquiring voice information from a user, an evaluation means for analyzing the received voice information and evaluating singing skills, and a generating means for generating a training plan suitable for the user based on the analysis results. This makes it possible to provide a training plan optimized for each individual user.
[0653] "Voice information" refers to data that records the content of a user's speech and represents it in digital format.
[0654] A "user" is an individual who wishes to have their voice analyzed and trained through this system.
[0655] "Receiving means" refers to a device or program that has the function of acquiring voice information transmitted by a user and incorporating it into the system.
[0656] "Analysis" is the process of extracting and evaluating indicators such as pitch, rhythm, and sound quality from received audio information using a specific algorithm.
[0657] "Singing ability" is a general term for the musical expressiveness and technique possessed by the user.
[0658] "Evaluation means" refers to a device or program that has the function of objectively evaluating a user's singing ability based on data obtained through analysis.
[0659] "Generation means" refers to a device or program that has the function of constructing individual training plans based on evaluation results.
[0660] A "training plan" is a learning program that combines specific exercises and practice content with the aim of improving the user's singing skills.
[0661] "Means of delivery" refers to a device or program that has the function of presenting the generated training plan to the user visually or audibly.
[0662] A "generative artificial intelligence model" is a model built using machine learning algorithms and used for analyzing speech information.
[0663] This invention is a voice analysis system designed to help users improve their singing skills. This system supports the improvement of musical expressiveness by acquiring and analyzing the user's voice information and providing an appropriate training plan. Embodiments of this invention are described below.
[0664] Users record audio information using a dedicated application installed on their smartphone or PC. This application features a user-friendly interface, allowing users to easily record audio and send it to the system with the press of a button. The recorded audio information is transmitted from the device to the server via the internet. During transmission, the data is encrypted to ensure communication security.
[0665] The server stores the received audio information and uses a generative AI model to analyze pitch, rhythm, and tone quality. This AI model utilizes speech recognition technology to extract features from the audio data and evaluate singing ability. After analysis, the server generates a training plan tailored to the user's characteristics based on the evaluation results. This training plan includes specific training content to improve musical skills and is optimized considering the user's past data and similar datasets.
[0666] The generated training plan is sent from the server to the terminal and presented to the user visually within the application. Based on this feedback, the user can practice regularly, enabling them to efficiently improve their singing skills.
[0667] For example, if a user wants to improve their pitch accuracy, they can use the system to analyze their voice information and receive a pitch-specific training plan. By practicing according to this plan, the user can focus on specific musical challenges.
[0668] An example of a prompt would be, "Analyze my singing voice and create a training plan that clearly identifies areas for improvement." Through this prompt, the results of the speech analysis performed by the generative AI model can be optimally utilized.
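One way such a prompt might be assembled together with the analysis results is sketched below. The `build_prompt` helper, the score values, and the layout of the prompt text are illustrative assumptions; only the instruction sentence comes from the example above.

```python
def build_prompt(scores: dict, instruction: str) -> str:
    """Combine the instruction with one line per analysis metric."""
    lines = [instruction, "", "Analysis results (0-100, higher is better):"]
    for metric, value in sorted(scores.items()):
        lines.append(f"- {metric}: {value}")
    return "\n".join(lines)


INSTRUCTION = (
    "Analyze my singing voice and create a training plan "
    "that clearly identifies areas for improvement."
)
# Placeholder scores, not measured data.
prompt = build_prompt({"pitch": 62, "rhythm": 81, "tone quality": 74}, INSTRUCTION)
```

Listing the metrics in a fixed order keeps the prompt reproducible for the same input.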
[0669] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0670] Step 1:
[0671] The user launches a dedicated application and records their singing using the microphone on their smartphone or PC. The process begins by pressing the record button, and ends when the user taps the end button. During this step, the user's voice information is converted from an analog signal to a digital signal and saved to the device.
[0672] Step 2:
[0673] The terminal sends the recorded audio information file to the server via the internet. During transmission, the audio information is encrypted to ensure it reaches the server securely. The input for this step is a digital audio file, and the output is stored as secure data on the server.
[0674] Step 3:
[0675] The server decodes the received audio information in preparation for analysis and inputs it into a generative AI model. This model analyzes the audio information and generates indices for pitch, rhythm, and sound quality. The input is the decoded digital audio information, and the output is the analyzed singing skill index data.
[0676] Step 4:
[0677] Based on the analysis results, the server generates a training plan tailored to the user. It designs individually customized training plans using evaluation metrics. This process references historical user data and relevant datasets. The input is the analysis metric data, and the output is the generated training plan.
[0678] Step 5:
[0679] The server sends the generated training plan to the terminal. This plan is converted into a format that can be easily displayed on the user page or within the application. The transmitted data is encrypted to ensure the security of the communication. The input is the training plan data, and the output is the training plan stored on the terminal.
[0680] Step 6:
[0681] The device visually displays the received training plan to the user through the application interface. The user can perform daily training by reviewing the training content on the screen and following the instructions. The input for this step is the training plan stored on the device, and the output is the training that the user actually carries out.
[0682] (Application Example 1)
[0683] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0684] In music and singing training, there is a challenge in obtaining real-time feedback tailored to individual vocal characteristics and weaknesses. Especially in self-study settings, a lack of specialized knowledge and training is common, hindering effective practice. Therefore, there is a need to provide means to effectively improve singing technique.
[0685] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0686] In this invention, the server includes a receiving means for acquiring voice data from the user, an evaluation means for analyzing the received voice data and evaluating the voice technique, and a generating means for generating a practice plan suitable for the user based on the analysis results. This enables the robot to record the user's vocalizations on the spot and provide evaluations in real time, thereby allowing for effective and continuous improvement of singing technique.
[0687] "Voice data" is a collection of digital signals that are recordings of the user's speech.
[0688] A "user" is an individual who wishes to improve their singing technique using this invention.
[0689] "Receiving means" refers to the process or device used to acquire audio data from the user.
[0690] "Evaluation means" refers to the process of analyzing received audio data and expressing voice skills using numerical values or indicators based on the results.
[0691] "Generating means" refers to the process of creating individual practice plans based on evaluation results.
[0692] A "practice plan" includes guidelines for individually customized exercises and training designed to improve the user's singing technique.
[0693] "Means of delivery" refers to the processes and devices used to communicate the generated practice plan to the user.
[0694] A "robot" refers to an autonomous machine or device that provides voice recording and feedback in real time.
[0695] "Real-time" refers to a process where data is processed and feedback is provided within a timeframe close to the moment it is acquired.
[0696] The system for realizing this invention requires a series of processes for acquiring, analyzing, evaluating, and providing feedback on audio data.
[0697] First, the robot acquires the user's voice data. The robot uses its built-in microphone to record the user's speech. This recorded data is then sent in real time to a locally built-in AI model.
[0698] The server analyzes the audio data received from the robot. This analysis utilizes generative AI models based on libraries such as TensorFlow and PyTorch. These models evaluate pitch, rhythm, and voice quality from the audio data and quantify the results.
[0699] Based on the evaluation results, the server generates an optimal training plan for the user. By comparing the results with past training data, the plan includes specific training instructions that address the user's weaknesses. A database management system (DBMS) is used in this generation process to compare the current data with historical data.
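The comparison of current data with historical data could look like the following minimal sketch, which uses SQLite (via Python's standard `sqlite3` module) as a stand-in DBMS. The table layout, column names, and sample rows are assumptions made for illustration; this disclosure does not name a particular database or schema.

```python
import sqlite3

# In-memory database with a hypothetical "sessions" table of past scores.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (user_id TEXT, metric TEXT, score REAL)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [("u1", "pitch", 60.0), ("u1", "pitch", 70.0),
     ("u1", "rhythm", 80.0), ("u1", "rhythm", 90.0)],
)


def compare_with_history(conn, user_id: str, current: dict) -> dict:
    """Return current score minus the user's historical average, per metric."""
    rows = conn.execute(
        "SELECT metric, AVG(score) FROM sessions WHERE user_id = ? GROUP BY metric",
        (user_id,),
    ).fetchall()
    history = dict(rows)
    return {m: current[m] - history[m] for m in current if m in history}


delta = compare_with_history(conn, "u1", {"pitch": 75.0, "rhythm": 80.0})
```

A positive difference indicates improvement over the user's average, and a negative one marks a metric that the training plan may need to revisit.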
[0700] Finally, the terminal provides the user with the generated practice plan. The robot provides real-time voice feedback using its built-in speaker and visually displays instructions and analysis results on its screen.
[0701] For example, if a user wants to improve the accuracy of a particular pitch, the robot can provide real-time feedback based on the recorded data, offering specific advice such as, "Let's aim for a slightly higher pitch." In this process, the prompt message would include instructions such as, "Analyze the following audio data and evaluate the pitch. Based on the analysis results, generate feedback and return specific advice regarding the pitch."
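The kind of advice quoted above could, for instance, be derived from a measured pitch deviation with a simple rule such as the one below. This is a hypothetical rule-based sketch rather than the generative AI model's behavior: the 20-cent and 50-cent thresholds and all wording other than the quoted sentence are assumptions.

```python
def pitch_advice(cents_off: float, tolerance: float = 20.0) -> str:
    """cents_off < 0 means the sung note is flat; > 0 means it is sharp."""
    if abs(cents_off) <= tolerance:
        return "Nice, that pitch is on target."
    size = "slightly" if abs(cents_off) <= 50.0 else "noticeably"
    direction = "higher" if cents_off < 0 else "lower"
    return f"Let's aim for a {size} {direction} pitch."
```

For a note sung 35 cents flat, this returns the sentence used in the example above.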
[0702] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0703] Step 1:
[0704] The user records their voice using the robot's built-in microphone. The input is the user's voice, and the output is saved as audio data. The robot converts the audio data into a digital format and prepares it for analysis.
[0705] Step 2:
[0706] The device sends the recorded audio data to the server for analysis by a generative AI model. The input is audio data, and the AI model evaluates pitch, rhythm, and voice quality. The output is the analyzed data, with each metric represented numerically. Specifically, frameworks such as TensorFlow and PyTorch are used to extract features from the audio data.
[0707] Step 3:
[0708] The server generates an optimal training plan for the user based on the analysis results. The input consists of the analyzed data and past training data, and the output is a customized training plan. Data processing involves comparison with similar data sets, and individual training instructions are incorporated.
[0709] Step 4:
[0710] The device provides the user with a generated practice plan and offers audio and visual feedback. The input is the practice plan, and the output is notification information for the user. Specifically, it provides audio feedback through its built-in speaker and displays the practice content on the screen.
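To make the feature extraction in Step 2 more concrete, the sketch below estimates the pitch of one short audio frame by autocorrelation. This classical method is swapped in purely for illustration and is not the TensorFlow or PyTorch model mentioned above; the frame length, search range, and synthetic test tone are assumptions.

```python
import math


def estimate_pitch_hz(frame, sample_rate, fmin=50.0, fmax=400.0):
    """Return the frequency whose lag maximizes the frame's autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        corr = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag


SAMPLE_RATE = 8000
# Synthetic 0.1 s, 200 Hz tone standing in for one frame of recorded singing.
frame = [math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(800)]
pitch = estimate_pitch_hz(frame, SAMPLE_RATE)
```

Processing short frames one at a time is what would allow feedback to be returned close to the moment the voice is captured.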
[0711] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0712] This invention provides a technology that combines an emotion engine with a voice analysis system that supports the improvement of users' singing abilities. The system acquires the user's singing voice, generates evaluations and practice plans based on that voice data, and at the same time identifies the user's emotional state and provides feedback and adjusts the plan accordingly.
[0713] Users record their singing voices using a dedicated application. This recording data is temporarily stored on the device and then sent to a server. The server analyzes the received data using an AI model to evaluate singing ability. It quantifies pitch, rhythm, voice quality, etc., and generates a numerical score.
[0714] Furthermore, the server analyzes changes in the user's voice tone, pitch, and speed, allowing the emotion engine to identify the user's emotional state. Based on this emotional data, it understands the user's mood while singing. This information is useful in generating practice plans, making it possible to suggest practice content tailored to specific emotional states.
[0715] Taking into account the emotional state identified by the emotion engine, the server generates a personalized practice plan. For example, if the user is determined to be stressed, it can provide vocal exercises to promote relaxation or training using familiar music. It can also provide positive feedback to boost the user's motivation.
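The adjustment described above might be sketched as a small rule table applied to a base plan. The state labels, the added exercises, and the feedback wording are hypothetical placeholders; this disclosure does not prescribe specific rules.

```python
# Hypothetical adjustment rules keyed by the identified emotional state.
ADJUSTMENTS = {
    "stressed": ["slow breathing and humming warm-up", "sing a familiar song"],
    "discouraged": ["revisit an exercise the user already does well"],
}


def adjust_plan(base_plan: list, emotional_state: str) -> dict:
    """Prepend state-specific exercises and pick a matching feedback message."""
    extra = ADJUSTMENTS.get(emotional_state, [])
    exercises = extra + base_plan  # supportive items first, then skill work
    if extra:
        feedback = "You're making steady progress; take it easy today."
    else:
        feedback = "Keep going with today's plan."
    return {"exercises": exercises, "feedback": feedback}
```

States without a rule leave the base plan unchanged, so an unrecognized result from the emotion engine does not disrupt the skill-focused exercises.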
[0716] Finally, the server sends the generated practice plan and emotion-based feedback to the device. The device displays these to the user on the application, providing guidance for the user to actually train. In this way, the user can train in a way that is optimized for them, receiving support not only for improving their singing ability but also emotionally.
[0717] For example, if a user uses this system to calm their nerves before a presentation, the emotional engine can detect their anxiety and suggest practice methods to promote relaxation. This allows the user to effectively improve both their mental state and their skills.
[0718] The following describes the processing flow.
[0719] Step 1:
[0720] The user launches a dedicated application and sings a selected song using the recording function. After finishing singing, the user presses the stop button, saving the singing voice as digital audio data to the device.
[0721] Step 2:
[0722] The terminal prepares to send the stored audio data to the server and transmits the data to the server via a communication protocol.
[0723] Step 3:
[0724] The server begins processing the received audio data. Using an AI model, it extracts indicators of pitch, rhythm, and vocal quality, and generates a numerical score related to singing ability.
[0725] Step 4:
[0726] The server further uses an emotion engine to identify the user's emotional state based on the tone, pitch, and speed of the voice extracted from the audio data. This emotional information is used as supplementary data for generating practice plans.
[0727] Step 5:
[0728] The server generates a personalized practice plan, taking into account the analysis results and the user's emotional information. This plan is optimized to improve specific singing techniques and is tailored to the user's mental state.
[0729] Step 6:
[0730] The generated practice plan and emotion-based feedback are sent from the server to the device. The device displays this information within the user interface, allowing the user to review the content.
[0731] Step 7:
[0732] Users review the displayed feedback and practice plan, and then follow the instructions to train. By repeating this process, users can improve their singing technique and emotional control.
[0733] (Example 2)
[0734] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0735] Conventional voice analysis systems can evaluate a user's singing ability, but they have difficulty providing training plans that take into account the user's emotional state. As a result, they have been unable to provide optimal training for the user and therefore could not effectively improve the user's singing ability.
[0736] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0737] In this invention, the server includes a receiving means for acquiring voice information from the user, an evaluation means for analyzing the received voice information and evaluating singing ability, and an identification means for analyzing the voice information and identifying the user's emotional state. This makes it possible to provide a more personalized training plan that takes into account the user's emotional state in addition to their singing ability.
[0738] "Voice information" refers to data obtained by recording the user's voice and expressing it in a format that the system can analyze.
[0739] A "user" is an individual who uses a voice analysis system with the aim of improving their own singing ability.
[0740] "Receiving means" refers to a method or device for the system to acquire voice information transmitted by a user.
[0741] "Analysis" refers to the analytical process performed by the system to evaluate and identify singing ability and emotional state based on acquired audio information.
[0742] "Evaluation means" refers to a method or apparatus for analyzing audio information and expressing singing ability as a numerical value or indicator.
[0743] "Generation means" refers to a method or apparatus for creating a training plan suitable for the user based on the analysis results.
[0744] "Means of delivery" refers to a method or device for presenting the generated training plan to the user and using it as guidance for the training.
[0745] "Identification means" refers to a method or device for identifying a user's emotional state based on voice information.
[0746] "Adjustment means" refers to a method or apparatus for optimizing a training plan based on identified emotional states.
[0747] "Emotional state" refers to the psychological or emotional condition determined based on the user's vocal characteristics.
[0748] This invention relates to a voice analysis system that analyzes a user's singing ability and emotional state and provides an individualized training plan. The system operates primarily with the involvement of a server, a terminal, and the user.
[0749] First, users install a dedicated application on their device and use the recording function to collect their singing voice. The recorded audio data is temporarily stored on the device and then sent to a server via a communication network. At this time, the device appropriately encodes the audio information and sends the data to the server using a secure protocol.
[0750] The server first inputs the received audio information into a generative AI model. The model analyzes technical elements such as pitch, rhythm, and tone quality, and quantifies and evaluates the user's singing ability. This evaluation helps identify areas where the user should improve. Following the evaluation, the tone, pitch, and speed of the audio information are analyzed, and an emotion engine is used to identify the user's current emotional state.
[0751] Next, the server generates an optimized training plan for each user based on the analysis results and emotional state. For example, if tension is detected, it provides a plan that includes exercises to promote relaxation. This training plan helps users improve their singing ability efficiently and continuously.
[0752] Finally, the generated training plan and feedback are sent to the device and made available to the user within the application. The user can then refer to this information to perform appropriate training at home or elsewhere.
[0753] As a concrete example, let's look at an example of a prompt message.
[0754] "Evaluate this singing voice, analyze the user's emotions based on their tone and pitch, and then propose an optimal voice training plan tailored to those emotions."
[0755] In this way, the system allows users to achieve not only technical singing skills but also emotional growth.
[0756] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0757] Step 1:
[0758] The user records their singing voice using a dedicated application. The recorded audio information is temporarily stored on the device. At this stage, the input is the user's raw singing voice, which is stored on the device as digital audio data. Specifically, the audio is captured through the device's microphone and encoded into WAV or MP3 format data.
[0759] Step 2:
[0760] The device sends the stored audio data to the server. During transmission, the audio data is appropriately encoded and encrypted before being sent to the server. The input for this step is the encoded audio data, and the output is the data as it arrives at the server. Specifically, network protocols (e.g., HTTPS) are used to ensure data security during transmission.
[0761] Step 3:
[0762] The server analyzes the received audio data. Using the received audio data as input, the server's AI model analyzes pitch, rhythm, and sound quality, generating a numerical score. The output provides a detailed numerical evaluation for each metric. Specifically, the AI model uses deep learning techniques to analyze the data and evaluate singing skills.
[0763] Step 4:
[0764] The server uses an emotion engine to identify the user's emotional state from the voice data. The input is changes in tone, pitch, and speed of the voice, and the output is the identified emotional state of the user. Specifically, an emotion analysis algorithm using voice features is employed.
[0765] Step 5:
[0766] The server generates an optimal training plan for the user based on the analysis results and emotional state. The input is the singing skill evaluation and emotional identification results obtained so far, and the output is an individualized training plan. Specifically, the generative AI model combines each piece of data to construct the training content.
[0767] Step 6:
[0768] The server sends the generated training plan and feedback to the terminal. The input here is the generated training plan and feedback, and the output is the same data delivered intact to the terminal. Specifically, the data is encrypted and transmitted using a communication protocol.
[0769] Step 7:
[0770] The application displays the training plan and feedback received by the device to the user. The input is data received from the server, and the output is visual guidance that the user can see and confirm. Specifically, the application provides buttons and menus, designed according to UI/UX principles, that let users display and operate on the information.
[0771] Step 8:
[0772] The user actually performs the training. The input is the training plan and feedback provided on the device, and the output is the improved singing skills and emotional state achieved through the training. Specifically, the user follows instructions, practices singing, and records their own progress.
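Steps 1 and 2 above (encoding the recording as WAV and sending it over HTTPS) could be sketched with Python's standard library as follows. The sample rate, the 16-bit mono format, the synthetic tone, and the `https://example.com/upload` endpoint are placeholders assumed for illustration; the request is only constructed here, not sent.

```python
import io
import math
import struct
import urllib.request
import wave


def encode_wav(samples, sample_rate=16000) -> bytes:
    """Encode floats in [-1, 1] as 16-bit mono PCM WAV."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        pcm = struct.pack(f"<{len(samples)}h", *(int(s * 32767) for s in samples))
        wav.writeframes(pcm)
    return buf.getvalue()


def build_upload(wav_bytes: bytes, url: str) -> urllib.request.Request:
    """Prepare (but do not send) an HTTPS POST; TLS encrypts the body in transit."""
    if not url.startswith("https://"):
        raise ValueError("upload endpoint must use HTTPS")
    return urllib.request.Request(
        url, data=wav_bytes, method="POST",
        headers={"Content-Type": "audio/wav"},
    )


# Synthetic 0.1 s tone standing in for the user's recording.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
wav_bytes = encode_wav(tone)
request = build_upload(wav_bytes, "https://example.com/upload")  # placeholder URL
```

Rejecting non-HTTPS endpoints before building the request is one simple way to ensure that the audio is never transmitted unencrypted.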
[0773] (Application Example 2)
[0774] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0775] Traditional voice analysis systems focus on evaluating a user's singing ability, but they do not provide feedback or practice plans that take into account the user's emotional state. Therefore, they fail to maximize user psychological satisfaction and practice effectiveness, making it difficult to provide a learning experience optimized for each individual user.
[0776] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0777] In this invention, the server includes an acquisition means for acquiring audio data from the user, an evaluation means for analyzing the acquired audio data and evaluating singing ability, a generation means for generating a practice plan suitable for the user based on the analysis results, an identification means for identifying the user's emotional state and providing feedback according to that state, and a provision means for providing the generated practice plan and feedback to the user. This makes it possible not only to improve the user's singing ability but also to provide an individualized practice experience that takes into account their emotional state.
[0778] "Voice data" refers to information obtained by converting a user's speech into digital signals.
[0779] "Acquisition means" refers to a method or apparatus for collecting voice data from a user.
[0780] "Analysis" is the process of extracting specific indicators and features based on audio data and evaluating them.
[0781] "Evaluation means" refers to a method or apparatus for analyzing acquired audio data to quantify or index singing ability.
[0782] "Generation means" refers to a method or apparatus for creating a practice plan suitable for the user based on the analysis results.
[0783] "Identification means" refers to a method or apparatus for identifying a user's emotional state from audio data.
[0784] "Feedback" is the process of returning information to the user based on the results of voice analysis and their emotional state.
[0785] "Provision means" refers to a method or device for presenting the generated practice plan and feedback to the user.
[0786] The system that realizes this invention works by having the user record their singing voice using a dedicated application and send the data to a server. The terminal first uses a microphone to record the voice and temporarily saves the recorded data to local storage. Then, it uploads the data to the server. The server uses an AI model, built with a framework such as TensorFlow, to analyze the singing ability from the received voice data. Key indicators such as pitch, rhythm, and vocal characteristics are calculated.
[0787] Furthermore, the server uses an emotion engine to identify the user's emotional state based on the voice data. This emotion engine analyzes changes in voice tone, pitch, and speed to determine the user's psychological state. Emotion analysis is performed using technologies such as Amazon Rekognition.
[0788] Based on this data, the server generates a personalized practice plan. This plan takes into account the analyzed singing skills and emotional state, and incorporates specific skill-building training and emotional support.
[0789] Finally, the generated practice plan and feedback are sent to the device. Users can review this on their smartphone or tablet and practice appropriately. For example, if the user is feeling nervous, singing exercises to help them relax will be provided.
[0790] For example, if a user wants to alleviate nervousness before a presentation, the emotion engine will detect their level of tension and suggest relaxing exercises. This allows the user to improve both their mental state and singing skills. An example of a prompt to the generative AI model would be, "Please suggest relaxing singing exercises. The user is currently feeling nervous."
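The prompt in the example above could be assembled from the identified emotion along the following lines. This is a sketch only: the EXERCISE_STYLE table and the build_prompt function are hypothetical names introduced here, and only the "nervous" entry reproduces wording taken from the example prompt.

```python
# Hypothetical mapping from an identified emotion to the style of exercise requested.
# Only the "nervous" -> "relaxing" pair comes from the example prompt in the text.
EXERCISE_STYLE = {
    "nervous": "relaxing",
    "tired": "gentle",
    "excited": "energetic",
}


def build_prompt(emotion: str) -> str:
    """Build the text prompt passed to the generative AI model."""
    style = EXERCISE_STYLE.get(emotion)
    if style:
        request = f"Please suggest {style} singing exercises."
    else:
        # Unknown emotion: ask for exercises without a style qualifier.
        request = "Please suggest singing exercises."
    return f"{request} The user is currently feeling {emotion}."


if __name__ == "__main__":
    print(build_prompt("nervous"))
```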
[0791] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0792] Step 1:
[0793] The user records audio using their device. The input is the user's raw singing voice, and the output is digitized audio data. The audio is captured using a microphone and temporarily stored in local storage.
[0794] Step 2:
[0795] The terminal sends audio data to the server. The input is the audio data stored on the terminal, and the output is the audio data transferred to the server. A communication module is used to transmit the data.
[0796] Step 3:
[0797] The server analyzes the received audio data using an AI model. The input is the audio data sent to the server, and the output is evaluation metrics such as pitch, rhythm, and voice quality. TensorFlow and similar tools are used to analyze the data and generate numerical scores.
[0798] Step 4:
[0799] The server analyzes the tone, pitch, and speed of the audio data to identify the user's emotional state. The input is audio data, and the output is the result of the emotional state identification. This is achieved using an emotion engine, such as Amazon Rekognition.
[0800] Step 5:
[0801] The server generates an individual practice plan based on the singing ability evaluation and the emotional state. The inputs are the analyzed evaluation metrics and the emotional state, and the output is a customized practice plan. Optimization processing is performed by the generation means.
[0802] Step 6:
[0803] The server sends the generated practice plan and emotion-based feedback to the terminal. The input is the practice plan and feedback content, and the output is the information presented to the terminal. Data is transmitted via a communication protocol.
[0804] Step 7:
[0805] The user checks feedback from their device and practices the provided exercises. The input is the exercise plan and feedback displayed on the device, and the output is the user's actions. Feedback is displayed and guidance is provided to the user using the device interface.
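The data handed from Step 3 through Step 6 can be sketched as follows. This is an illustrative outline under stated assumptions, not the disclosed implementation: the Evaluation and PracticePlan types, the generate_plan function, the 0-to-1 score scale, the 0.7 threshold, and the exercise names are all hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class Evaluation:
    """Output of Step 3: scores assumed to lie between 0 (poor) and 1 (good)."""
    pitch: float
    rhythm: float
    voice_quality: float


@dataclass
class PracticePlan:
    """Output of Step 5, sent to the terminal in Step 6."""
    exercises: list
    feedback: str


def generate_plan(evaluation: Evaluation, emotion: str, threshold: float = 0.7) -> PracticePlan:
    # Skill-building part: one drill for each indicator scoring below the threshold.
    exercises = [
        f"{name.replace('_', ' ')} drill"
        for name, score in asdict(evaluation).items()
        if score < threshold
    ]
    feedback = "Keep up the steady practice."
    # Emotional-support part: a nervous user starts with a relaxing exercise.
    if emotion == "nervous":
        exercises.insert(0, "relaxing breathing and humming warm-up")
        feedback = "You sound a little tense, so the session starts with a relaxing warm-up."
    return PracticePlan(exercises, feedback)


if __name__ == "__main__":
    plan = generate_plan(Evaluation(pitch=0.9, rhythm=0.5, voice_quality=0.6), "nervous")
    print(plan.exercises)
    print(plan.feedback)
```

Keeping the evaluation and the emotional state as separate inputs mirrors the separation between Step 3 and Step 4, so either analysis could be replaced without changing the plan generation stage.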
[0806] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0807] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0808] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0809] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0810] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0811] These emotions are distributed at the 3 o'clock position on the emotion map 400 and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0812] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion is located, the more visible (expressed in actions) it becomes.
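The layout of the emotion map 400 can be pictured as polar coordinates, with the ring giving the distance from the center and a clock position giving the direction. The sketch below encodes only the geometric rules stated above (left side for reaction, right side for situation, top for pleasure, bottom for displeasure). The MapEntry type, the helper names, and the example coordinates are hypothetical and are not taken from Figure 9.

```python
import math
from dataclasses import dataclass

EPS = 1e-9  # tolerance for treating a coordinate as lying exactly on an axis


@dataclass(frozen=True)
class MapEntry:
    name: str
    ring: int          # 1 = innermost (primitive), larger = closer to visible action
    clock_hour: float  # 12 = top, 3 = right, 6 = bottom, 9 = left

    def xy(self):
        # Convert the clock position to a mathematical angle (3 o'clock = 0 degrees).
        angle = math.radians(90.0 - 30.0 * self.clock_hour)
        return (self.ring * math.cos(angle), self.ring * math.sin(angle))


def describe(entry):
    """Return (side, valence) of an entry according to the rules of the map."""
    x, y = entry.xy()
    side = "situation" if x > EPS else "reaction" if x < -EPS else "both"
    valence = "pleasure" if y > EPS else "displeasure" if y < -EPS else "neutral"
    return side, valence


def distance(a, b):
    """Emotions likely to occur together sit close to each other on the map."""
    return math.dist(a.xy(), b.xy())


if __name__ == "__main__":
    calm = MapEntry("calm", ring=2, clock_hour=3.0)      # hypothetical placement
    desire = MapEntry("desire", ring=2, clock_hour=10.0)  # hypothetical placement
    print(describe(calm), describe(desire), round(distance(calm, desire), 2))
```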
[0813] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
[0814] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0815] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
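One way to obtain training targets in which emotions located close together receive similar values is to let the target value fall off with distance on the map. The sketch below uses a Gaussian falloff for this purpose. That choice, the function names, the sigma parameter, and the coordinates are assumptions made for illustration, and the text does not state how the neural network is actually trained.

```python
import math


def soft_targets(labelled, positions, sigma=1.0):
    """Target emotion values for one training sample.

    The labelled emotion gets 1.0 and every other emotion gets a value that
    decreases with its distance from the labelled emotion on the map.
    """
    lx, ly = positions[labelled]
    return {
        name: math.exp(-((x - lx) ** 2 + (y - ly) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in positions.items()
    }


def determine_emotion(values):
    """Pick the emotion with the largest emotion value."""
    return max(values, key=values.get)


if __name__ == "__main__":
    # Hypothetical map coordinates; the three neighbours mirror the Figure 10 example.
    positions = {
        "reassured": (2.0, 0.2),
        "calm": (2.0, 0.0),
        "confident": (2.2, 0.3),
        "anxious": (2.0, -1.5),
    }
    targets = soft_targets("calm", positions)
    print({name: round(value, 3) for name, value in targets.items()})
    print(determine_emotion(targets))
```

With targets of this shape, "reassured," "calm," and "confident" end up with similar values while a distant emotion such as "anxious" receives a clearly lower one.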
[0816] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0817] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0818] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0819] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0820] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0821] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0822] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0823] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0824] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0825] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements added, or elements replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0826] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0827] The following is further disclosed regarding the embodiments described above.
[0828] (Claim 1)
[0829] A receiving means for acquiring audio data from a user,
[0830] an evaluation means for analyzing the received audio data to evaluate singing ability,
[0831] a generation means for generating a practice plan suitable for the user based on the analysis results,
[0832] and a provision means for providing the generated practice plan to the user;
[0833] a system comprising the above means.
[0834] (Claim 2)
[0835] The system according to claim 1, wherein the evaluation means generates various indicators of pitch, rhythm, and voice quality.
[0836] (Claim 3)
[0837] The system according to claim 1, wherein the generation means optimizes the practice plan based on past user data and similar datasets.
[0838] "Example 1"
[0839] (Claim 1)
[0840] A receiving means for acquiring voice information from a user,
[0841] an evaluation means for analyzing the received voice information to evaluate singing ability,
[0842] a generation means for generating a training plan suitable for the user based on the analysis results,
[0843] a provision means for providing the generated training plan to the user,
[0844] a means for analyzing the voice information using a generative artificial intelligence model,
[0845] and a means for visually displaying the generated training plan;
[0846] a system comprising the above means.
[0847] (Claim 2)
[0848] The system according to claim 1, wherein the evaluation means generates various indicators of pitch, rhythm, and sound quality.
[0849] (Claim 3)
[0850] The system according to claim 1, wherein the generation means optimizes the training plan based on information of past users and similar information sets.
[0851] "Application Example 1"
[0852] (Claim 1)
[0853] A receiving means for acquiring audio data from a user,
[0854] an evaluation means for analyzing the received audio data to evaluate vocal technique,
[0855] a generation means for generating a practice plan suitable for the user based on the analysis results,
[0856] a provision means for providing the generated practice plan to the user visually and audibly,
[0857] and a means by which a robot records the user's voice on the spot and provides an evaluation in real time;
[0858] an automated system comprising the above means.
[0859] (Claim 2)
[0860] The automated system according to claim 1, wherein the evaluation means generates various indicators of pitch, timing, and sound quality, and further presents these as audio feedback using an audio playback function.
[0861] (Claim 3)
[0862] The automated system according to claim 1, wherein the generation means optimizes the practice plan based on past user data and similar data sets, and provides continuous practice support through interaction with a robot.
[0863] "Example 2 of combining an emotion engine"
[0864] (Claim 1)
[0865] A receiving means for acquiring voice information from a user,
[0866] an evaluation means for analyzing the received voice information to evaluate singing ability,
[0867] a generation means for generating a training plan suitable for the user based on the analysis results,
[0868] a provision means for providing the generated training plan to the user,
[0869] an identification means for analyzing the voice information and identifying the user's emotional state,
[0870] and an adjustment means for adjusting the training plan based on the identified emotional state;
[0871] a system comprising the above means.
[0872] (Claim 2)
[0873] The system according to claim 1, wherein the evaluation means generates various indicators of pitch, rhythm, and sound quality.
[0874] (Claim 3)
[0875] The system according to claim 1, wherein the generation means optimizes the training plan based on information of past users and similar information sets, and further takes into account the emotional state obtained by the identification means.
[0876] "Application example 2 when combining with an emotional engine"
[0877] (Claim 1)
[0878] An acquisition means for acquiring audio data from a user,
[0879] an evaluation means for analyzing the acquired audio data to evaluate singing ability,
[0880] a generation means for generating a practice plan suitable for the user based on the analysis results,
[0881] an identification means for identifying the user's emotional state and providing feedback according to that state,
[0882] and a provision means for providing the generated practice plan and feedback to the user;
[0883] a system comprising the above means.
[0884] (Claim 2)
[0885] The system according to claim 1, wherein the evaluation means generates various indicators of tone, rhythm, and voice characteristics.
[0886] (Claim 3)
[0887] The system according to claim 1, wherein the generation means optimizes the practice plan based on information from past users and similar information sets, and makes adjustments based on emotional state.
[Explanation of symbols]
[0888]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot
Claims
1. A system comprising: a receiving means for acquiring audio data from a user; an evaluation means for analyzing the received audio data to evaluate singing ability; a generation means for generating a practice plan suitable for the user based on the analysis results; and a provision means for providing the generated practice plan to the user.
2. The system according to claim 1, wherein the evaluation means generates various indicators of pitch, rhythm, and voice quality.
3. The system according to claim 1, wherein the generation means optimizes the practice plan based on past user data and similar datasets.