system
A system using AI analysis of audio data to generate personalized singing training plans addresses inefficiencies in existing methods, offering effective and affordable singing improvement.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-26
AI Technical Summary
Existing singing training methods are inefficient, costly, and do not cater to individual needs, lacking effective and accessible means for improving singing ability.
A system that uses a device to collect audio data, transmit it to a server for AI analysis, and generate personalized voice training plans based on the analysis results, incorporating a database for optimizing training content.
Provides objective, efficient, and individually tailored singing training plans, enabling users to improve their singing skills effectively and affordably.
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance as a response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] Many people want to improve their singing ability but, in general, do not know how to do so effectively. In addition, hiring a professional voice trainer is costly and subject to significant time constraints. There is therefore a need for easily accessible, efficient, and low-cost support for improving singing ability. However, existing training methods do not meet individual needs, and learners are left with inefficient, self-taught practice.
Means for Solving the Problems
[0005] This invention provides a means for objectively evaluating singing ability by collecting a user's singing voice using a device that acquires audio data, transmitting the acquired audio data to a server, and analyzing the audio data using AI. Furthermore, it includes a means for generating and providing a voice training plan tailored to the user's individual needs based on the analysis results. This system enables users to efficiently improve their singing ability by referencing a past learning database and optimizing the training plan.
[0006] "Audio data" refers to data that records human voices in digital format.
[0007] A "device" is an electronic device used to record a user's voice and store it as data.
[0008] A "server" is a computer system that has the ability to receive audio data via a network and analyze that data.
[0009] "AI" stands for artificial intelligence, and it is a technology that uses machine learning algorithms to analyze voice data and evaluate singing ability.
[0010] "Analysis" is the process of mechanically analyzing received audio data, extracting its constituent elements, and evaluating them.
[0011] "Singing ability" is the skill that is evaluated based on elements such as pitch, rhythm, tone quality, and expressiveness in musical vocal expression.
[0012] A "voice training plan" is a practice menu individually designed to improve the user's singing ability.
[0013] A "database" is a digital recording device that systematically stores past analysis data and training results.
[0014] "Optimizing a training plan" means adjusting and improving the training content to most effectively meet the individual needs and goals of the user.
Brief Description of the Drawings
[0015]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.
Embodiments for Carrying Out the Invention
[0016] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.
[0017] First, the terms used in the following description will be explained.
[0018] In the following embodiments, a numbered processor (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), etc.
[0019] In the following embodiments, a numbered RAM (Random Access Memory) is a memory in which information is temporarily stored and is used as a work memory by the processor.
[0020] In the following embodiments, a numbered storage is one or more non-volatile storage devices that store various programs and various parameters, etc. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), or magnetic tapes, etc.
[0021] In the following embodiments, a numbered communication interface (I / F) is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), or Bluetooth (registered trademark).
[0022] In the following embodiments, "A and / or B" is synonymous with "at least one of A and B." That is, "A and / or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and / or B" applies when expressing three or more things linked by "and / or."
[0023] [First Embodiment]
[0024] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0025] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0026] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0027] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0028] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0029] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0030] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0031] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0032] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0033] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0034] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0035] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0036] This invention is a system that acquires voice data and provides the user with optimal voice training based on the analysis results. In implementation, the system mainly consists of a terminal for collecting voice data, a server for analysis, and the user.
[0037] First, the user records audio using their device. The device is an electronic device such as a smartphone or tablet, and the user launches a recording application. The user confirms that the microphone is properly connected to the device, presses the record button, and sings a portion of the designated song.
[0038] Once recording is complete, the device sends the audio data to the server. The audio data is converted to a digital format and securely sent to the server via the internet along with the user's identification information. The server receives this data and begins its analysis.
[0039] The server uses machine learning algorithms to analyze the received audio data. The analysis primarily measures features such as pitch, rhythm, volume, and tone quality. The server comprehensively evaluates these features and generates a numerical singing ability score.
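The feature measurement described in [0039] can be illustrated with a minimal sketch. The text does not say how pitch or volume is measured; the sketch assumes a root-mean-square amplitude as the volume measure and a plain autocorrelation peak as the pitch estimate. The function names, the frequency search range, and the synthetic test tone are all hypothetical.

```python
import math

def rms_volume(samples):
    """Root-mean-square amplitude, used here as a simple proxy for volume."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_pitch_hz(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency from the autocorrelation peak.

    fmin and fmax are assumed search bounds, not values from the text.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag

# Hypothetical input: a synthetic 200 Hz tone standing in for a recording.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200.0 * n / sample_rate)
        for n in range(2000)]
pitch = estimate_pitch_hz(tone, sample_rate)
volume = rms_volume(tone)
```

A real singing voice would need framing, voicing detection, and a more robust estimator; the sketch only shows the kind of numeric feature the server could feed into its scoring.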
[0040] The server then consults a database based on the evaluation score to create an optimal voice training plan for the user. For example, if the user's pitch is evaluated as unstable, the server can generate a training plan that includes scale exercises.
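As a rough illustration of [0040], the mapping from evaluation scores to a plan could be a simple rule table. Only the pairing of unstable pitch with scale exercises comes from the text; the other exercises, the 0 to 100 score scale, and the threshold of 70 are assumptions made for this sketch.

```python
# Hypothetical exercise table; only the "pitch" entry comes from the text.
EXERCISES = {
    "pitch": "scale exercises",
    "rhythm": "clapping along with a metronome",
    "volume": "breath-support drills",
    "tone": "vowel resonance exercises",
}

def build_plan(feature_scores, threshold=70):
    """Return one exercise per feature scoring below the threshold, weakest first."""
    weak = sorted((score, name)
                  for name, score in feature_scores.items()
                  if score < threshold)
    return [EXERCISES[name] for _, name in weak]

# Assumed per-feature scores on a 0-100 scale.
plan = build_plan({"pitch": 55, "rhythm": 82, "volume": 68, "tone": 90})
```

In the system described here, the table would presumably be replaced by lookups against the database of past analysis data and training results.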
[0041] Finally, the server sends back the generated voice training plan and evaluation results to the terminal. The terminal visualizes this information on its user interface. The user can then review this information and use it to improve their daily voice training.
[0042] This system provides users with a personalized and efficient singing improvement program. For example, to improve pitch accuracy, the server might suggest that the user repeatedly practice specific scales or melodic patterns. Thus, this system becomes a useful tool for all users who want to improve their singing skills.
[0043] The following describes the processing flow.
[0044] Step 1:
[0045] The user launches a recording application on their device and configures the audio input device appropriately. When the user presses the record button, recording begins, and audio data is collected via the device's microphone.
[0046] Step 2:
[0047] The user ends the recording by pressing the stop button. The device then formats the recorded audio as a digital audio file and temporarily stores the audio data in memory.
[0048] Step 3:
[0049] The device packages the stored voice data together with the associated user identification information and sends them to the server using a secure protocol (e.g., HTTPS).
[0050] Step 4:
[0051] The server receives the audio data, decodes it, and verifies its integrity. If there are no problems, the audio data is input into a machine learning algorithm to begin analysis.
[0052] Step 5:
[0053] The server performs audio analysis and extracts audio features such as pitch, rhythm, and volume. Based on this, it evaluates the user's singing ability and generates a numerical score.
[0054] Step 6:
[0055] The server consults a database and creates a voice training plan recommended to the user based on their evaluation score. This includes specific instructions for scale practice and improving expressiveness.
[0056] Step 7:
[0057] The server sends the generated evaluation results and training plan to the terminal.
[0058] Step 8:
[0059] The device displays the received data on the user interface, allowing the user to review training results and suggestions. Based on this, the user proceeds with their daily practice.
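Steps 3 and 4 of the flow above (packaging the audio with the user's identification information, then decoding it and verifying its integrity on the server) can be sketched as follows. The JSON layout, the field names, and the use of a SHA-256 checksum are assumptions; the text names HTTPS as the transport but does not specify a payload format or an integrity mechanism, and the transport itself is left out here.

```python
import base64
import hashlib
import json

def pack_recording(user_id, audio_bytes):
    """Terminal side: bundle audio and user ID into a JSON body with a checksum."""
    return json.dumps({
        "user_id": user_id,
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
        "sha256": hashlib.sha256(audio_bytes).hexdigest(),
    })

def unpack_recording(body):
    """Server side: decode the body and reject it if the checksum does not match."""
    payload = json.loads(body)
    audio = base64.b64decode(payload["audio_b64"])
    if hashlib.sha256(audio).hexdigest() != payload["sha256"]:
        raise ValueError("audio data failed the integrity check")
    return payload["user_id"], audio

# Hypothetical four-byte "recording" used only to exercise the round trip.
body = pack_recording("user-001", b"\x00\x01\x02\x03")
user_id, audio = unpack_recording(body)
```

The checksum only detects accidental corruption; confidentiality and authentication would come from the secure protocol mentioned in Step 3.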
[0060] (Example 1)
[0061] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0062] Providing an effective voice training plan to improve individual singing abilities is challenging. Existing technologies rely heavily on subjective evaluation of voice information and uniform training plans, resulting in insufficient feedback tailored to individual user characteristics. Furthermore, developing such plans is time-consuming and labor-intensive, hindering efficient improvement of singing abilities.
[0063] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0064] In this invention, the server includes means for analyzing voice information and evaluating singing ability, means for generating an individual voice training plan based on the evaluation results, and means for providing the generated voice training plan to the user. This makes it possible to provide the user with objective, efficient, and individually optimized voice training.
[0065] "Audio information" refers to digital or analog data recorded in a form that includes the frequency, intensity, and temporal changes of sound.
[0066] The term "device" refers to a combination of hardware and software elements used to achieve a specific function or operation.
[0067] An "analysis device" refers to computing resources used to process input data and extract or analyze specific information.
[0068] "Singing ability" is an evaluation of the ability to sing with musically accurate pitch, rhythm, volume, and tone quality.
[0069] An "evaluation indicator" is a way of expressing a specific ability or performance using numerical values or a scale.
[0070] A "voice training plan" refers to a plan that includes specific practice content and steps to improve the characteristics of the subject's voice.
[0071] "User" refers to an individual or group that operates this system and is eligible to receive a training plan.
[0072] An "information storage device" refers to a medium or device that stores data and information and allows it to be retrieved as needed.
[0073] To implement this invention, the user first records audio information using a device such as a smartphone or tablet. A common recording application is installed on the device, and the microphone is configured to function correctly. The user presses the record button and sings a portion of a designated song to collect audio information. Once the recording is complete, the device converts the audio information into a digital format and sends it to a server via the internet. Encryption technology is used to ensure the security of the data during transmission.
[0074] The server can use audio analysis software to analyze the received audio information. Specifically, it uses the Python library LibROSA together with machine learning algorithms to measure characteristics such as pitch, rhythm, volume, and tone quality, and generates evaluation metrics from them. A generative AI model then compares these evaluation metrics with standard data to quantify the user's singing ability.
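To make the comparison with standard data in [0074] concrete, the following sketch scores pitch accuracy from the deviation, in cents, between sung and target frequencies. Plain arithmetic stands in for the generative AI model here, and the 50-cent tolerance, the 0 to 100 scale, and the function names are assumptions rather than anything stated in the text.

```python
import math

def cents_off(sung_hz, target_hz):
    """Signed pitch error in cents (100 cents = one equal-tempered semitone)."""
    return 1200.0 * math.log2(sung_hz / target_hz)

def pitch_score(sung_hz, target_hz, tolerance_cents=50.0):
    """Map the mean absolute error to 0..100; errors at or beyond the tolerance score 0."""
    errors = [abs(cents_off(s, t)) for s, t in zip(sung_hz, target_hz)]
    mean_error = sum(errors) / len(errors)
    return max(0.0, 100.0 * (1.0 - mean_error / tolerance_cents))

# Hypothetical note sequences: frequencies the user sang vs. the reference melody.
score = pitch_score([440.0, 495.0, 523.25], [440.0, 493.88, 523.25])
```

Working in cents rather than hertz keeps the error comparable across low and high notes, since a fixed hertz offset is a larger musical error at lower pitches.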
[0075] The server generates an individualized voice training plan based on the analysis results. This plan is created using various training methods and historical training data stored in the database. By referring to historical data, an optimized plan tailored to the user's characteristics is provided. For example, if the server determines that the user's pitch is unstable, it can provide a training plan that includes pitch practice and melody pattern practice.
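One way to read the reference to historical training data in [0075] is that the server prefers exercises that helped earlier users with the same weakness. The sketch below assumes a record layout (`weakness`, `exercise`, `score_gain`) and ranks exercises by mean past gain; the layout and the sample records are invented for illustration and do not come from the text.

```python
from collections import defaultdict

def rank_exercises(history, weakness):
    """Rank exercises by mean past score gain among records sharing this weakness."""
    gains = defaultdict(list)
    for record in history:
        if record["weakness"] == weakness:
            gains[record["exercise"]].append(record["score_gain"])
    averages = {name: sum(vals) / len(vals) for name, vals in gains.items()}
    return sorted(averages, key=averages.get, reverse=True)

# Invented historical records; a real system would read these from the database.
history = [
    {"weakness": "pitch", "exercise": "pitch practice", "score_gain": 8},
    {"weakness": "pitch", "exercise": "melody pattern practice", "score_gain": 5},
    {"weakness": "pitch", "exercise": "pitch practice", "score_gain": 6},
    {"weakness": "rhythm", "exercise": "metronome practice", "score_gain": 9},
]
ranked = rank_exercises(history, "pitch")
```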
[0076] Finally, the server sends the generated voice training plan and evaluation results back to the terminal. The terminal displays this information in its user interface, allowing the user to refer to the results and training plan. The user can then perform daily voice training based on the displayed information.
[0077] As a concrete example, a prompt message to be input to the generative AI model could be something like, "Please calculate an evaluation score using the user's voice data and generate a training plan." This system allows users to efficiently improve their singing ability.
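A prompt like the one quoted in [0077] would need the extracted features attached to it in some form. The sketch below shows one possible way to assemble such a prompt as text; the feature names and the layout are assumptions, and the call to the generative AI model is omitted because the text names no specific model or API.

```python
def build_prompt(features):
    """Append extracted features to the instruction sentence quoted in the text."""
    instruction = ("Please calculate an evaluation score using the user's "
                   "voice data and generate a training plan.")
    lines = [f"- {name}: {value}" for name, value in sorted(features.items())]
    return instruction + "\nVoice data features:\n" + "\n".join(lines)

# Hypothetical feature values produced by the analysis step.
prompt = build_prompt({"pitch_error_cents": 32.5, "tempo_bpm": 96})
```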
[0078] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0079] Step 1:
[0080] The user records audio information using the device. After the user launches the recording application and confirms that the microphone is securely connected, they press the record button and sing the designated song. This input is the user's raw voice data. When recording is complete, the device converts the recorded analog audio into a digital format. The output is the digitized audio data.
[0081] Step 2:
[0082] The terminal sends digital audio data to the server. Before transmission, the audio data is encrypted along with the user's identification information. The input is digitized audio data, and the output is encrypted data transmitted to the server via the internet.
[0083] Step 3:
[0084] The server receives audio data and begins analysis. The input is encrypted audio data, which is first decrypted. The decrypted data is fed into a machine learning algorithm to extract features such as pitch, rhythm, volume, and sound quality. Audio processing libraries such as LibROSA can be used in this analysis process. The output is the feature vectors extracted from the audio data.
[0085] Step 4:
[0086] The server generates evaluation metrics based on extracted features. The input consists of features from audio data, and a generative AI model is used to compare these features with standard data, thereby quantifying the user's singing ability. The output is a numerical value representing the evaluation metric.
[0087] Step 5:
[0088] The server generates individualized voice training plans based on evaluation metrics and by referencing a database. Inputs include evaluation metrics and user characteristic data. The server selects the optimal training menu from historical data and training samples to create an individualized plan. The output is a voice training plan optimized for the user.
[0089] Step 6:
[0090] The server sends the generated voice training plan and evaluation metrics back to the terminal. The input is the voice training plan and evaluation metrics, which are transmitted to the terminal via the internet. The output is the training plan and evaluation results sent to the terminal.
[0091] Step 7:
[0092] The terminal displays the received information on the user interface. The input consists of a voice training plan and evaluation metrics, which are visualized on the screen in a user-friendly format. Users can use this output to improve their daily voice training.
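Step 3 above lists rhythm among the extracted features without saying how it is measured. As one hedged possibility, rhythm accuracy could be expressed as the average distance between each sung note onset and the nearest beat of a known tempo. The sketch assumes onset times are already available and that the beat grid starts at time zero; both are assumptions, as is the function name.

```python
def rhythm_error_ms(onsets_s, tempo_bpm):
    """Mean absolute distance, in milliseconds, from each onset to the nearest beat.

    Assumes the first beat falls at t = 0 seconds.
    """
    beat_s = 60.0 / tempo_bpm
    errors = []
    for onset in onsets_s:
        nearest_beat = round(onset / beat_s) * beat_s
        errors.append(abs(onset - nearest_beat) * 1000.0)
    return sum(errors) / len(errors)

# Hypothetical onsets (seconds) for a song at 120 beats per minute.
error = rhythm_error_ms([0.0, 0.52, 0.98, 1.5], tempo_bpm=120)
```

A smaller value means the singer stays closer to the beat; the value could be converted into an evaluation metric alongside the pitch and volume measures.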
[0093] (Application Example 1)
[0094] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0095] In recent years, there has been a growing demand for entertainment and learning support using home-use electronic devices. However, a challenge remains: there is a lack of effective means to improve users' singing abilities while providing individualized feedback. Conventional systems offer limited evaluation and feedback, making it difficult for users to efficiently improve their singing skills.
[0096] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0097] In this invention, the system includes means for acquiring voice information with a device, means for transmitting the acquired voice information to an information processing device for analysis, and means for providing feedback that allows the user to improve their singing ability while interacting with a home-use device. This makes it possible for the user to receive individually customized feedback in real time at home.
[0098] "Audio information" refers to data related to the voice and singing performances emitted by users.
[0099] "Device" refers to a physical device that has the function of acquiring and transmitting audio information.
[0100] An "information processing device" refers to a system that analyzes received audio information and performs calculations to generate appropriate indicators and feedback.
[0101] "Analysis" refers to the process of extracting features from audio information using specific machine learning models or algorithms and generating evaluation metrics.
[0102] "Evaluation metrics" refer to numerical or qualitative criteria related to a user's singing ability and characteristics, calculated based on analyzed audio information.
[0103] "Singing ability" refers to the user's singing ability and characteristics, including pitch, rhythm, and tone quality.
[0104] "Feedback" refers to information that includes specific areas for improvement and points of concern, provided to users based on the analyzed results.
[0105] "Home-use device" refers to robots and electronic devices used in the home environment, for example for entertainment or learning support.
[0106] The system for implementing this invention has a series of processes for acquiring voice information, analyzing it, and providing feedback to the user. First, the device used by the user is equipped with a microphone and has the function of acquiring voice information. The acquired voice information is transmitted via the internet to a server, which is an information processing device. In this process, the data is transmitted securely using security protocols.
[0107] The server uses machine learning algorithms to analyze the audio information. Specifically, machine learning tools such as TensorFlow (registered trademark) are used to identify audio characteristics such as pitch, rhythm, volume, and tone quality. The server analyzes these characteristics and generates evaluation metrics. Based on the evaluation results, it generates an individual singing training plan, also referring to past data held in the data storage device.
[0108] The generated feedback and training plans are then returned to the user's device via the internet. Users can interact with the home-use device and use the generated feedback to improve their singing ability. This system enables real-time feedback; for example, when a user sings a particular song, they might receive specific feedback such as, "Your pitch is slightly off. Try singing it a little slower."
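The feedback sentence quoted in [0108] suggests a mapping from a measured pitch error to short natural-language advice. A minimal sketch of such a mapping follows; the cent thresholds and the sharp/flat wording are assumptions, and in the system described here the wording could equally come from the generative AI model.

```python
def pitch_feedback(mean_cents_off):
    """Turn a signed mean pitch error in cents into a short feedback sentence.

    The 10- and 30-cent thresholds are illustrative assumptions.
    """
    size = abs(mean_cents_off)
    if size < 10:
        return "Your pitch is on target."
    direction = "sharp" if mean_cents_off > 0 else "flat"
    degree = "slightly" if size < 30 else "noticeably"
    return f"Your pitch is {degree} {direction}. Try singing it a little slower."

message = pitch_feedback(-20.0)
```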
[0109] By using a generative AI model for analysis, users can receive personalized feedback in real time. As an example of this process, the prompt message to the generative AI model would be, "Analyze the following audio data to evaluate the user's singing ability and generate and present an appropriate training plan."
[0110] The flow of the specific processing in Application Example 1 will be explained using Figure 12.
[0111] Step 1:
[0112] The device acquires the user's voice information through the microphone. The input is the user's singing data, and the output is an audio file converted into a digital format. The device sends this audio file to the server via the internet.
[0113] Step 2:
[0114] The server first inputs the received audio file into a machine learning algorithm to analyze it. The input is the audio file, and the output is audio features (pitch, rhythm, volume, sound quality, etc.). The server uses TensorFlow to extract audio features and generate evaluation metrics.
[0115] Step 3:
[0116] The server creates an individualized singing training plan for each user based on the generated evaluation metrics. The input is the evaluation metrics obtained through voice analysis, and the output is the details of the training plan. The server optimizes the training plan by referring to past database data.
[0117] Step 4:
[0118] The server sends the generated singing training plan and evaluation metrics to the terminal. The input is the training plan and evaluation metrics, and the output is the status of successful transmission to the terminal. The server sends the data to the terminal using a secure protocol.
[0119] Step 5:
[0120] The terminal visualizes the received singing training plan and evaluation metrics on the user interface. The input is data received from the server, and the output is the feedback screen the user sees. The terminal provides this information to the user in an easy-to-understand graphical format.
[0121] Step 6:
[0122] Users practice improving their singing ability based on feedback while interacting with a home-use device. Input is feedback and training plans on the device, and output is the user's improved singing ability. Users can adjust pitch and repeat rhythm exercises according to instructions.
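Steps 4 and 5 above pass a training plan and evaluation metrics from the server to the terminal. The text does not define a message format, so the following is only a sketch of one possible structure, serialized as JSON; the class name and field names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class TrainingResponse:
    """Hypothetical server-to-terminal message carrying metrics and a plan."""
    user_id: str
    metrics: dict                      # e.g. {"pitch": 55, "rhythm": 82}
    plan: list = field(default_factory=list)

def to_json(response):
    """Server side: serialize the response for transmission."""
    return json.dumps(asdict(response), sort_keys=True)

def from_json(text):
    """Terminal side: rebuild the response before rendering the feedback screen."""
    return TrainingResponse(**json.loads(text))

sent = TrainingResponse("user-001", {"pitch": 55, "rhythm": 82},
                        ["scale exercises"])
received = from_json(to_json(sent))
```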
[0123] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0124] This invention is a system that combines a system for improving a user's singing ability with an emotion engine that recognizes the user's emotions. This system mainly consists of a terminal, a server, and an emotion engine.
[0125] In use, the user first launches the application on their device and performs voice recording. The device captures the user's singing voice through the microphone and converts the audio data into a digital format. This data, along with the user's identification information, is transmitted to the server in a secure manner.
[0126] The server receives audio data using a secure protocol, decodes it, and verifies its integrity. Next, it uses machine learning algorithms and an emotion engine to analyze the received audio data. Here, the machine learning algorithms extract audio features such as pitch, rhythm, and volume, and quantify singing ability. Simultaneously, the emotion engine analyzes the emotions in the audio and recognizes the user's emotional state.
[0127] Once the analysis is complete, the server considers the features and evaluation scores obtained from the audio, as well as the user's emotional state, to design a personalized voice training plan. For example, if it detects that the user is nervous, it can suggest feedback and practice exercises to help them relax.
[0128] Next, the server sends the generated voice training plan and emotion-based feedback to the device. The device visualizes this information on the user interface and provides it to the user. The user can then carry out their daily training based on the suggestions they receive.
[0129] For example, if the emotion engine detects "anxiety" during voice training, the server not only adjusts the voice training content but also suggests time for relaxation, so that the system helps the user reduce stress. By considering the user's emotional state in this way, more effective improvement in singing ability can be expected. This invention is technically easy to implement and provides an efficient training environment by flexibly responding to the individual needs of the user.
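The anxiety example in [0129] can be sketched as a small adjustment applied to an existing plan. The emotion labels, the wording of the relaxation step, and the choice to place it first are assumptions; the emotion engine itself is represented only by its output label.

```python
# Hypothetical relaxation step inserted when the user seems tense.
RELAXATION_STEP = "5-minute breathing and relaxation break"

def adjust_for_emotion(plan, emotion):
    """Prepend a relaxation step when the detected emotion is anxiety or nervousness."""
    if emotion in {"anxiety", "nervousness"}:
        return [RELAXATION_STEP] + list(plan)
    return list(plan)

# `emotion` stands in for the label returned by the emotion engine.
adjusted = adjust_for_emotion(["scale exercises"], emotion="anxiety")
```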
[0130] The following describes the processing flow.
[0131] Step 1:
[0132] The user launches the application on their device and prepares to record. When the user presses the record button, the device records the user's singing voice through the microphone and saves it as digital audio data.
[0133] Step 2:
[0134] The device converts the recorded audio data into a digital format and sends it to the server along with the user's identification information. The transmission is performed using a secure protocol.
[0135] Step 3:
[0136] The server verifies the integrity of the received audio data and analyzes it using a machine learning algorithm. This analysis extracts audio features (pitch, rhythm, volume) and generates a singing ability score.
[0137] Step 4:
[0138] The server uses an emotion engine to analyze emotional information within the voice data and identify the user's emotional state. For example, it determines emotions from factors such as voice tone and tempo.
[0139] Step 5:
[0140] The server combines voice characteristics and user emotions to design a personalized voice training plan. For example, it includes practice exercises tailored to a user who needs to relax.
[0141] Step 6:
[0142] The server sends the generated training plan and feedback to the user's device. Emotion-based messages are also included as needed.
[0143] Step 7:
[0144] The device displays a training plan and feedback to the user through its user interface. The user follows the instructions received and begins their daily training.
[0145] Step 8:
[0146] The user continues practicing and returns to the cycle of recording again. The system facilitates continuous improvement through this cycle.
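As one non-limiting illustration of Steps 3 to 5 above, the following Python sketch combines a singing ability score with a detected emotional state to assemble a plan. The names `TrainingPlan` and `design_plan`, the emotion labels, the score threshold of 60, and the exercise names are all assumptions introduced for illustration; the present disclosure does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingPlan:
    """Hypothetical container for a generated voice training plan."""
    exercises: list = field(default_factory=list)
    message: str = ""


def design_plan(score: float, emotion: str) -> TrainingPlan:
    """Combine a singing ability score (0-100) with an emotion label.

    The threshold and exercise names are illustrative assumptions.
    """
    plan = TrainingPlan()
    # Lower scores get foundational work; higher scores get expression work.
    if score < 60:
        plan.exercises.append("scale practice")
    else:
        plan.exercises.append("expressiveness practice")
    # When tension is detected, relaxation is placed first in the plan.
    if emotion in ("nervous", "anxiety"):
        plan.exercises.insert(0, "breathing and relaxation")
        plan.message = "Take a short break to relax before practicing."
    return plan


plan = design_plan(score=52.0, emotion="anxiety")
print(plan.exercises)
```

In an actual implementation the score and the emotion label would come from the machine learning algorithm and the emotion engine, respectively, rather than being passed in by hand.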
[0147] (Example 2)
[0148] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0149] Conventional voice training systems only quantify and evaluate a user's voice ability, making it difficult to provide training plans that take into account the user's emotional state. As a result, emotions such as tension and anxiety can affect singing ability, leading to challenges in providing effective training. Furthermore, generating training plans that meet the individual needs of each user requires more flexible and detailed analysis.
[0150] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0151] In this invention, the server includes means for analyzing voice information and evaluating voice ability, means for recognizing emotional states in the voice information using an emotion analysis device, and means for generating individual voice training plans based on the analysis results. This makes it possible to provide more effective and personalized training plans that also take into account the user's emotional state.
[0152] "Audio information" refers to data that represents speech in digital format, and is the data that is captured from the user's voice and subjected to analysis.
[0153] "Device" refers to equipment or devices used to acquire audio information, such as microphones and terminals equipped with recording functions.
[0154] A "central processing unit" refers to a system with computing power, such as a server or cloud computer, that centrally analyzes and processes audio information.
[0155] "Vocal ability" refers to characteristics that indicate a user's singing ability and vocalization ability, and is an evaluation index that is quantified through the analysis of voice information.
[0156] An "emotion analysis device" refers to a system equipped with a mechanism or algorithm to recognize emotional information from voice information, and is a technology for analyzing a user's emotional state.
[0157] A "voice training plan" is a training program or instructional guideline designed to improve a user's voice abilities, generated based on their analyzed voice abilities and emotional state.
[0158] An "information storage system" refers to a database system that stores past data and user history, and allows users to access that data as needed.
[0159] This invention is a voice training system aimed at improving a user's voice abilities, and is characterized by its inclusion of an emotion analysis function. This system mainly consists of a terminal, a central processing unit, and an emotion analysis device, and provides the user with an individualized voice training plan through the acquisition, analysis, and provision of feedback of voice information.
[0160] The user first launches an application on their device and records their voice using the microphone. The device then converts the acquired voice into a digital format and transmits it to a central processing unit using a secure protocol. The central processing unit utilizes machine learning algorithms to analyze the received voice information and evaluate the user's voice abilities. It also recognizes emotions in the voice using an emotion analyzer and incorporates this into the training plan.
[0161] Specifically, the voice training plan generated by the central processing unit is adjusted based on the user's voice characteristics and emotional state. For example, if the system recognizes that the user is feeling anxious during recording, the training plan may include exercises to promote relaxation. Furthermore, the analysis results are quantified as evaluation metrics, allowing the user to track their progress.
[0162] This system is expected to enable a more personalized approach to users and improve their voice abilities more effectively compared to conventional technologies. An example of a prompt generated by the AI model is, "Analyze the singing voice recorded by the user and generate an optimal training plan based on the voice characteristics and emotional state." This allows users to receive scientifically-based and effective training.
[0163] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0164] Step 1:
[0165] The user launches an application on the device and starts recording voice. The user records their voice through the microphone. The device converts the voice into a digital format and prepares it as voice information along with the user's identification information. This input consists of the user's voice data and identification information, which forms the basis for transmission to the central processing unit.
[0166] Step 2:
[0167] The terminal sends the prepared audio information to the server using a secure protocol (e.g., HTTPS). The transmitted audio information becomes the server's input. The server decodes the received audio information and verifies the integrity of the data. This process checks the integrity of the audio data and prepares it for analysis.
[0168] Step 3:
[0169] The server analyzes the audio information. The input is the audio data decoded in step 2. Machine learning algorithms are used for the analysis to extract audio features (pitch, rhythm, volume, etc.). This data processing quantifies the voice ability and generates evaluation metrics. Simultaneously, an emotion analysis device operates to identify the emotional states contained in the audio. The output of this process is the voice evaluation score and emotional state data.
[0170] Step 4:
[0171] The server generates individualized voice training plans based on speech features and emotional states. The input is the evaluation score and emotional state obtained in step 3, and a generative AI model is used to create prompt sentences based on this. Specifically, the plan includes concrete instruction such as "incorporate relaxation methods to alleviate user tension into the training plan." The output of this step is the individualized voice training plan.
[0172] Step 5:
[0173] The server sends the created voice training plan to the terminal. The terminal decodes the received plan and displays it in the user interface. Based on the displayed plan, the user can perform their daily training. The output of this final step is the voice training plan and feedback that the user receives.
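As one non-limiting illustration of Step 4 above, the following Python sketch assembles a prompt from the evaluation score and emotional state obtained in Step 3. The function name `build_prompt`, the line layout, and the emotion labels are assumptions; only the two quoted instruction sentences are taken from the description. Submitting the prompt to a generative AI model is not shown.

```python
def build_prompt(score: float, emotion: str) -> str:
    """Assemble a prompt string from the Step 3 outputs (hypothetical format)."""
    lines = [
        "Analyze the singing voice recorded by the user and generate an "
        "optimal training plan based on the voice characteristics and "
        "emotional state.",
        f"Voice evaluation score: {score:.1f}",
        f"Detected emotional state: {emotion}",
    ]
    # Add the instruction quoted in Step 4 when tension is detected.
    if emotion in ("tension", "anxiety"):
        lines.append(
            "Incorporate relaxation methods to alleviate user tension "
            "into the training plan."
        )
    return "\n".join(lines)


prompt = build_prompt(72.5, "anxiety")
print(prompt)
```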
[0174] (Application Example 2)
[0175] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0176] In recent years, training systems aimed at improving users' voice abilities have become widespread. However, these systems primarily focus on evaluating abilities based on the characteristics of voice data and do not provide training that takes into account the user's inner emotional state. Therefore, there is a need to develop a system that accurately grasps the user's emotional state and effectively improves voice abilities through feedback based on that understanding.
[0177] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0178] In this invention, the server includes means for analyzing voice data and evaluating voice ability, means for analyzing the user's emotional state from the acquired voice data, and means for generating emotion-appropriate feedback based on the user's emotional state. This enables personalized voice training that takes the user's emotions into consideration.
[0179] "Audio data" refers to data obtained by digitizing an audio signal and converting it into a format that can be processed.
[0180] A "device" is hardware or a combination of hardware designed to perform a specific function.
[0181] A "central processing unit" is a computer system or server used to receive, analyze, and process data.
[0182] "Vocal ability" refers to a numerical or evaluative representation of vocal expressiveness and technique.
[0183] A "voice training plan" is a plan that combines a series of practice methods and activities to improve voice ability.
[0184] A "user" is an individual or agent who uses the system.
[0185] "Emotional state" refers to the user's internal psychological condition, analyzed based on voice data.
[0186] "Feedback" refers to information or advice provided to the user based on the analysis results.
[0187] An "information collection" is a set of data, including past data and history, that is used for analysis and optimization.
[0188] The server plays a central role in this system, analyzing the user's voice data. The user acquires their own voice using a device equipped with voice recording capabilities. For example, a household robot can function as this device. The voice data is transmitted from the terminal to the central processing unit, where it is stored as digital data.
[0189] The server uses a Python-based speech analysis library (e.g., librosa) to extract features from the speech data. Speech ability is then numerically evaluated by a machine learning model based on these features. Furthermore, an emotion analysis engine (e.g., IBM Watson® Emotion Analysis) is used to understand the emotional state from the acquired speech. This process reveals the user's inner emotions.
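As one non-limiting illustration of the feature extraction described above, the following Python sketch estimates pitch by autocorrelation and volume by root-mean-square amplitude. It deliberately uses only the standard library and a synthetic tone, as a stand-in for a speech analysis library such as librosa; the function names, the 80 to 1000 Hz search range, and the test tone are assumptions. Rhythm extraction is not shown.

```python
import math


def rms_volume(samples):
    """Root-mean-square amplitude, a simple proxy for volume."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency in Hz by autocorrelation.

    Searches the lags corresponding to fmin..fmax and returns the
    frequency of the lag with the strongest self-similarity.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        corr = sum(samples[n] * samples[n + lag]
                   for n in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag


# Synthetic tone standing in for a recorded voice: 200 Hz, amplitude 0.5.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
        for n in range(2000)]

pitch_hz = estimate_pitch(tone, SAMPLE_RATE)
volume = rms_volume(tone)
print(round(pitch_hz, 1), round(volume, 3))
```

A real singing voice contains harmonics and noise, so a production system would more likely rely on a dedicated pitch tracker than on this plain autocorrelation.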
[0190] Next, the server generates a voice training plan that takes into account the user's emotional state and voice evaluation. For example, if the user is identified as anxious, simple breathing exercises to help them relax are added to the training plan. These plans are further refined by leveraging optimization algorithms based on underlying data and by comparing them with past data sets.
[0191] The generated training plan and feedback are then sent to the device and visualized through an interactive user interface. The robot can also provide advice to the user through voice feedback.
[0192] For example, if the system detects the emotion of "anxiety" during voice training, the server will provide appropriate feedback corresponding to that emotion. By using a prompt such as, "Analyze my emotions in response to my reading and suggest feedback and exercises," the system can develop feedback tailored to the user. In this way, voice training incorporating emotion analysis can provide a more comprehensive and effective learning experience.
[0193] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0194] Step 1:
[0195] The device captures the user's voice using a microphone. The input data is an audio signal. The audio is converted to a digital format and saved as an audio file. This data, along with the user's identification information, is transmitted to the server in a secure manner.
[0196] Step 2:
[0197] The server securely stores the received audio data and analyzes it using an audio analysis library (e.g., librosa). The input data is the digital audio file sent from the terminal. Audio features such as pitch, rhythm, and volume are extracted as numerical information, and the output is an evaluation of the user's voice ability.
[0198] Step 3:
[0199] The server uses an emotion recognition engine (e.g., IBM Watson Emotion Analysis) to determine the user's emotional state from the audio data. The input data consists of audio features. The engine analyzes the intonation and rhythm patterns in the audio and outputs the emotional state as a label such as "joy" or "anxiety."
[0200] Step 4:
[0201] The server generates an individualized voice training plan based on voice evaluation and emotional state. This plan is optimized by referencing past data. Input data includes voice ability evaluation and emotional state. Output is a training plan that includes specific exercise procedures.
[0202] Step 5:
[0203] The generated voice training plan and feedback are sent to the device. The device's user interface visualizes and presents this information to the user. Feedback may also be provided via voice. The input data is the generated training plan. The output is visual or auditory information in a format easily understood by the user.
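As one non-limiting illustration of how the plan in Step 4 might be optimized by referencing past data, the following Python sketch selects the exercise with the highest average score gain in earlier sessions. The shape of the history records, the function name `best_exercise`, and the selection rule are assumptions; the description does not specify the optimization algorithm.

```python
from collections import defaultdict


def best_exercise(history):
    """Return the exercise with the highest average score gain.

    `history` is a list of (exercise, score_before, score_after) tuples,
    a hypothetical shape for the stored past data.
    """
    gains = defaultdict(list)
    for exercise, before, after in history:
        gains[exercise].append(after - before)
    # Average the gains per exercise and pick the largest.
    return max(gains, key=lambda name: sum(gains[name]) / len(gains[name]))


past_sessions = [
    ("scale practice", 60, 66),
    ("scale practice", 66, 70),
    ("breathing exercise", 60, 62),
    ("melody patterns", 62, 65),
]
print(best_exercise(past_sessions))
```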
[0204] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0205] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives, as input, prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions given in the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
[0206] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0207] [Second Embodiment]
[0208] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0209] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0210] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0211] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0212] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0213] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0214] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0215] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0216] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0217] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0218] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0219] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0220] This invention is a system that acquires voice data and provides the user with optimal voice training based on the analysis results. In implementation, the system mainly consists of a terminal for collecting voice data, a server for analysis, and the user.
[0221] First, the user records audio using their device. The device is an electronic device such as a smartphone or tablet, and the user launches a recording application. The user confirms that the microphone is properly connected to the device, presses the record button, and sings a portion of the designated song.
[0222] Once recording is complete, the device sends the audio data to the server. The audio data is converted to a digital format and securely sent to the server via the internet along with the user's identification information. The server receives this data and begins its analysis.
[0223] The server uses machine learning algorithms to analyze the received audio data. The analysis primarily measures features such as pitch, rhythm, volume, and tone quality. The server comprehensively evaluates these features and generates a numerical singing ability score.
[0224] The server then consults a database based on the evaluation score to create an optimal voice training plan for the user. For example, if the user's pitch is evaluated as unstable, the server can generate a training plan that includes scale exercises.
[0225] Finally, the server sends back the generated voice training plan and evaluation results to the terminal. The terminal visualizes this information on its user interface. The user can then review this information and use it to improve their daily voice training.
[0226] This system provides users with a personalized and efficient singing improvement program. For example, to improve pitch accuracy, the server might suggest that the user repeatedly practice specific scales or melodic patterns. Thus, this system becomes a useful tool for all users who want to improve their singing skills.
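As one non-limiting illustration of consulting a database based on the evaluation, the following Python sketch looks up an exercise for the lowest-scoring feature. The table `EXERCISE_DB`, its entries, and the function name `recommend` are hypothetical; only the pairing of unstable pitch with scale and melodic pattern practice is taken from the description.

```python
# Hypothetical table mapping a weak feature to a recommended exercise.
EXERCISE_DB = {
    "pitch": "scale and melodic pattern practice",
    "rhythm": "metronome clapping practice",
    "volume": "breath support practice",
    "tone": "resonance practice",
}


def recommend(feature_scores):
    """Pick the exercise for the lowest-scoring feature."""
    weakest = min(feature_scores, key=feature_scores.get)
    return weakest, EXERCISE_DB[weakest]


scores = {"pitch": 48, "rhythm": 75, "volume": 80, "tone": 66}
print(recommend(scores))
```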
[0227] The following describes the processing flow.
[0228] Step 1:
[0229] The user launches a recording application on their device and configures the audio input device appropriately. When the user presses the record button, recording begins, and audio data is collected via the device's microphone.
[0230] Step 2:
[0231] The user ends the recording by pressing the stop button. The device then formats the recorded audio as a digital audio file and temporarily stores the audio data in memory.
[0232] Step 3:
[0233] The device packages the stored voice data together with the associated user identification information and sends them to the server using a secure protocol (e.g., HTTPS).
[0234] Step 4:
[0235] The server receives the audio data, decodes it, and verifies its integrity. If there are no problems, the audio data is input into a machine learning algorithm to begin analysis.
[0236] Step 5:
[0237] The server performs audio analysis and extracts audio features such as pitch, rhythm, and volume. Based on this, it evaluates the user's singing ability and generates a numerical score.
[0238] Step 6:
[0239] The server consults a database and creates a voice training plan recommended to the user based on their evaluation score. This includes specific instructions for scale practice and improving expressiveness.
[0240] Step 7:
[0241] The server sends the generated evaluation results and training plan to the terminal.
[0242] Step 8:
[0243] The device displays the received data on the user interface, allowing the user to review training results and suggestions. Based on this, the user proceeds with their daily practice.
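As one non-limiting illustration of Steps 3 and 4 above, the following Python sketch packages audio data with user identification information on the device side and decodes and verifies it on the server side. The JSON field names, the use of a SHA-256 digest for the integrity check, and the function names are assumptions; the description states only that integrity is verified. The HTTPS transport itself is not shown.

```python
import base64
import hashlib
import json


def package_recording(user_id: str, audio: bytes) -> str:
    """Device side: bundle audio and user ID into a JSON payload."""
    return json.dumps({
        "user_id": user_id,
        "audio_b64": base64.b64encode(audio).decode("ascii"),
        # The digest lets the server detect corruption in transit.
        "sha256": hashlib.sha256(audio).hexdigest(),
    })


def unpack_recording(payload: str):
    """Server side: decode the payload and verify its integrity."""
    body = json.loads(payload)
    audio = base64.b64decode(body["audio_b64"])
    if hashlib.sha256(audio).hexdigest() != body["sha256"]:
        raise ValueError("audio data failed the integrity check")
    return body["user_id"], audio


payload = package_recording("user-001", b"\x00\x01\x02\x03")
user_id, audio = unpack_recording(payload)
print(user_id, len(audio))
```

A plain digest only detects accidental corruption; protection against deliberate tampering would rest on the secure protocol used for transmission.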
[0244] (Example 1)
[0245] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0246] Providing an effective voice training plan to improve individual singing abilities is challenging. Existing technologies rely heavily on subjective evaluation of voice information and uniform training plans, resulting in insufficient feedback tailored to individual user characteristics. Furthermore, developing such plans is time-consuming and labor-intensive, hindering efficient improvement of singing abilities.
[0247] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0248] In this invention, the server includes means for analyzing voice information and evaluating singing ability, means for generating an individual voice training plan based on the evaluation results, and means for providing the generated voice training plan to the user. This makes it possible to provide the user with objective, efficient, and individually optimized voice training.
[0249] "Audio information" refers to digital or analog data recorded in a form that includes the frequency, intensity, and temporal changes of sound.
[0250] The term "device" refers to a combination of hardware and software elements used to achieve a specific function or operation.
[0251] An "analysis device" refers to computing resources used to process input data and extract or analyze specific information.
[0252] "Singing ability" is an evaluation of the ability to sing with musically accurate pitch, rhythm, volume, and tone quality.
[0253] An "evaluation indicator" is a way of expressing a specific ability or performance using numerical values or a scale.
[0254] A "voice training plan" refers to a plan that includes specific practice content and steps to improve the characteristics of the subject's voice.
[0255] "User" refers to an individual or group that operates this system and is eligible to receive a training plan.
[0256] An "information storage device" refers to a medium or device that stores data and information and allows it to be retrieved as needed.
[0257] To implement this invention, the user first records audio information using a device such as a smartphone or tablet. A common recording application is installed on the device, and the microphone is configured to function correctly. The user presses the record button and sings a portion of a designated song to collect audio information. Once the recording is complete, the device converts the audio information into a digital format and sends it to a server via the internet. Encryption technology is used to ensure the security of the data during transmission.
[0258] The server can utilize speech analysis software to analyze the received audio information. Specifically, it uses the Python LibROSA library and machine learning algorithms to measure characteristics such as pitch, rhythm, volume, and sound quality, and generates evaluation metrics. These evaluation metrics are compared to standard data by a generative AI model, and the user's singing ability is quantified.
[0259] The server generates an individualized voice training plan based on the analysis results. This plan is created using various training methods and historical training data stored in the database. By referring to historical data, an optimized plan tailored to the user's characteristics is provided. For example, if the server determines that the user's pitch is unstable, it can provide a training plan that includes pitch practice and melody pattern practice.
[0260] Finally, the server sends the generated voice training plan and evaluation results back to the terminal. The terminal displays this information in its user interface, allowing the user to refer to the results and training plan. The user can then perform daily voice training based on the displayed information.
[0261] As a concrete example, a prompt message to be input to the generative AI model could be something like, "Please calculate an evaluation score using the user's voice data and generate a training plan." This system allows users to efficiently improve their singing ability.
[0262] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0263] Step 1:
[0264] The user records audio information using the device. After the user launches the recording application and confirms that the microphone is securely connected, they press the record button and sing the designated song. This input is the user's raw voice data. When recording is complete, the device converts the recorded analog audio into a digital format. The output is the digitized audio data.
[0265] Step 2:
[0266] The terminal sends digital audio data to the server. Before transmission, the audio data is encrypted along with the user's identification information. The input is digitized audio data, and the output is encrypted data transmitted to the server via the internet.
[0267] Step 3:
[0268] The server receives audio data and begins analysis. The input is encrypted audio data, which is first decrypted. The decrypted data is fed into a machine learning algorithm to extract features such as pitch, rhythm, volume, and sound quality. Audio processing libraries such as LibROSA can be used in this analysis process. The output is the feature vectors extracted from the audio data.
[0269] Step 4:
[0270] The server generates evaluation metrics based on extracted features. The input consists of features from audio data, and a generative AI model is used to compare these features with standard data, thereby quantifying the user's singing ability. The output is a numerical value representing the evaluation metric.
[0271] Step 5:
[0272] The server generates individualized voice training plans based on evaluation metrics and by referencing a database. Inputs include evaluation metrics and user characteristic data. The server selects the optimal training menu from historical data and training samples to create an individualized plan. The output is a voice training plan optimized for the user.
[0273] Step 6:
[0274] The server sends the generated voice training plan and evaluation metrics back to the terminal. The input is the voice training plan and evaluation metrics, which are transmitted to the terminal via the internet. The output is the training plan and evaluation results sent to the terminal.
[0275] Step 7:
[0276] The terminal displays the received information on the user interface. The input consists of a voice training plan and evaluation metrics, which are visualized on the screen in a user-friendly format. Users can use this output to improve their daily voice training.
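As one non-limiting illustration of the comparison with standard data in Step 4 above, the following Python sketch measures pitch deviation in cents against reference notes and maps it to a score from 0 to 100. The description assigns this comparison to a generative AI model; the arithmetic below is a simplified stand-in, and the function names, the linear scale, and the 100-cent tolerance are assumptions.

```python
import math


def cents_off(sung_hz: float, reference_hz: float) -> float:
    """Signed pitch deviation in cents (100 cents = one semitone)."""
    return 1200.0 * math.log2(sung_hz / reference_hz)


def pitch_score(sung, reference, tolerance_cents=100.0):
    """Map mean absolute deviation to a 0-100 score (illustrative scale).

    A mean deviation of 0 cents scores 100; a mean deviation of
    `tolerance_cents` or more scores 0.
    """
    deviations = [abs(cents_off(s, r)) for s, r in zip(sung, reference)]
    mean_dev = sum(deviations) / len(deviations)
    return max(0.0, 100.0 * (1.0 - mean_dev / tolerance_cents))


reference_notes = [261.63, 293.66, 329.63]  # C4, D4, E4 in Hz
sung_notes = [261.63, 293.66, 329.63]       # a perfectly matched take
print(pitch_score(sung_notes, reference_notes))
```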
[0277] (Application Example 1)
[0278] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0279] In recent years, there has been a growing demand for entertainment and learning support using home-use electronic devices. However, a challenge remains: there is a lack of effective means to improve users' singing abilities while providing individualized feedback. Conventional systems offer limited evaluation and feedback, making it difficult for users to efficiently improve their singing skills.
[0280] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0281] In this invention, the server includes means for acquiring voice information through a device, means for transmitting the acquired voice information to an information processing device for analysis, and means for providing feedback that allows the user to improve their singing ability while interacting with the home-use device. This makes it possible for the user to receive individually customized feedback in real time within their home.
[0282] "Audio information" refers to data related to the voice and singing performances emitted by users.
[0283] "Device" refers to a physical device that has the function of acquiring and transmitting audio information.
[0284] The "information processing device" refers to a system that analyzes the received voice information and performs computational processing to generate appropriate metrics and feedback.
[0285] "Analysis" refers to the process of extracting features and generating evaluation metrics using specific machine learning models or algorithms based on voice information.
[0286] "Evaluation metric" refers to a numerical or qualitative criterion regarding the singing ability and characteristics of the user calculated based on the analyzed voice information.
[0287] "Singing ability" indicates the ability and characteristics related to the user's singing, including pitch, rhythm, and timbre.
[0288] "Feedback" refers to information that includes specific points for improvement and observations provided to the user based on the analysis results.
[0289] "Household mechanical device" refers to robots and electronic devices that are used in a household environment for entertainment and learning support.
[0290] The system for implementing this invention has a series of processes for acquiring voice information, performing analysis, and providing feedback to the user. First, the device used by the user has a microphone and the function of acquiring voice information. The acquired voice information is transmitted to a server, which is an information processing device, via the Internet. At this time, the data is transmitted securely according to the security protocol.
[0291] The server uses machine learning algorithms for the analysis of voice information. Specifically, machine learning tools such as TensorFlow are used to identify features of the voice such as pitch, rhythm, volume, and timbre. The server analyzes these features and generates evaluation metrics. Based on the evaluation results, an individual singing training plan is generated. Past data held in the information storage device is also referred to in this process.
[0292] The generated feedback and training plans are then returned to the user's device via the internet. Users can interact with the home-use device and use the generated feedback to improve their singing ability. This system enables real-time feedback; for example, when a user sings a particular song, they might receive specific feedback such as, "Your pitch is slightly off. Try singing it a little slower."
[0293] By using a generative AI model for analysis, users can receive personalized feedback in real time. As an example of this process, the prompt message to the generative AI model would be, "Analyze the following audio data to evaluate the user's singing ability and generate and present an appropriate training plan."
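As an illustration of how such a prompt might be assembled, the following is a minimal sketch in Python. The disclosure does not state how the audio or its analysis results are attached to the prompt, so passing summarized evaluation metrics as text is an assumption, and the function name `build_training_prompt` and the metric names are hypothetical.

```python
def build_training_prompt(metrics: dict[str, float]) -> str:
    """Assemble a prompt for a generative AI model (hypothetical helper)."""
    # Instruction text quoted from the description above.
    instruction = (
        "Analyze the following audio data to evaluate the user's singing "
        "ability and generate and present an appropriate training plan."
    )
    # Assumption: metrics are appended as one "name: value" line each,
    # sorted by name so the prompt is stable between runs.
    lines = [f"- {name}: {value:.2f}" for name, value in sorted(metrics.items())]
    return instruction + "\n\nEvaluation metrics:\n" + "\n".join(lines)


prompt = build_training_prompt({"pitch_accuracy": 0.82, "rhythm_accuracy": 0.91})
print(prompt)
```

The resulting string would then be passed to whichever generative AI model the server uses; that call is omitted here because the disclosure does not specify an interface.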
[0294] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0295] Step 1:
[0296] The device acquires the user's voice information through the microphone. The input is the user's singing data, and the output is an audio file converted into a digital format. The device sends this audio file to the server via the internet.
[0297] Step 2:
[0298] The server first inputs the received audio file into a machine learning algorithm to analyze it. The input is the audio file, and the output is audio features (pitch, rhythm, volume, sound quality, etc.). The server uses TensorFlow to extract audio features and generate evaluation metrics.
[0299] Step 3:
[0300] Based on the generated evaluation metrics, the server creates a personalized singing training plan for each user. The input is the evaluation metrics obtained through voice analysis, and the output is the details of the training plan. The server optimizes the training plan by referring to a database of past data.
[0301] Step 4:
[0302] The server sends the generated singing training plan and evaluation metrics to the terminal. The input is the training plan and evaluation metrics, and the output is the transmission completion status to the terminal. The server sends the data to the terminal using a secure protocol.
[0303] Step 5:
[0304] The terminal visualizes the received singing training plan and evaluation metrics on the user interface. The input is the data received from the server, and the output is the feedback screen that the user sees. The terminal provides it to the user in an easy-to-understand graphical form.
[0305] Step 6:
[0306] The user practices to improve their singing ability based on the feedback while interacting with the home-use device. The input is the feedback and training plan on the terminal, and the output is the improvement of the user's singing ability. The user can adjust the pitch according to the instructions or repeat the rhythm practice.
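The feature extraction of Step 2 can be illustrated with a small sketch. The description names TensorFlow for this step; the code below instead uses plain autocorrelation and a root-mean-square calculation from the Python standard library, purely to show what "pitch" and "volume" features could look like as numbers. The function names, the search range of 80 to 1000 Hz, and the synthetic test tone are assumptions for illustration.

```python
import math


def rms_volume(samples):
    """Root-mean-square level of the signal, used here as a volume feature."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the signal and a copy delayed by `lag` samples.
        corr = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag


# Synthetic stand-in for a recording: a 200 Hz tone, 0.1 s at 8 kHz.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(800)]
features = {
    "pitch_hz": estimate_pitch_hz(tone, sample_rate),
    "volume_rms": rms_volume(tone),
}
print(features)
```

A real singing voice would need frame-by-frame analysis and a more robust pitch tracker; this sketch only handles a single steady tone.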
[0307] Furthermore, an emotion engine for estimating the user's emotion may be combined. That is, the specific processing unit 290 may estimate the user's emotion using the emotion identification model 59 and perform specific processing using the user's emotion.
[0308] The present invention is a system that combines an emotion engine for recognizing the user's emotion with a system for improving the user's singing ability. This system is mainly composed of a terminal, a server, and an emotion engine.
[0309] In use, the user first launches the application on their device and performs voice recording. The device captures the user's singing voice through the microphone and converts the audio data into a digital format. This data, along with the user's identification information, is transmitted to the server in a secure manner.
[0310] The server receives audio data using a secure protocol, decodes it, and verifies its integrity. Next, it uses machine learning algorithms and an emotion engine to analyze the received audio data. Here, the machine learning algorithms extract audio features such as pitch, rhythm, and volume, and quantify singing ability. Simultaneously, the emotion engine analyzes the emotions in the audio and recognizes the user's emotional state.
[0311] Once the analysis is complete, the server considers the features and evaluation scores obtained from the audio, as well as the user's emotional state, to design a personalized voice training plan. For example, if it detects that the user is nervous, it can suggest feedback and practice exercises to help them relax.
[0312] Next, the server sends the generated voice training plan and emotion-based feedback to the device. The device visualizes this information on the user interface and provides it to the user. The user can then carry out their daily training based on the suggestions they receive.
[0313] For example, if the emotion engine detects "anxiety" during voice training, the server not only adjusts the voice training but also suggests relaxation time, so that the system helps the user reduce stress. By considering the user's emotional state in this way, more effective improvement in singing ability can be expected. This invention is technically easy to implement and provides an efficient training environment by flexibly responding to the individual needs of the user.
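One possible way to combine the singing score and the detected emotion into a plan is sketched below. This is a rule-based illustration only; the disclosure does not specify the plan-generation logic, and the `TrainingPlan` class, the score threshold of 60, the emotion labels, and the exercise names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingPlan:
    exercises: list[str] = field(default_factory=list)
    message: str = ""


def design_plan(score: float, emotion: str) -> TrainingPlan:
    """Pick exercises from the score, then adapt them to the emotion."""
    plan = TrainingPlan()
    # Hypothetical threshold: lower scores get basic pitch work.
    if score < 60:
        plan.exercises.append("scale practice")
    else:
        plan.exercises.append("expressiveness practice")
    # When the emotion engine reports anxiety or nervousness, relaxation
    # comes first, as described in the text above.
    if emotion in ("anxiety", "nervous"):
        plan.exercises.insert(0, "breathing exercise for relaxation")
        plan.message = "Take a short relaxation break before singing."
    return plan


plan = design_plan(score=55.0, emotion="anxiety")
print(plan)
```

Keeping the emotion adjustment as a separate step after the score-based selection means the same base plan can be reused when no emotion result is available.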
[0314] The following describes the processing flow.
[0315] Step 1:
[0316] The user launches the application on their device and prepares to record. By pressing the record button, the device uses the microphone to record the user's singing voice and saves it as digital audio data.
[0317] Step 2:
[0318] The device converts the recorded audio data into a digital format and sends it to the server along with the user's identification information. The transmission is performed using a secure protocol.
[0319] Step 3:
[0320] The server verifies the integrity of the received audio data and analyzes it using a machine learning algorithm. This analysis extracts audio features (pitch, rhythm, volume) and generates a singing ability score.
[0321] Step 4:
[0322] The server uses an emotion engine to analyze emotional information within the voice data and identify the user's emotional state. For example, it determines emotions from factors such as voice tone and tempo.
[0323] Step 5:
[0324] The server combines voice characteristics and user emotions to design a personalized voice training plan. For example, it includes practice exercises tailored to a user who needs to relax.
[0325] Step 6:
[0326] The server sends the generated training plan and feedback to the user's device. Sentiment-based messages are also included as needed.
[0327] Step 7:
[0328] The device displays a training plan and feedback to the user through its user interface. The user follows the instructions received and begins their daily training.
[0329] Step 8:
[0330] The user continues practicing and returns to the cycle of recording again. The system facilitates continuous improvement through this cycle.
[0331] (Example 2)
[0332] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0333] Conventional voice training systems only quantify and evaluate a user's voice ability, making it difficult to provide training plans that take into account the user's emotional state. As a result, emotions such as tension and anxiety can affect singing ability, leading to challenges in providing effective training. Furthermore, generating training plans that meet the individual needs of each user requires more flexible and detailed analysis.
[0334] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0335] In this invention, the server includes means for analyzing voice information and evaluating voice ability, means for recognizing emotional states in the voice information using an emotion analysis device, and means for generating individual voice training plans based on the analysis results. This makes it possible to provide more effective and personalized training plans that also take into account the user's emotional state.
[0336] "Audio information" refers to data that represents the user's voice in digital format and that is the object of capture and analysis.
[0337] "Device" refers to equipment or devices used to acquire audio information, such as microphones and terminals equipped with recording functions.
[0338] A "central processing unit" refers to a system with computing power, such as a server or cloud computer, that centrally analyzes and processes audio information.
[0339] "Vocal ability" refers to characteristics that indicate a user's singing ability and vocalization ability, and is an evaluation index that is quantified through the analysis of voice information.
[0340] An "emotion analysis device" refers to a system equipped with a mechanism or algorithm to recognize emotional information from voice information, and is a technology for analyzing a user's emotional state.
[0341] A "voice training plan" is a training program or instructional guideline designed to improve a user's voice abilities, generated based on their analyzed voice abilities and emotional state.
[0342] An "information storage system" refers to a database system that stores past data and user history, and allows users to access that data as needed.
[0343] This invention is a voice training system aimed at improving a user's voice abilities, and is characterized by its inclusion of an emotion analysis function. This system mainly consists of a terminal, a central processing unit, and an emotion analysis device, and provides the user with an individualized voice training plan by acquiring voice information, analyzing it, and providing feedback.
[0344] The user first launches an application on their device and records their voice using the microphone. The device then converts the acquired voice into a digital format and transmits it to a central processing unit using a secure protocol. The central processing unit utilizes machine learning algorithms to analyze the received voice information and evaluate the user's voice abilities. It also recognizes emotions in the voice using an emotion analyzer and incorporates this into the training plan.
[0345] Specifically, the voice training plan generated by the central processing unit is adjusted based on the user's voice characteristics and emotional state. For example, if the system recognizes that the user is feeling anxious during recording, the training plan may include exercises to promote relaxation. Furthermore, the analysis results are quantified as evaluation metrics, allowing the user to track their progress.
[0346] This system is expected to enable a more personalized approach to users and to improve their voice abilities more effectively than conventional technologies. An example of a prompt input to the generative AI model is, "Analyze the singing voice recorded by the user and generate an optimal training plan based on the voice characteristics and emotional state." This allows users to receive training that is grounded in the analysis results.
[0347] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0348] Step 1:
[0349] The user launches an application on the device and starts recording voice. The user records their voice through the microphone. The device converts the voice into a digital format and prepares it as voice information along with the user's identification information. This input consists of the user's voice data and identification information, which forms the basis for transmission to the central processing unit.
[0350] Step 2:
[0351] The terminal sends the prepared audio information to the server using a secure protocol (e.g., HTTPS). The transmitted audio information becomes the server's input. The server decodes the received audio information and verifies the integrity of the data. This process checks the integrity of the audio data and prepares it for analysis.
[0352] Step 3:
[0353] The server analyzes the audio information. The input is the audio data decoded in step 2. Machine learning algorithms are used for the analysis to extract audio features (pitch, rhythm, volume, etc.). This data processing quantifies the voice ability and generates evaluation metrics. Simultaneously, an emotion analysis device operates to identify the emotional states contained in the audio. The output of this process is the voice evaluation score and emotional state data.
[0354] Step 4:
[0355] The server generates an individualized voice training plan based on the voice features and emotional state. The input is the evaluation score and emotional state obtained in step 3, from which a prompt for the generative AI model is created. For example, the prompt includes a concrete instruction such as "incorporate relaxation methods to alleviate user tension into the training plan." The output of this step is the individualized voice training plan.
[0356] Step 5:
[0357] The server sends the created voice training plan to the terminal. The terminal decodes the received plan and displays it in the user interface. Based on the displayed plan, the user can perform their daily training. The output of this final step is the voice training plan and feedback that the user receives.
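The integrity verification mentioned in Step 2 could take many forms; the disclosure does not name one. The sketch below assumes an application-level check in which the terminal attaches a SHA-256 digest to the audio and the server recomputes it on receipt. HTTPS already protects data in transit, so this is an additional, hypothetical safeguard, and the function and field names are illustrative.

```python
import hashlib
import hmac


def package_audio(audio_bytes: bytes, user_id: str) -> dict:
    """Terminal side: bundle audio, user ID, and a SHA-256 digest."""
    return {
        "user_id": user_id,
        "audio": audio_bytes,
        "sha256": hashlib.sha256(audio_bytes).hexdigest(),
    }


def verify_audio(package: dict) -> bool:
    """Server side: recompute the digest and compare it to the sent one."""
    expected = hashlib.sha256(package["audio"]).hexdigest()
    return hmac.compare_digest(expected, package["sha256"])


package = package_audio(b"\x00\x01\x02\x03", user_id="user-001")
print(verify_audio(package))

# A payload altered after packaging no longer matches its digest.
tampered = dict(package, audio=b"\x00\x01\x02\xff")
print(verify_audio(tampered))
```

A plain digest detects accidental corruption; detecting deliberate tampering would require a keyed construction such as an HMAC with a shared secret, which is outside the scope of this sketch.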
[0358] (Application Example 2)
[0359] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0360] In recent years, training systems aimed at improving users' voice abilities have become widespread. However, these systems primarily focus on evaluating abilities based on the characteristics of voice data and do not provide training that takes into account the user's inner emotional state. Therefore, there is a need to develop a system that accurately grasps the user's emotional state and effectively improves voice abilities through feedback based on that understanding.
[0361] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0362] In this invention, the server includes means for analyzing voice data and evaluating voice ability, means for analyzing the user's emotional state from the acquired voice data, and means for generating emotion-appropriate feedback based on the user's emotional state. This enables personalized voice training that takes the user's emotions into consideration.
[0363] "Audio data" refers to data obtained by digitizing an audio signal and converting it into a format that can be processed.
[0364] A "device" is hardware or a combination of hardware designed to perform a specific function.
[0365] A "central processing unit" is a computer system or server used to receive, analyze, and process data.
[0366] "Vocal ability" refers to a numerical or evaluative representation of vocal expressiveness and technique.
[0367] A "voice training plan" is a plan that combines a series of practice methods and activities to improve voice ability.
[0368] A "user" is an individual or agent who uses the system.
[0369] "Emotional state" refers to the user's internal psychological condition, analyzed based on voice data.
[0370] "Feedback" refers to information or advice provided to the user based on the analysis results.
[0371] An "information collection" is a set of data, including past data and history, that is used for analysis and optimization.
[0372] The server plays a central role in this system, analyzing the user's voice data. The user acquires their own voice using a device equipped with voice recording capabilities. For example, a household robot can function as this device. The voice data is transmitted from the terminal to the central processing unit, where it is stored as digital data.
[0373] The server uses a Python-based audio analysis library (e.g., librosa) to extract features from the voice data. Vocal ability is then numerically evaluated by a machine learning model based on these features. Furthermore, an emotion analysis engine (e.g., IBM Watson Emotion Analysis) is used to determine the emotional state from the acquired voice. This process reveals the user's inner emotions.
[0374] Next, the server generates a voice training plan that takes into account the user's emotional state and voice evaluation. For example, if the user is identified as anxious, simple breathing exercises to help them relax are added to the training plan. These plans are further refined by leveraging optimization algorithms based on underlying data and by comparing them with past data sets.
[0375] The generated training plan and feedback are then sent to the device and visualized through an interactive user interface. The robot can also provide advice to the user through voice feedback.
[0376] For example, if the system detects the emotion of "anxiety" during voice training, the server will provide appropriate feedback corresponding to that emotion. By using a prompt such as, "Analyze my emotions in response to my reading and suggest feedback and exercises," the system can develop feedback tailored to the user. In this way, voice training incorporating emotion analysis can provide a more comprehensive and effective learning experience.
[0377] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0378] Step 1:
[0379] The device captures the user's voice using a microphone. The input data is an audio signal. The audio is converted to a digital format and saved as an audio file. This data, along with the user's identification information, is transmitted to the server in a secure manner.
[0380] Step 2:
[0381] The server securely stores the received audio data and analyzes it using an audio analysis library (e.g., librosa). The input data is a digital audio file sent from the terminal. Audio features such as pitch, rhythm, and volume are extracted as numerical information, and the output is an evaluation of vocal ability.
[0382] Step 3:
[0383] The server uses an emotion recognition engine (e.g., IBM Watson Emotion Analysis) to determine the user's emotional state from the audio data. The input data consists of audio features. The engine analyzes the intonation and rhythm patterns in the audio and outputs the emotional state as a label such as "joy" or "anxiety."
[0384] Step 4:
[0385] The server generates an individualized voice training plan based on voice evaluation and emotional state. This plan is optimized by referencing past data. Input data includes voice ability evaluation and emotional state. Output is a training plan that includes specific exercise procedures.
[0386] Step 5:
[0387] The generated voice training plan and feedback are sent to the device. The device's user interface visualizes and presents this information to the user. Feedback may also be provided via voice. The input data is the generated training plan. The output is visual or auditory information in a format easily understood by the user.
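To make the "rhythm" feature of Step 2 concrete, the following sketch derives a tempo and a steadiness value from note onset times. It assumes the onset times (in seconds) have already been detected by an upstream audio analysis step, which is not shown; the function names and the choice of the coefficient of variation as a steadiness measure are assumptions.

```python
import statistics


def inter_onset_intervals(onset_times):
    """Time gaps between consecutive note onsets, in seconds."""
    return [b - a for a, b in zip(onset_times, onset_times[1:])]


def tempo_bpm(onset_times):
    """Tempo from the median gap, which tolerates a few stray onsets."""
    return 60.0 / statistics.median(inter_onset_intervals(onset_times))


def rhythm_variation(onset_times):
    """Coefficient of variation of the gaps; 0 means perfectly steady."""
    gaps = inter_onset_intervals(onset_times)
    return statistics.pstdev(gaps) / statistics.mean(gaps)


steady = [0.0, 0.5, 1.0, 1.5, 2.0]
uneven = [0.0, 0.4, 1.0, 1.4, 2.0]
print(tempo_bpm(steady), rhythm_variation(steady))
print(tempo_bpm(uneven), rhythm_variation(uneven))
```

Both values are plain numbers, so they can be fed into the vocal ability evaluation alongside pitch and volume.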
[0388] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0389] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions shown by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0390] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0391] [Third Embodiment]
[0392] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0393] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0394] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0395] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0396] The microphone 238 receives instructions and the like from the user 20 by receiving the voice uttered by the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0397] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0398] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0399] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0400] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0401] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0402] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0403] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0404] This invention is a system that acquires voice data and provides the user with optimal voice training based on the analysis results. In implementation, the system mainly consists of a terminal for collecting voice data, a server for analysis, and the user.
[0405] First, the user records audio using their device. The device is an electronic device such as a smartphone or tablet, and the user launches a recording application. The user confirms that the microphone is properly connected to the device, presses the record button, and sings a portion of the designated song.
[0406] Once recording is complete, the device sends the audio data to the server. The audio data is converted to a digital format and securely sent to the server via the internet along with the user's identification information. The server receives this data and begins its analysis.
[0407] The server uses machine learning algorithms to analyze the received audio data. The analysis primarily measures features such as pitch, rhythm, volume, and tone quality. The server comprehensively evaluates these features and generates a numerical singing ability score.
[0408] The server then consults a database based on the evaluation score to create an optimal voice training plan for the user. For example, if the user's pitch is evaluated as unstable, the server can generate a training plan that includes scale exercises.
[0409] Finally, the server sends back the generated voice training plan and evaluation results to the terminal. The terminal visualizes this information on its user interface. The user can then review this information and use it to improve their daily voice training.
[0410] This system provides users with a personalized and efficient singing improvement program. For example, to improve pitch accuracy, the server might suggest that the user repeatedly practice specific scales or melodic patterns. Thus, this system becomes a useful tool for all users who want to improve their singing skills.
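The rule "unstable pitch leads to scale exercises" can be sketched as follows. Pitch deviation is expressed in cents, where 100 cents is one equal-tempered semitone and the deviation between two frequencies is 1200 times the base-2 logarithm of their ratio. The 30-cent tolerance, the function names, and the exercise labels are hypothetical; the disclosure does not give a threshold.

```python
import math
import statistics


def cents_off(sung_hz: float, target_hz: float) -> float:
    """Signed deviation of a sung note from its target, in cents."""
    return 1200.0 * math.log2(sung_hz / target_hz)


def pitch_is_unstable(sung, targets, tolerance_cents=30.0):
    """True when the mean absolute deviation exceeds the tolerance."""
    deviations = [abs(cents_off(s, t)) for s, t in zip(sung, targets)]
    return statistics.mean(deviations) > tolerance_cents


def recommend(sung, targets):
    """Map the stability result to exercises (labels are illustrative)."""
    if pitch_is_unstable(sung, targets):
        return ["scale exercises", "melodic pattern repetition"]
    return ["maintain current practice"]


# Target notes A4 and A4; the first sung note is about a semitone sharp.
print(recommend(sung=[466.16, 440.0], targets=[440.0, 440.0]))
print(recommend(sung=[441.0, 439.0], targets=[440.0, 440.0]))
```

Working in cents rather than hertz makes the tolerance independent of register, since the same frequency error is a larger musical error on low notes than on high ones.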
[0411] The following describes the processing flow.
[0412] Step 1:
[0413] The user launches a recording application on their device and configures the audio input device appropriately. When the user presses the record button, recording begins, and audio data is collected via the device's microphone.
[0414] Step 2:
[0415] The user ends the recording by pressing the stop button. The device then formats the recorded audio as a digital audio file and temporarily stores the audio data in memory.
[0416] Step 3:
[0417] The device packetizes the stored voice data together with the associated user identification information and sends them to the server using a secure protocol (e.g., HTTPS).
[0418] Step 4:
[0419] The server receives the audio data, decodes it, and verifies its integrity. If there are no problems, the audio data is input into a machine learning algorithm to begin analysis.
[0420] Step 5:
[0421] The server performs audio analysis and extracts audio features such as pitch, rhythm, and volume. Based on this, it evaluates the user's singing ability and generates a numerical score.
[0422] Step 6:
[0423] The server consults a database and creates a voice training plan recommended to the user based on their evaluation score. This includes specific instructions for scale practice and improving expressiveness.
[0424] Step 7:
[0425] The server sends the generated evaluation results and training plan to the terminal.
[0426] Step 8:
[0427] The device displays the received data on the user interface, allowing the user to review training results and suggestions. Based on this, the user proceeds with their daily practice.
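Steps 2 and 3 of this flow, formatting the recording as a digital audio file and packaging it with the user identification information, can be sketched with the Python standard library. The choice of 16-bit mono WAV, the JSON body, the field names, and the function names are assumptions; the disclosure specifies only "a digital audio file" and a secure protocol such as HTTPS. The actual HTTPS upload is omitted.

```python
import base64
import io
import json
import math
import wave


def encode_wav(samples, sample_rate):
    """Encode floats in [-1, 1] as 16-bit mono PCM WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        frames = b"".join(
            int(max(-1.0, min(1.0, s)) * 32767).to_bytes(2, "little", signed=True)
            for s in samples
        )
        wav.writeframes(frames)
    return buf.getvalue()


def build_upload_body(wav_bytes, user_id):
    """Bundle the audio and user ID as a JSON string for upload."""
    return json.dumps({
        "user_id": user_id,
        "audio_wav_base64": base64.b64encode(wav_bytes).decode("ascii"),
    })


# Synthetic stand-in for one second of recorded singing.
samples = [0.5 * math.sin(2 * math.pi * 440.0 * n / 8000) for n in range(8000)]
wav_bytes = encode_wav(samples, 8000)
body = build_upload_body(wav_bytes, user_id="user-001")
print(len(wav_bytes), len(body))
```

Base64 inside JSON is simple to inspect but inflates the payload by about a third; a multipart upload would be a reasonable alternative for longer recordings.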
[0428] (Example 1)
[0429] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0430] Providing an effective voice training plan to improve individual singing abilities is challenging. Existing technologies rely heavily on subjective evaluation of voice information and uniform training plans, resulting in insufficient feedback tailored to individual user characteristics. Furthermore, developing such plans is time-consuming and labor-intensive, hindering efficient improvement of singing abilities.
[0431] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0432] In this invention, the server includes means for analyzing voice information and evaluating singing ability, means for generating an individual voice training plan based on the evaluation results, and means for providing the generated voice training plan to the user. This makes it possible to provide the user with objective, efficient, and individually optimized voice training.
[0433] "Audio information" refers to digital or analog data recorded in a form that includes the frequency, intensity, and temporal changes of sound.
[0434] The term "device" refers to a combination of hardware and software elements used to achieve a specific function or operation.
[0435] An "analysis device" refers to computing resources used to process input data and extract or analyze specific information.
[0436] "Singing ability" is an evaluation of the ability to sing with musically accurate pitch, rhythm, volume, and tone quality.
[0437] An "evaluation indicator" is a way of expressing a specific ability or performance using numerical values or a scale.
[0438] A "voice training plan" refers to a plan that includes specific practice content and steps to improve the characteristics of the subject's voice.
[0439] "User" refers to an individual or group that operates this system and is eligible to receive a training plan.
[0440] An "information storage device" refers to a medium or device that stores data and information and allows it to be retrieved as needed.
[0441] To implement this invention, the user first records audio information using a device such as a smartphone or tablet. A common recording application is installed on the device, and the microphone is configured to function correctly. The user presses the record button and sings a portion of a designated song to collect audio information. Once the recording is complete, the device converts the audio information into a digital format and sends it to a server via the internet. Encryption technology is used to ensure the security of the data during transmission.
[0442] The server can utilize speech analysis software to analyze the received audio information. Specifically, it uses the Python LibROSA library and machine learning algorithms to measure characteristics such as pitch, rhythm, volume, and sound quality, and generates evaluation metrics. These evaluation metrics are compared to standard data by a generative AI model, and the user's singing ability is quantified.
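The feature measurement described above can be illustrated with a minimal sketch. The text names LibROSA; to keep the example free of dependencies, this sketch swaps in a plain autocorrelation pitch estimate and an RMS volume measure written with the standard library only. The function names, the frequency search range, and the synthetic test tone are assumptions made for illustration and are not taken from the disclosure.

```python
import math

def rms_volume(samples):
    """Root-mean-square amplitude, used here as a simple proxy for volume."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_pitch_hz(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency from the strongest autocorrelation lag."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        corr = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag

# Synthetic 220 Hz tone standing in for a recorded voice (assumption: 8 kHz mono).
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 220.0 * n / SAMPLE_RATE) for n in range(2048)]

features = {
    "pitch_hz": estimate_pitch_hz(tone, SAMPLE_RATE),
    "volume_rms": rms_volume(tone),
}
```

Because the lag is an integer number of samples, the pitch estimate is only accurate to within a few hertz at this sample rate; a production system would more likely rely on a dedicated pitch tracker.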
[0443] The server generates an individualized voice training plan based on the analysis results. This plan is created using various training methods and historical training data stored in the database. By referring to historical data, an optimized plan tailored to the user's characteristics is provided. For example, if the server determines that the user's pitch is unstable, it can provide a training plan that includes pitch practice and melody pattern practice.
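As a rough illustration of how evaluation results might be mapped to a plan, the following sketch selects exercises for every metric that falls below a threshold, weakest metric first. Only the pitch exercises come from the text above; the other catalogue entries, the threshold of 70, and the function name `build_plan` are hypothetical.

```python
# Hypothetical exercise catalogue. Only the "pitch" entries are named in the text;
# the remaining entries are placeholders for illustration.
EXERCISES = {
    "pitch": ["pitch practice", "melody pattern practice"],
    "rhythm": ["clapping along with a metronome"],
    "volume": ["breath support drills"],
    "tone_quality": ["resonance humming"],
}

def build_plan(scores, threshold=70.0):
    """Return exercises for every metric scoring below the threshold, weakest first."""
    weak = sorted((m for m in scores if scores[m] < threshold), key=lambda m: scores[m])
    return [exercise for metric in weak for exercise in EXERCISES.get(metric, [])]

plan = build_plan({"pitch": 55.0, "rhythm": 82.0, "volume": 74.0, "tone_quality": 68.0})
```

In this example the plan starts with the pitch exercises because pitch has the lowest score, followed by the tone quality exercise.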
[0444] Finally, the server sends the generated voice training plan and evaluation results back to the terminal. The terminal displays this information in its user interface, allowing the user to refer to the results and training plan. The user can then perform daily voice training based on the displayed information.
[0445] As a concrete example, a prompt message to be input to the generative AI model could be something like, "Please calculate an evaluation score using the user's voice data and generate a training plan." This system allows users to efficiently improve their singing ability.
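One way such a prompt could be assembled is sketched below. The instruction sentence is the example given above; attaching the extracted metrics as JSON, and the helper name `build_prompt`, are assumptions. The sketch stops at building the text and does not call any particular generative AI service.

```python
import json

# Instruction sentence taken from the example prompt in the text.
INSTRUCTION = (
    "Please calculate an evaluation score using the user's voice data "
    "and generate a training plan."
)

def build_prompt(metrics, instruction=INSTRUCTION):
    """Combine the instruction with the measured metrics (serialized as JSON)."""
    return f"{instruction}\n\nVoice data (JSON):\n{json.dumps(metrics, sort_keys=True)}"

prompt = build_prompt({"pitch_score": 55.0, "rhythm_score": 82.0})
```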
[0446] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0447] Step 1:
[0448] The user records audio information using the device. After the user launches the recording application and confirms that the microphone is securely connected, they press the record button and sing the designated song. This input is the user's raw voice data. When recording is complete, the device converts the recorded analog audio into a digital format. The output is the digitized audio data.
[0449] Step 2:
[0450] The terminal sends digital audio data to the server. Before transmission, the audio data is encrypted along with the user's identification information. The input is digitized audio data, and the output is encrypted data transmitted to the server via the internet.
[0451] Step 3:
[0452] The server receives audio data and begins analysis. The input is encrypted audio data, which is first decrypted. The decrypted data is fed into a machine learning algorithm to extract features such as pitch, rhythm, volume, and sound quality. Audio processing libraries such as LibROSA can be used in this analysis process. The output is the feature vectors extracted from the audio data.
[0453] Step 4:
[0454] The server generates evaluation metrics based on extracted features. The input consists of features from audio data, and a generative AI model is used to compare these features with standard data, thereby quantifying the user's singing ability. The output is a numerical value representing the evaluation metric.
[0455] Step 5:
[0456] The server generates individualized voice training plans based on evaluation metrics and by referencing a database. Inputs include evaluation metrics and user characteristic data. The server selects the optimal training menu from historical data and training samples to create an individualized plan. The output is a voice training plan optimized for the user.
[0457] Step 6:
[0458] The server sends the generated voice training plan and evaluation metrics back to the terminal. The input is the voice training plan and evaluation metrics, which are retransmitted to the terminal via the internet. The output is the training plan and evaluation results sent to the terminal.
[0459] Step 7:
[0460] The terminal displays the received information on the user interface. The input consists of a voice training plan and evaluation metrics, which are visualized on the screen in a user-friendly format. Users can use this output to improve their daily voice training.
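The comparison with standard data in Step 4 can be pictured with a simple numerical stand-in. The text assigns this comparison to a generative AI model; the sketch below instead uses a deterministic formula that converts the deviation from a reference value into a 0 to 100 score, purely to show what "quantifying" a feature against standard data could look like. The reference values, tolerances, and names are hypothetical.

```python
def metric_score(measured, reference, tolerance):
    """Map the absolute deviation from the reference onto a 0-100 scale."""
    deviation = abs(measured - reference)
    return max(0.0, 100.0 * (1.0 - deviation / tolerance))

# Hypothetical standard data: feature name -> (reference value, tolerance).
STANDARD = {"pitch_hz": (220.0, 20.0), "tempo_bpm": (120.0, 30.0)}

measured = {"pitch_hz": 225.0, "tempo_bpm": 111.0}
scores = {name: metric_score(measured[name], *STANDARD[name]) for name in STANDARD}
```

A deviation equal to or larger than the tolerance yields a score of zero, so the tolerance controls how strict the evaluation is.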
[0461] (Application Example 1)
[0462] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0463] In recent years, there has been a growing demand for entertainment and learning support using home-use electronic devices. However, a challenge remains: there is a lack of effective means to improve users' singing abilities while providing individualized feedback. Conventional systems offer limited evaluation and feedback, making it difficult for users to efficiently improve their singing skills.
[0464] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0465] In this invention, the server includes means for acquiring voice information through a device, means for transmitting the acquired voice information to an information processing device for analysis, and means for providing feedback that allows the user to improve their singing ability while interacting with the home-use device. This makes it possible for the user to receive individually customized feedback in real time within their home.
[0466] "Audio information" refers to data related to the voice and singing performances emitted by users.
[0467] "Device" refers to a physical device that has the function of acquiring and transmitting audio information.
[0468] An "information processing device" refers to a system that analyzes received audio information and performs calculations to generate appropriate indicators and feedback.
[0469] "Analysis" refers to the process of extracting features from audio information using specific machine learning models or algorithms and generating evaluation metrics.
[0470] "Evaluation metrics" refer to numerical or qualitative criteria related to a user's singing ability and characteristics, calculated based on analyzed audio information.
[0471] "Singing ability" refers to the user's singing ability and characteristics, including pitch, rhythm, and tone quality.
[0472] "Feedback" refers to information that includes specific areas for improvement and points of concern, provided to users based on the analyzed results.
[0473] "Home-use device" refers to a robot or electronic device used in a home environment, for example for entertainment or learning support.
[0474] The system for implementing this invention has a series of processes for acquiring voice information, analyzing it, and providing feedback to the user. First, the device used by the user is equipped with a microphone and has the function of acquiring voice information. The acquired voice information is transmitted via the internet to a server, which is an information processing device. In this process, the data is transmitted securely using security protocols.
[0475] The server uses machine learning algorithms to analyze audio information. Specifically, machine learning tools such as TensorFlow are used to identify audio features such as pitch, rhythm, volume, and tone quality. The server analyzes these features and generates evaluation metrics. Based on the evaluation results, individual singing training plans are generated. This process also references data from historical data storage devices.
[0476] The generated feedback and training plans are then returned to the user's device via the internet. Users can interact with the home-use device and use the generated feedback to improve their singing ability. This system enables real-time feedback; for example, when a user sings a particular song, they might receive specific feedback such as, "Your pitch is slightly off. Try singing it a little slower."
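A feedback message of the kind quoted above could be derived from the pitch deviation. The sketch below measures the deviation in cents (1200 times the base-2 logarithm of the frequency ratio, a standard musical unit) and produces a short message. The 20-cent tolerance, the message wording other than the quoted example, and the function names are assumptions.

```python
import math

def cents_off(sung_hz, target_hz):
    """Pitch deviation in cents; positive means the sung note is sharp."""
    return 1200.0 * math.log2(sung_hz / target_hz)

def pitch_feedback(sung_hz, target_hz, tolerance_cents=20.0):
    """Return a short feedback message based on the pitch deviation."""
    offset = cents_off(sung_hz, target_hz)
    if abs(offset) <= tolerance_cents:
        return "Your pitch is on target."
    direction = "sharp" if offset > 0 else "flat"
    return f"Your pitch is slightly {direction}. Try singing it a little slower."

message = pitch_feedback(226.0, 220.0)
```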
[0477] By using a generative AI model for analysis, users can receive personalized feedback in real time. As an example of this process, the prompt message to the generative AI model would be, "Analyze the following audio data to evaluate the user's singing ability and generate and present an appropriate training plan."
[0478] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0479] Step 1:
[0480] The device acquires the user's voice information through the microphone. The input is the user's singing data, and the output is an audio file converted into a digital format. The device sends this audio file to the server via the internet.
[0481] Step 2:
[0482] The server first inputs the received audio file into a machine learning algorithm to analyze it. The input is the audio file, and the output is audio features (pitch, rhythm, volume, sound quality, etc.). The server uses TensorFlow to extract audio features and generate evaluation metrics.
[0483] Step 3:
[0484] The server creates an individualized singing training plan for each user based on the generated evaluation metrics. The input is the evaluation metrics obtained through voice analysis, and the output is the details of the training plan. The server optimizes the training plan by referring to past database data.
[0485] Step 4:
[0486] The server sends the generated singing training plan and evaluation metrics to the terminal. The input is the training plan and evaluation metrics, and the output is the status of successful transmission to the terminal. The server sends the data to the terminal using a secure protocol.
[0487] Step 5:
[0488] The terminal visualizes the received singing training plan and evaluation metrics on the user interface. The input is data received from the server, and the output is the feedback screen the user sees. The terminal provides this information to the user in an easy-to-understand graphical format.
[0489] Step 6:
[0490] Users practice improving their singing ability based on feedback while interacting with a home-use device. The input is feedback and training plans on the device, and the output is the user's improved singing ability. Users can adjust pitch and repeat rhythm exercises according to instructions.
[0491] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform specific processing using the user's emotions.
[0492] This invention is a system that combines a system for improving a user's singing ability with an emotion engine that recognizes the user's emotions. This system mainly consists of a terminal, a server, and an emotion engine.
[0493] In use, the user first launches the application on their device and performs voice recording. The device captures the user's singing voice through the microphone and converts the audio data into a digital format. This data, along with the user's identification information, is transmitted to the server in a secure manner.
[0494] The server receives audio data using a secure protocol, decodes it, and verifies its integrity. Next, it uses machine learning algorithms and an emotion engine to analyze the received audio data. Here, the machine learning algorithms extract audio features such as pitch, rhythm, and volume, and quantify singing ability. Simultaneously, the emotion engine analyzes the emotions in the audio and recognizes the user's emotional state.
[0495] Once the analysis is complete, the server considers the features and evaluation scores obtained from the audio, as well as the user's emotional state, to design a personalized voice training plan. For example, if it detects that the user is nervous, it can suggest feedback and practice exercises to help them relax.
[0496] Next, the server sends the generated voice training plan and emotion-based feedback to the device. The device visualizes this information on the user interface and provides it to the user. The user can then carry out their daily training based on the suggestions they receive.
[0496] For example, if the emotion engine detects "anxiety" during voice training, the server not only adjusts the voice training content but also suggests relaxation time, so that the system helps the user reduce stress. By considering the user's emotional state in this way, more effective improvement in singing ability can be expected. This invention is technically easy to implement and provides an efficient training environment by flexibly responding to the individual needs of the user.
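The emotion-dependent adjustment can be sketched as a small post-processing step on an already generated plan. The emotion labels, the relaxation step names, and the function name are hypothetical; the emotion engine itself is outside the scope of this sketch and is represented only by its output label.

```python
# Hypothetical relaxation steps and stress-related labels (assumptions).
RELAXATION_STEPS = ["relaxation time", "slow breathing exercise"]
STRESS_EMOTIONS = {"anxiety", "nervousness"}

def adjust_plan_for_emotion(plan, emotion):
    """Prepend relaxation steps when the detected emotion indicates stress."""
    if emotion in STRESS_EMOTIONS:
        return RELAXATION_STEPS + list(plan)
    return list(plan)

adjusted = adjust_plan_for_emotion(["scale exercises"], "anxiety")
```

Placing the relaxation steps first reflects the idea that the user should settle before starting the technical exercises; a real system might instead interleave them.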
[0498] The following describes the processing flow.
[0499] Step 1:
[0500] The user launches the application on their device and prepares to record. By pressing the record button, the device uses the microphone to record the user's singing voice and saves it as digital audio data.
[0501] Step 2:
[0502] The device converts the recorded audio data into a digital format and sends it to the server along with the user's identification information. The transmission is performed using a secure protocol.
[0503] Step 3:
[0504] The server verifies the integrity of the received audio data and analyzes it using a machine learning algorithm. This analysis extracts audio features (pitch, rhythm, volume) and generates a singing ability score.
[0505] Step 4:
[0506] The server uses an emotion engine to analyze emotional information within the voice data and identify the user's emotional state. For example, it determines emotions from factors such as voice tone and tempo.
[0507] Step 5:
[0508] The server combines voice characteristics with the user's emotions to design a personalized voice training plan. For example, for a user who needs to relax, the training menu will include exercises that take that into consideration.
[0509] Step 6:
[0510] The server sends the generated training plan and feedback to the user's device. Emotion-based messages are also included as needed.
[0511] Step 7:
[0512] The device displays a training plan and feedback to the user through its user interface. The user follows the instructions received and begins their daily training.
[0513] Step 8:
[0514] The user continues practicing and returns to the cycle of recording again. The system facilitates continuous improvement through this cycle.
[0515] (Example 2)
[0516] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0517] Conventional voice training systems only quantify and evaluate a user's voice ability, making it difficult to provide training plans that take into account the user's emotional state. As a result, emotions such as tension and anxiety can affect singing ability, leading to challenges in providing effective training. Furthermore, generating training plans that meet the individual needs of each user requires more flexible and detailed analysis.
[0518] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0519] In this invention, the server includes means for analyzing voice information and evaluating voice ability, means for recognizing emotional states in the voice information using an emotion analysis device, and means for generating individual voice training plans based on the analysis results. This makes it possible to provide more effective and personalized training plans that also take into account the user's emotional state.
[0520] "Audio information" refers to data that represents speech in digital format, and is the subject of capturing and analyzing the user's voice.
[0521] "Device" refers to equipment or devices used to acquire audio information, such as microphones and terminals equipped with recording functions.
[0522] A "central processing unit" refers to a system with computing power, such as a server or cloud computer, that centrally analyzes and processes audio information.
[0523] "Vocal ability" refers to characteristics that indicate a user's singing ability and vocalization ability, and is an evaluation index that is quantified through the analysis of voice information.
[0524] An "emotion analysis device" refers to a system equipped with a mechanism or algorithm to recognize emotional information from voice information, and is a technology for analyzing a user's emotional state.
[0525] A "voice training plan" is a training program or instructional guideline designed to improve a user's voice abilities, generated based on their analyzed voice abilities and emotional state.
[0526] An "information storage system" refers to a database system that stores past data and user history, and allows users to access that data as needed.
[0527] This invention is a voice training system aimed at improving a user's voice abilities, and is characterized by its inclusion of an emotion analysis function. This system mainly consists of a terminal, a central processing unit, and an emotion analysis device, and provides the user with an individualized voice training plan through the acquisition, analysis, and provision of feedback of voice information.
[0528] The user first launches an application on their device and records their voice using the microphone. The device then converts the acquired voice into a digital format and transmits it to a central processing unit using a secure protocol. The central processing unit utilizes machine learning algorithms to analyze the received voice information and evaluate the user's voice abilities. It also recognizes emotions in the voice using an emotion analyzer and incorporates this into the training plan.
[0529] Specifically, the voice training plan generated by the central processing unit is adjusted based on the user's voice characteristics and emotional state. For example, if the system recognizes that the user is feeling anxious during recording, the training plan may include exercises to promote relaxation. Furthermore, the analysis results are quantified as evaluation metrics, allowing the user to track their progress.
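Progress tracking based on the quantified evaluation metrics might look like the following sketch, which summarizes a list of session scores. The summary fields and the function name are assumptions chosen for illustration.

```python
from statistics import mean

def progress_summary(history):
    """Summarize evaluation scores recorded over successive sessions (oldest first)."""
    if not history:
        raise ValueError("at least one score is required")
    return {
        "latest": history[-1],
        "change_since_first": history[-1] - history[0],
        "average": mean(history),
    }

summary = progress_summary([60.0, 64.0, 71.0])
```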
[0530] This system is expected to enable a more personalized approach to users and improve their voice abilities more effectively compared to conventional technologies. An example of a prompt input to the generative AI model is, "Analyze the singing voice recorded by the user and generate an optimal training plan based on the voice characteristics and emotional state." This allows users to receive scientifically based and effective training.
[0531] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0532] Step 1:
[0533] The user launches an application on the device and starts recording voice. The user records their voice through the microphone. The device converts the voice into a digital format and prepares it as voice information along with the user's identification information. This input consists of the user's voice data and identification information, which forms the basis for transmission to the central processing unit.
[0534] Step 2:
[0535] The terminal sends the prepared audio information to the server using a secure protocol (e.g., HTTPS). The transmitted audio information becomes the server's input. The server decodes the received audio information and verifies the integrity of the data. This process checks the integrity of the audio data and prepares it for analysis.
[0536] Step 3:
[0537] The server analyzes the audio information. The input is the audio data decoded in step 2. Machine learning algorithms are used for the analysis to extract audio features (pitch, rhythm, volume, etc.). This data processing quantifies the voice ability and generates evaluation metrics. Simultaneously, an emotion analysis device operates to identify the emotional states contained in the audio. The output of this process is the voice evaluation score and emotional state data.
[0538] Step 4:
[0539] The server generates individualized voice training plans based on voice features and emotional states. The input is the evaluation score and emotional state obtained in step 3, from which a prompt for the generative AI model is created. Specifically, the prompt includes a concrete instruction such as "incorporate relaxation methods to alleviate user tension into the training plan." The output of this step is the individualized voice training plan.
[0540] Step 5:
[0541] The server sends the created voice training plan to the terminal. The terminal decodes the received plan and displays it in the user interface. Based on the displayed plan, the user can perform their daily training. The output of this final step is the voice training plan and feedback that the user receives.
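The integrity verification mentioned in Step 2 can be sketched as follows. Transport security is left to HTTPS as stated in the text; the sketch only shows one possible way to package the audio with the user's identification information and to check on the server that the audio arrived unchanged. The payload layout, field names, and function names are assumptions. Note that a bare SHA-256 digest detects accidental corruption but not deliberate tampering, which would need a keyed mechanism.

```python
import base64
import hashlib
import json

def pack_audio(user_id, audio_bytes):
    """Terminal side: bundle the audio with the user ID and a SHA-256 digest."""
    return json.dumps({
        "user_id": user_id,
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
        "sha256": hashlib.sha256(audio_bytes).hexdigest(),
    })

def unpack_audio(payload):
    """Server side: decode the audio and verify its digest before analysis."""
    message = json.loads(payload)
    audio = base64.b64decode(message["audio_b64"])
    if hashlib.sha256(audio).hexdigest() != message["sha256"]:
        raise ValueError("audio data failed the integrity check")
    return message["user_id"], audio

payload = pack_audio("user-001", b"\x00\x01\x02\x03")
user_id, audio = unpack_audio(payload)
```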
[0542] (Application Example 2)
[0543] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0544] In recent years, training systems aimed at improving users' voice abilities have become widespread. However, these systems primarily focus on evaluating abilities based on the characteristics of voice data and do not provide training that takes into account the user's inner emotional state. Therefore, there is a need to develop a system that accurately grasps the user's emotional state and effectively improves voice abilities through feedback based on that understanding.
[0545] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0546] In this invention, the server includes means for analyzing voice data and evaluating voice ability, means for analyzing the user's emotional state from the acquired voice data, and means for generating emotion-appropriate feedback based on the user's emotional state. This enables personalized voice training that takes the user's emotions into consideration.
[0547] "Audio data" refers to data obtained by digitizing an audio signal and converting it into a format that can be processed.
[0548] A "device" is hardware or a combination of hardware designed to perform a specific function.
[0549] A "central processing unit" is a computer system or server used to receive, analyze, and process data.
[0550] "Vocal ability" refers to a numerical or evaluative representation of vocal expressiveness and technique.
[0551] A "voice training plan" is a plan that combines a series of practice methods and activities to improve voice ability.
[0552] A "user" is an individual or agent who uses the system.
[0553] "Emotional state" refers to the user's internal psychological condition, as analyzed based on voice data.
[0554] "Feedback" refers to information or advice provided to the user based on the analysis results.
[0555] An "information collection" is a set of data, including past data and history, that is used for analysis and optimization.
[0556] The server plays a central role in this system, analyzing the user's voice data. The user acquires their own voice using a device equipped with voice recording capabilities. For example, a household robot can function as this device. The voice data is transmitted from the terminal to the central processing unit, where it is stored as digital data.
[0557] The server uses a Python-based speech analysis library (e.g., librosa) to extract features from the speech data. Speech ability is then numerically evaluated by a machine learning model based on these features. Furthermore, an emotion analysis engine (e.g., IBM Watson Emotion Analysis) is used to understand the emotional state from the acquired speech. This process reveals the user's inner emotions.
[0558] Next, the server generates a voice training plan that takes into account the user's emotional state and voice evaluation. For example, if the user is identified as anxious, simple breathing exercises to help them relax are added to the training plan. These plans are further refined by leveraging optimization algorithms based on underlying data and by comparing them with past data sets.
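The comparison with past data sets is not specified in detail. One possible reading, shown here only as an assumption, is a nearest-neighbor lookup: find the past users whose scores are closest to the current user's and reuse the plan that produced the largest improvement among them. The record layout, the sample records, and the function name are hypothetical.

```python
import math

# Hypothetical historical records: (pitch score, rhythm score), plan followed, score gain.
HISTORY = [
    {"scores": (55.0, 80.0), "plan": "scale exercises", "gain": 12.0},
    {"scores": (58.0, 78.0), "plan": "melody pattern practice", "gain": 7.0},
    {"scores": (85.0, 50.0), "plan": "metronome drills", "gain": 10.0},
]

def recommend_plan(scores, history, k=2):
    """Pick the best-performing plan among the k most similar past users."""
    nearest = sorted(history, key=lambda r: math.dist(scores, r["scores"]))[:k]
    return max(nearest, key=lambda r: r["gain"])["plan"]

recommended = recommend_plan((56.0, 79.0), HISTORY)
```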
[0559] The generated training plan and feedback are then sent to the device and visualized through an interactive user interface. The robot can also provide advice to the user through voice feedback.
[0560] For example, if the system detects the emotion of "anxiety" during voice training, the server will provide appropriate feedback corresponding to that emotion. By using a prompt such as, "Analyze my emotions in response to my reading and suggest feedback and exercises," the system can develop feedback tailored to the user. In this way, voice training incorporating emotion analysis can provide a more comprehensive and effective learning experience.
[0561] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0562] Step 1:
[0563] The device captures the user's voice using a microphone. The input data is an audio signal. The audio is converted to a digital format and saved as an audio file. This data, along with the user's identification information, is transmitted to the server in a secure manner.
[0564] Step 2:
[0565] The server securely stores the received audio data and analyzes it using an audio analysis library (e.g., librosa). The input data is a digital audio file sent from the terminal. Audio features such as pitch, rhythm, and volume are extracted as numerical information, and an output that evaluates the voice ability is generated.
[0566] Step 3:
[0567] The server uses an emotion recognition engine (e.g., IBM Watson Emotion Analysis) to determine the user's emotional state from the audio data. The input data consists of audio features. The engine analyzes the intonation and rhythm patterns in the audio and outputs the emotional state as a label such as "joy" or "anxiety."
[0568] Step 4:
[0569] The server generates an individualized voice training plan based on voice evaluation and emotional state. This plan is optimized by referencing a database of past data. Input data includes voice ability evaluation and emotional state. Output is a training plan that includes specific exercise procedures.
[0570] Step 5:
[0571] The generated voice training plan and feedback are sent to the device. The device's user interface visualizes and presents this information to the user. Feedback may also be provided via voice. The input data is the generated training plan. The output is visual or auditory information in a format easily understood by the user.
[0572] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0573] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0574] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset-type terminal 314.
[0575] [Fourth Embodiment]
[0576] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0577] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0578] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0579] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0580] The microphone 238 receives instructions from the user 20 by picking up the voice uttered by the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0581] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0582] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0583] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0584] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0585] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0586] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0587] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0588] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0589] This invention is a system that acquires voice data and provides the user with optimal voice training based on the analysis results. In implementation, the system mainly consists of a terminal for collecting voice data, a server for analysis, and the user.
[0590] First, the user records audio using their device. The device is an electronic device such as a smartphone or tablet, and the user launches a recording application. The user confirms that the microphone is properly connected to the device, presses the record button, and sings a portion of the designated song.
[0591] Once recording is complete, the device sends the audio data to the server. The audio data is converted to a digital format and securely sent to the server via the internet along with the user's identification information. The server receives this data and begins its analysis.
[0592] The server uses machine learning algorithms to analyze the received audio data. The analysis primarily measures features such as pitch, rhythm, volume, and tone quality. The server comprehensively evaluates these features and generates a numerical singing ability score.
[0593] The server then consults a database based on the evaluation score to create an optimal voice training plan for the user. For example, if the user's pitch is evaluated as unstable, the server can generate a training plan that includes scale exercises.
[0594] Finally, the server sends back the generated voice training plan and evaluation results to the terminal. The terminal visualizes this information on its user interface. The user can then review this information and use it to improve their daily voice training.
[0595] This system provides users with a personalized and efficient singing improvement program. For example, to improve pitch accuracy, the server might suggest that the user repeatedly practice specific scales or melodic patterns. Thus, this system becomes a useful tool for all users who want to improve their singing skills.
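The relationship between evaluation results and the training plan can be illustrated with a minimal sketch. It assumes per-feature scores on a 0 to 100 scale, a cut-off of 60 for a "weak" feature, and a small in-memory exercise catalogue standing in for the database; these names and values are illustrative assumptions and are not specified in this disclosure. Only the pairing of unstable pitch with scale and melodic-pattern practice is taken from the description above.

```python
# Hypothetical exercise catalogue standing in for the database.
# Only the "pitch" entries come from the description; the rest are placeholders.
EXERCISES = {
    "pitch": ["scale practice", "melodic pattern practice"],
    "rhythm": ["clapping along with a metronome"],
    "volume": ["sustained-note breath control"],
}

WEAK_THRESHOLD = 60  # assumed cut-off on an assumed 0-100 scale


def build_plan(scores):
    """Return exercises for every feature scored below the threshold."""
    plan = []
    # Visit the weakest feature first so that it leads the plan.
    for feature, score in sorted(scores.items(), key=lambda kv: kv[1]):
        if score < WEAK_THRESHOLD:
            plan.extend(EXERCISES.get(feature, []))
    return plan


def overall_score(scores):
    """Combine per-feature scores into one number (simple mean, an assumption)."""
    return sum(scores.values()) / len(scores)
```

For example, a user scored 45 for pitch, 80 for rhythm, and 70 for volume would receive only the two pitch exercises under these assumptions.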
[0596] The following describes the processing flow.
[0597] Step 1:
[0598] The user launches a recording application on their device and configures the audio input device appropriately. When the user presses the record button, recording begins, and audio data is collected via the device's microphone.
[0599] Step 2:
[0600] The user ends the recording by pressing the stop button. The device then formats the recorded audio as a digital audio file and temporarily stores the audio data in memory.
[0601] Step 3:
[0602] The device packages the stored audio data together with the associated user identification information and sends them to the server using a secure protocol (e.g., HTTPS).
[0603] Step 4:
[0604] The server receives the audio data, decodes it, and verifies its integrity. If there are no problems, the audio data is input into a machine learning algorithm to begin analysis.
[0605] Step 5:
[0606] The server performs audio analysis and extracts audio features such as pitch, rhythm, and volume. Based on this, it evaluates the user's singing ability and generates a numerical score.
[0607] Step 6:
[0608] The server consults a database and creates a voice training plan recommended to the user based on their evaluation score. This includes specific instructions for scale practice and improving expressiveness.
[0609] Step 7:
[0610] The server sends the generated evaluation results and training plan to the terminal.
[0611] Step 8:
[0612] The device displays the received data on the user interface, allowing the user to review training results and suggestions. Based on this, the user proceeds with their daily practice.
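Steps 3 and 4 above (packaging the audio data with the user identification information, then decoding it and verifying its integrity on the server) can be sketched as follows. The JSON envelope, the field names, and the use of a SHA-256 digest are assumptions made for illustration; the disclosure does not specify a message format or an integrity mechanism. The digest only detects corruption of the payload; confidentiality is assumed to come from the secure protocol (e.g., HTTPS) used for transport.

```python
import base64
import hashlib
import json


def package_recording(user_id, audio_bytes):
    """Terminal side: wrap audio and user ID in a JSON envelope with a digest."""
    return json.dumps({
        "user_id": user_id,  # hypothetical field name
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
        "sha256": hashlib.sha256(audio_bytes).hexdigest(),
    })


def unpack_recording(envelope):
    """Server side: decode the envelope and verify the audio digest."""
    message = json.loads(envelope)
    audio_bytes = base64.b64decode(message["audio_b64"])
    if hashlib.sha256(audio_bytes).hexdigest() != message["sha256"]:
        raise ValueError("audio data failed the integrity check")
    return message["user_id"], audio_bytes
```

Only when `unpack_recording` returns without raising would the audio data be passed on to analysis.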
[0613] (Example 1)
[0614] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0615] Providing an effective voice training plan to improve individual singing abilities is challenging. Existing technologies rely heavily on subjective evaluation of voice information and uniform training plans, resulting in insufficient feedback tailored to individual user characteristics. Furthermore, developing such plans is time-consuming and labor-intensive, hindering efficient improvement of singing abilities.
[0616] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0617] In this invention, the server includes means for analyzing voice information and evaluating singing ability, means for generating an individual voice training plan based on the evaluation results, and means for providing the generated voice training plan to the user. This makes it possible to provide the user with objective, efficient, and individually optimized voice training.
[0618] "Audio information" refers to digital or analog data recorded in a form that includes the frequency, intensity, and temporal changes of sound.
[0619] The term "device" refers to a combination of rigid and flexible elements used to achieve a specific function or operation.
[0620] An "analysis device" refers to computing resources used to process input data and extract or analyze specific information.
[0621] "Singing ability" is an evaluation of the ability to sing with musically accurate pitch, rhythm, volume, and tone quality.
[0622] An "evaluation indicator" is a way of expressing a specific ability or performance using numerical values or a scale.
[0623] A "voice training plan" refers to a plan that includes specific practice content and steps to improve the characteristics of the subject's voice.
[0624] "User" refers to an individual or group that operates this system and is eligible to receive a training plan.
[0625] An "information storage device" refers to a medium or device that stores data and information and allows it to be retrieved as needed.
[0626] To implement this invention, the user first records audio information using a device such as a smartphone or tablet. A common recording application is installed on the device, and the microphone is configured to function correctly. The user presses the record button and sings a portion of a designated song to collect audio information. Once the recording is complete, the device converts the audio information into a digital format and sends it to a server via the internet. Encryption technology is used to ensure the security of the data during transmission.
[0627] The server can use speech analysis software to analyze the received audio information. Specifically, it uses the Python library librosa and machine learning algorithms to measure characteristics such as pitch, rhythm, volume, and sound quality, and generates evaluation metrics. A generative AI model compares these evaluation metrics with standard data to quantify the user's singing ability.
[0628] The server generates an individualized voice training plan based on the analysis results. This plan is created using various training methods and historical training data stored in the database. By referring to historical data, an optimized plan tailored to the user's characteristics is provided. For example, if the server determines that the user's pitch is unstable, it can provide a training plan that includes pitch practice and melody pattern practice.
[0629] Finally, the server sends the generated voice training plan and evaluation results back to the terminal. The terminal displays this information in its user interface, allowing the user to refer to the results and training plan. The user can then perform daily voice training based on the displayed information.
[0630] As a concrete example, a prompt to be input to the generative AI model could be, "Please calculate an evaluation score using the user's voice data and generate a training plan." This system allows users to improve their singing ability efficiently.
[0631] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0632] Step 1:
[0633] The user records audio information using the device. After the user launches the recording application and confirms that the microphone is securely connected, they press the record button and sing the designated song. This input is the user's raw voice data. When recording is complete, the device converts the recorded analog audio into a digital format. The output is the digitized audio data.
[0634] Step 2:
[0635] The terminal sends digital audio data to the server. Before transmission, the audio data is encrypted along with the user's identification information. The input is digitized audio data, and the output is encrypted data transmitted to the server via the internet.
[0636] Step 3:
[0637] The server receives the audio data and begins analysis. The input is encrypted audio data, which is first decrypted. The decrypted data is fed into a machine learning algorithm to extract features such as pitch, rhythm, volume, and sound quality. Audio processing libraries such as librosa can be used in this analysis process. The output is the feature vectors extracted from the audio data.
[0638] Step 4:
[0639] The server generates evaluation metrics based on extracted features. The input consists of features from audio data, and a generative AI model is used to compare these features with standard data, thereby quantifying the user's singing ability. The output is a numerical value representing the evaluation metric.
[0640] Step 5:
[0641] The server generates individualized voice training plans based on evaluation metrics and by referencing a database. Inputs include evaluation metrics and user characteristic data. The server selects the optimal training menu from historical data and training samples to create an individualized plan. The output is a voice training plan optimized for the user.
[0642] Step 6:
[0643] The server sends the generated voice training plan and evaluation metrics back to the terminal. The input is the voice training plan and evaluation metrics, which are transmitted to the terminal via the internet. The output is the training plan and evaluation results sent to the terminal.
[0644] Step 7:
[0645] The terminal displays the received information on the user interface. The input consists of a voice training plan and evaluation metrics, which are visualized on the screen in a user-friendly format. Users can use this output to improve their daily voice training.
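The feature extraction in Step 3 can be illustrated with a minimal sketch that estimates two of the named features, volume and pitch, from raw samples. It is written in plain Python so that it is self-contained; it is a simplified stand-in for an audio library such as librosa and is not the analysis method of this disclosure. The autocorrelation approach, the function names, and the 80 to 1000 Hz search range are assumptions.

```python
import math


def rms_volume(samples):
    """Root-mean-square amplitude of the signal, used here as a volume measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag
```

A practical system would apply such estimators frame by frame to obtain a pitch contour over time, from which pitch stability and rhythm could be derived; this sketch treats the whole input as a single frame.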
[0646] (Application Example 1)
[0647] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0648] In recent years, there has been a growing demand for entertainment and learning support using home-use electronic devices. However, a challenge remains: there is a lack of effective means to improve users' singing abilities while providing individualized feedback. Conventional systems offer limited evaluation and feedback, making it difficult for users to efficiently improve their singing skills.
[0649] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0650] In this invention, the server includes means, implemented as a device, for acquiring voice information; means for transmitting the acquired voice information to an information processing device for analysis; and means for providing feedback that allows the user to improve their singing ability while interacting with the home-use device. This allows the user to receive individually customized feedback in real time at home.
[0651] "Audio information" refers to data related to the voice and singing performances emitted by users.
[0652] "Device" refers to a physical device that has the function of acquiring and transmitting audio information.
[0653] An "information processing device" refers to a system that analyzes received audio information and performs calculations to generate appropriate indicators and feedback.
[0654] "Analysis" refers to the process of extracting features from audio information using specific machine learning models or algorithms and generating evaluation metrics.
[0655] "Evaluation metrics" refer to numerical or qualitative criteria related to a user's singing ability and characteristics, calculated based on analyzed audio information.
[0656] "Singing ability" refers to the user's singing ability and characteristics, including pitch, rhythm, and tone quality.
[0657] "Feedback" refers to information that includes specific areas for improvement and points of concern, provided to users based on the analyzed results.
[0658] "Home-use device" refers to robots and electronic devices used in a home environment, for example for entertainment or learning support.
[0659] The system for implementing this invention has a series of processes for acquiring voice information, analyzing it, and providing feedback to the user. First, the device used by the user is equipped with a microphone and has the function of acquiring voice information. The acquired voice information is transmitted via the internet to a server, which is an information processing device. In this process, the data is transmitted securely using security protocols.
[0660] The server uses machine learning algorithms to analyze audio information. Specifically, machine learning tools such as TensorFlow are used to identify audio features such as pitch, rhythm, volume, and tone quality. The server analyzes these features and generates evaluation metrics. Based on the evaluation results, individual singing training plans are generated. This process also references data from historical data storage devices.
[0661] The generated feedback and training plans are then returned to the user's device via the internet. Users can interact with the home-use device and use the generated feedback to improve their singing ability. This system enables real-time feedback; for example, when a user sings a particular song, they might receive specific feedback such as, "Your pitch is slightly off. Try singing it a little slower."
[0662] By using a generative AI model for analysis, users can receive personalized feedback in real time. As an example of this process, the prompt message to the generative AI model would be, "Analyze the following audio data to evaluate the user's singing ability and generate and present an appropriate training plan."
[0663] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0664] Step 1:
[0665] The device acquires the user's voice information through the microphone. The input is the user's singing data, and the output is an audio file converted into a digital format. The device sends this audio file to the server via the internet.
[0666] Step 2:
[0667] The server first inputs the received audio file into a machine learning algorithm to analyze it. The input is the audio file, and the output is audio features (pitch, rhythm, volume, sound quality, etc.). The server uses TensorFlow to extract audio features and generate evaluation metrics.
[0668] Step 3:
[0669] The server creates an individualized singing training plan for each user based on the generated evaluation metrics. The input is the evaluation metrics obtained through voice analysis, and the output is the details of the training plan. The server optimizes the training plan by referring to past database data.
[0670] Step 4:
[0671] The server sends the generated singing training plan and evaluation metrics to the terminal. The input is the training plan and evaluation metrics, and the output is the status of successful transmission to the terminal. The server sends the data to the terminal using a secure protocol.
[0672] Step 5:
[0673] The terminal visualizes the received singing training plan and evaluation metrics on the user interface. The input is data received from the server, and the output is the feedback screen the user sees. The terminal provides this information to the user in an easy-to-understand graphical format.
[0674] Step 6:
[0675] Users practice improving their singing ability based on feedback while interacting with a home-use device. The input is feedback and training plans on the device, and the output is the user's improved singing ability. Users can adjust pitch and repeat rhythm exercises according to instructions.
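The pitch-related feedback described in Application Example 1 can be sketched by measuring how far a sung note deviates from its target in cents, where 100 cents equals one semitone, and mapping the deviation to a message. The cents formula is standard; the 20-cent tolerance, the function names, and the message wording (adapted from the example feedback above) are assumptions for illustration.

```python
import math


def cents_off(sung_hz, target_hz):
    """Signed pitch deviation in cents (100 cents = one semitone)."""
    return 1200.0 * math.log2(sung_hz / target_hz)


def pitch_feedback(sung_hz, target_hz, tolerance_cents=20.0):
    """Turn a pitch deviation into a short feedback message."""
    deviation = cents_off(sung_hz, target_hz)
    if abs(deviation) <= tolerance_cents:
        return "Your pitch is on target."
    direction = "sharp" if deviation > 0 else "flat"
    return f"Your pitch is slightly {direction}. Try singing it a little slower."
```

Expressing the deviation in cents rather than hertz keeps the tolerance perceptually uniform across low and high notes.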
[0676] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0677] This invention is a system that combines a system for improving a user's singing ability with an emotion engine that recognizes the user's emotions. This system mainly consists of a terminal, a server, and an emotion engine.
[0678] In use, the user first launches the application on their device and performs voice recording. The device captures the user's singing voice through the microphone and converts the audio data into a digital format. This data, along with the user's identification information, is transmitted to the server in a secure manner.
[0679] The server receives audio data using a secure protocol, decodes it, and verifies its integrity. Next, it uses machine learning algorithms and an emotion engine to analyze the received audio data. Here, the machine learning algorithms extract audio features such as pitch, rhythm, and volume, and quantify singing ability. Simultaneously, the emotion engine analyzes the emotions in the audio and recognizes the user's emotional state.
[0680] Once the analysis is complete, the server considers the features and evaluation scores obtained from the audio, as well as the user's emotional state, to design a personalized voice training plan. For example, if it detects that the user is nervous, it can suggest feedback and practice exercises to help them relax.
[0681] Next, the server sends the generated voice training plan and emotion-based feedback to the device. The device visualizes this information on the user interface and provides it to the user. The user can then carry out their daily training based on the suggestions they receive.
[0682] For example, if the emotion engine detects "anxiety" during voice training, the server not only proposes vocal adjustments but also suggests relaxation time, so that the program helps the user reduce stress. Considering the user's emotional state in this way can be expected to improve singing ability more effectively. This invention is technically easy to implement and provides an efficient training environment by responding flexibly to the individual needs of each user.
[0683] The following describes the processing flow.
[0684] Step 1:
[0685] The user launches the application on their device and prepares to record. By pressing the record button, the device uses the microphone to record the user's singing voice and saves it as digital audio data.
[0686] Step 2:
[0687] The device converts the recorded audio data into a digital format and sends it to the server along with the user's identification information. The transmission is performed using a secure protocol.
[0688] Step 3:
[0689] The server verifies the integrity of the received audio data and analyzes it using a machine learning algorithm. This analysis extracts audio features (pitch, rhythm, volume) and generates a singing ability score.
[0690] Step 4:
[0691] The server uses an emotion engine to analyze emotional information within the voice data and identify the user's emotional state. For example, it determines emotions from factors such as voice tone and tempo.
[0692] Step 5:
[0693] The server combines voice characteristics with the user's emotions to design a personalized voice training plan. For example, for a user who needs to relax, the training menu will include exercises that take that into consideration.
[0694] Step 6:
[0695] The server sends the generated training plan and feedback to the user's device. Emotion-based messages are also included as needed.
[0696] Step 7:
[0697] The device displays a training plan and feedback to the user through its user interface. The user follows the instructions received and begins their daily training.
[0698] Step 8:
[0699] The user continues practicing and returns to the cycle of recording again. The system facilitates continuous improvement through this cycle.
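Steps 5 and 6 of this flow (combining the voice-based plan with the recognized emotion and attaching an emotion-based message) can be sketched as follows. The emotion labels, the relaxation items, and the message text are placeholders chosen for illustration; the disclosure states only that relaxation-oriented exercises and feedback may be suggested when the user is detected to be nervous or anxious.

```python
# Placeholder relaxation items; the description only says that exercises
# helping the user relax may be suggested.
RELAXATION_EXERCISES = ["slow breathing exercise", "short relaxation break"]
TENSE_EMOTIONS = {"anxiety", "nervous"}  # assumed labels from the emotion engine


def adjust_plan_for_emotion(base_plan, emotion):
    """Return a new plan, with relaxation items first when the user seems tense."""
    if emotion in TENSE_EMOTIONS:
        return RELAXATION_EXERCISES + list(base_plan)
    return list(base_plan)


def emotion_message(emotion):
    """Return an optional emotion-based message to send with the plan."""
    if emotion in TENSE_EMOTIONS:
        return "You sounded a little tense. Take a moment to relax before practicing."
    return None
```

Returning a new list rather than modifying `base_plan` keeps the voice-based plan available for storage alongside the adjusted one.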
[0700] (Example 2)
[0701] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0702] Conventional voice training systems only quantify and evaluate a user's voice ability, making it difficult to provide training plans that take into account the user's emotional state. As a result, emotions such as tension and anxiety can affect singing ability, leading to challenges in providing effective training. Furthermore, generating training plans that meet the individual needs of each user requires more flexible and detailed analysis.
[0703] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0704] In this invention, the server includes means for analyzing voice information and evaluating voice ability, means for recognizing emotional states in the voice information using an emotion analysis device, and means for generating individual voice training plans based on the analysis results. This makes it possible to provide more effective and personalized training plans that also take into account the user's emotional state.
[0705] "Audio information" refers to data that represents speech in digital format, and is the subject of capturing and analyzing the user's voice.
[0706] "Device" refers to equipment or devices used to acquire audio information, such as microphones and terminals equipped with recording functions.
[0707] A "central processing unit" refers to a system with computing power, such as a server or cloud computer, that centrally analyzes and processes audio information.
[0708] "Vocal ability" refers to characteristics that indicate a user's singing ability and vocalization ability, and is an evaluation index that is quantified through the analysis of voice information.
[0709] An "emotion analysis device" refers to a system equipped with a mechanism or algorithm to recognize emotional information from voice information, and is a technology for analyzing a user's emotional state.
[0710] A "voice training plan" is a training program or instructional guideline designed to improve a user's voice abilities, generated based on their analyzed voice abilities and emotional state.
[0711] An "information storage system" refers to a database system that stores past data and user history, and allows users to access that data as needed.
[0712] This invention is a voice training system aimed at improving a user's voice abilities, and is characterized by its inclusion of an emotion analysis function. This system mainly consists of a terminal, a central processing unit, and an emotion analysis device, and provides the user with an individualized voice training plan through the acquisition, analysis, and provision of feedback of voice information.
[0713] The user first launches an application on their device and records their voice using the microphone. The device then converts the acquired voice into a digital format and transmits it to a central processing unit using a secure protocol. The central processing unit utilizes machine learning algorithms to analyze the received voice information and evaluate the user's voice abilities. It also recognizes emotions in the voice using an emotion analyzer and incorporates this into the training plan.
[0714] Specifically, the voice training plan generated by the central processing unit is adjusted based on the user's voice characteristics and emotional state. For example, if the system recognizes that the user is feeling anxious during recording, the training plan may include exercises to promote relaxation. Furthermore, the analysis results are quantified as evaluation metrics, allowing the user to track their progress.
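Because the analysis results are quantified as evaluation metrics, progress can be tracked by comparing scores across sessions. The following minimal sketch assumes that each session yields a single numerical score and that the history is kept in chronological order; the data layout and the field names are illustrative assumptions.

```python
def progress_summary(history):
    """Summarize progress from (session_label, score) pairs in chronological order."""
    if not history:
        raise ValueError("history must contain at least one session")
    scores = [score for _, score in history]
    return {
        "sessions": len(scores),
        "latest": scores[-1],
        "best": max(scores),
        # Positive values mean the user has improved since the first session.
        "change_since_first": scores[-1] - scores[0],
    }
```

For instance, a history of 58, 64, and 71 over three sessions gives a change of 13 since the first session.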
[0715] Compared with conventional technologies, this system is expected to enable a more personalized approach to users and to improve their voice abilities more effectively. An example of a prompt input to the generative AI model is, "Analyze the singing voice recorded by the user and generate an optimal training plan based on the voice characteristics and emotional state." This allows users to receive scientifically based and effective training.
[0716] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0717] Step 1:
[0718] The user launches an application on the device and starts recording voice. The user records their voice through the microphone. The device converts the voice into a digital format and prepares it as voice information along with the user's identification information. This input consists of the user's voice data and identification information, which forms the basis for transmission to the central processing unit.
[0719] Step 2:
[0720] The terminal sends the prepared audio information to the server using a secure protocol (e.g., HTTPS). The transmitted audio information becomes the server's input. The server decodes the received audio information and verifies the integrity of the data. This process checks the integrity of the audio data and prepares it for analysis.
[0721] Step 3:
[0722] The server analyzes the audio information. The input is the audio data decoded in step 2. Machine learning algorithms are used for the analysis to extract audio features (pitch, rhythm, volume, etc.). This data processing quantifies the voice ability and generates evaluation metrics. Simultaneously, an emotion analysis device operates to identify the emotional states contained in the audio. The output of this process is the voice evaluation score and emotional state data.
[0723] Step 4:
[0724] The server generates individualized voice training plans based on speech features and emotional states. The input is the evaluation score and emotional state obtained in step 3, and a generative AI model is used to create prompt sentences based on this. Specifically, the plan includes concrete instruction such as "incorporate relaxation methods to alleviate user tension into the training plan." The output of this step is the individualized voice training plan.
[0725] Step 5:
[0726] The server sends the created voice training plan to the terminal. The terminal decodes the received plan and displays it in the user interface. Based on the displayed plan, the user can perform their daily training. The output of this final step is the voice training plan and feedback that the user receives.
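Step 4 can be illustrated by a sketch that assembles a prompt from the evaluation score, the extracted features, and the emotional state obtained in Step 3. The two instruction sentences are quoted from Example 2; the function name, the field layout, and the emotion labels that trigger the relaxation instruction are assumptions. The call to the generative AI model itself is omitted, since the disclosure does not specify an interface for it.

```python
def build_training_prompt(score, features, emotion):
    """Assemble the instruction text to be passed to a generative AI model."""
    feature_text = ", ".join(
        f"{name}={value}" for name, value in sorted(features.items())
    )
    lines = [
        "Analyze the singing voice recorded by the user and generate an optimal "
        "training plan based on the voice characteristics and emotional state.",
        f"Evaluation score: {score}",
        f"Voice features: {feature_text}",
        f"Emotional state: {emotion}",
    ]
    # Assumed labels; add the relaxation instruction only for a tense user.
    if emotion in ("anxiety", "tension"):
        lines.append(
            "Incorporate relaxation methods to alleviate user tension "
            "into the training plan."
        )
    return "\n".join(lines)
```

Sorting the features by name keeps the prompt text deterministic for the same analysis result, which makes the generated plans easier to compare across sessions.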
[0727] (Application Example 2)
[0728] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0729] In recent years, training systems aimed at improving users' voice abilities have become widespread. However, these systems primarily focus on evaluating abilities based on the characteristics of voice data and do not provide training that takes into account the user's inner emotional state. Therefore, there is a need to develop a system that accurately grasps the user's emotional state and effectively improves voice abilities through feedback based on that understanding.
[0730] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0731] In this invention, the server includes means for analyzing voice data and evaluating voice ability, means for analyzing the user's emotional state from the acquired voice data, and means for generating emotion-appropriate feedback based on the user's emotional state. This enables personalized voice training that takes the user's emotions into consideration.
[0732] "Audio data" refers to data obtained by digitizing an audio signal and converting it into a format that can be processed.
[0733] A "device" is hardware or a combination of hardware designed to perform a specific function.
[0734] A "central processing unit" is a computer system or server used to receive, analyze, and process data.
[0735] "Vocal ability" refers to a numerical or evaluative representation of vocal expressiveness and technique.
[0736] A "voice training plan" is a plan that combines a series of practice methods and activities to improve voice ability.
[0737] A "user" is an individual or agent who uses the system.
[0738] "Emotional state" refers to the user's internal psychological condition, as analyzed based on voice data.
[0739] "Feedback" refers to information or advice provided to the user based on the analysis results.
[0740] An "information collection" is a set of data, including past data and history, that is used for analysis and optimization.
[0741] The server plays a central role in this system, analyzing the user's voice data. The user acquires their own voice using a device equipped with voice recording capabilities. For example, a household robot can function as this device. The voice data is transmitted from the terminal to the central processing unit, where it is stored as digital data.
[0742] The server uses a Python-based speech analysis library (e.g., librosa) to extract features from the voice data. A machine learning model then evaluates vocal ability numerically based on these features. Furthermore, an emotion analysis engine (e.g., IBM Watson Emotion Analysis) is used to determine the emotional state from the acquired voice. This process reveals the user's inner emotions.
[0743] Next, the server generates a voice training plan that takes into account the user's emotional state and voice evaluation. For example, if the user is identified as anxious, simple breathing exercises to help them relax are added to the training plan. These plans are further refined by leveraging optimization algorithms based on underlying data and by comparing them with past data sets.
[0744] The generated training plan and feedback are then sent to the device and visualized through an interactive user interface. The robot can also provide advice to the user through voice feedback.
[0745] For example, if the system detects the emotion of "anxiety" during voice training, the server will provide appropriate feedback corresponding to that emotion. By using a prompt such as, "Analyze my emotions in response to my reading and suggest feedback and exercises," the system can develop feedback tailored to the user. In this way, voice training incorporating emotion analysis can provide a more comprehensive and effective learning experience.
[0746] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0747] Step 1:
[0748] The device captures the user's voice using a microphone. The input data is an audio signal. The audio is converted to a digital format and saved as an audio file. This data, along with the user's identification information, is transmitted to the server in a secure manner.
[0749] Step 2:
[0750] The server securely stores the received audio data and analyzes it using an audio analysis library (e.g., librosa). The input data is the digital audio file sent from the device. Audio features such as pitch, rhythm, and volume are extracted as numerical values, and an output evaluating the user's vocal ability is generated.
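As a rough illustration of Step 2, the sketch below extracts two of the named features (volume as RMS amplitude, and pitch from a simple autocorrelation peak) and derives a pitch error in cents against a target note. The description names librosa for this step. The sketch deliberately uses only the Python standard library so that it runs without dependencies, and it omits rhythm. The function names, the 80 to 1000 Hz search range, and the comparison against a target note are assumptions for illustration.

```python
import math


def rms_volume(samples):
    """Root-mean-square amplitude as a simple volume measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency from the autocorrelation peak.

    Only lags corresponding to fmin..fmax are searched. The sum covers
    fewer terms at larger lags, which favors the shortest matching period.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


def pitch_error_cents(measured_hz, target_hz):
    """Signed distance from the target note in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(measured_hz / target_hz)


def evaluate_voice(samples, sample_rate, target_hz):
    """Return a small dictionary of numerical features (hypothetical format)."""
    pitch = estimate_pitch_hz(samples, sample_rate)
    return {
        "pitch_hz": pitch,
        "volume_rms": rms_volume(samples),
        "pitch_error_cents": pitch_error_cents(pitch, target_hz),
    }


if __name__ == "__main__":
    sr = 8000
    tone = [0.5 * math.sin(2 * math.pi * 200 * i / sr) for i in range(2000)]
    print(evaluate_voice(tone, sr, target_hz=196.0))
```

A production system would more likely rely on a dedicated library for robust pitch tracking on real singing, since plain autocorrelation is sensitive to noise and octave errors.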
[0751] Step 3:
[0752] The server uses an emotion recognition engine (e.g., IBM Watson Emotion Analysis) to determine the user's emotional state from the audio data. The input data consists of audio features. The engine analyzes the intonation and rhythm patterns in the audio and outputs an emotional state such as "joy" or "anxiety."
[0753] Step 4:
[0754] The server generates an individualized voice training plan based on voice evaluation and emotional state. This plan is optimized by referencing a database of past data. Input data includes voice ability evaluation and emotional state. Output is a training plan that includes specific exercise procedures.
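Step 4 might look something like the following sketch, which selects exercises from the voice evaluation, orders them using past data, and adds a relaxation exercise when the user is identified as anxious, as described earlier. The thresholds, exercise names, and the `past_gain` mapping (an assumed record of average improvement per exercise drawn from the database of past data) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Exercise:
    name: str
    minutes: int


def generate_plan(evaluation, emotion, past_gain=None):
    """Build an ordered list of exercises (illustrative rules only)."""
    past_gain = past_gain or {}
    candidates = []

    # Hypothetical thresholds on the Step 2 evaluation values.
    if abs(evaluation["pitch_error_cents"]) > 25:
        candidates.append(Exercise("pitch matching with a reference tone", 10))
    if evaluation["volume_rms"] < 0.1:
        candidates.append(Exercise("breath support and projection", 10))
    if not candidates:
        candidates.append(Exercise("maintenance scales", 5))

    # "Optimization" stand-in: exercises with larger past gains come first.
    candidates.sort(key=lambda e: past_gain.get(e.name, 0.0), reverse=True)

    # An anxious user starts with a simple breathing exercise to relax.
    if emotion == "anxiety":
        candidates.insert(0, Exercise("simple breathing exercise to relax", 3))
    return candidates


if __name__ == "__main__":
    plan = generate_plan(
        {"pitch_error_cents": 40.0, "volume_rms": 0.05},
        "anxiety",
        past_gain={"breath support and projection": 2.5},
    )
    for step, exercise in enumerate(plan, start=1):
        print(step, exercise.name, f"({exercise.minutes} min)")
```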
[0755] Step 5:
[0756] The generated voice training plan and feedback are sent to the device. The device's user interface visualizes and presents this information to the user. Feedback may also be provided via voice. The input data is the generated training plan. The output is visual or auditory information in a format easily understood by the user.
[0757] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0758] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions given in the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0759] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0760] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0761] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0762] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0763] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).
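The geometry of the emotion map described above can be sketched with plain coordinates, purely as an illustration. In this sketch the map center is the origin, positive x is the right side, and positive y is the upper side. The coordinate convention, the function names, and the returned labels are assumptions, and the actual positions of individual emotions on the map 400 are not given in the text.

```python
import math


def describe_map_position(x, y):
    """Interpret a point on the emotion map (hypothetical coordinates).

    Left side: emotions generated from reactions occurring in the brain.
    Right side: emotions induced by situational judgment.
    Upper side: pleasure. Lower side: displeasure.
    Distance from the center: how outwardly expressed the emotion is.
    """
    if x > 0:
        origin = "situation"
    elif x < 0:
        origin = "reaction"
    else:
        origin = "both"  # directly above or below the center
    if y > 0:
        valence = "pleasure"
    elif y < 0:
        valence = "displeasure"
    else:
        valence = "neutral"
    return {"origin": origin, "valence": valence, "expression": math.hypot(x, y)}


def map_distance(a, b):
    """Emotions likely to occur together are mapped close together."""
    return math.dist(a, b)


if __name__ == "__main__":
    # The 3 o'clock position lies on the right side, between pleasure
    # and displeasure.
    print(describe_map_position(1.0, 0.0))
```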
[0764] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
[0765] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0766] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
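The sketch below illustrates two ideas from this paragraph under stated assumptions. First, the user's emotion is taken to be the one with the largest emotion value. Second, a penalty term shows one possible way to encourage emotions that are close together on the map to have similar values. The disclosure does not say how this property is actually obtained during training, so the penalty, the map positions, and the emotion values here are all hypothetical stand-ins for the output of the pre-trained network.

```python
import math


def dominant_emotion(values):
    """Pick the emotion with the largest emotion value."""
    return max(values, key=values.get)


def proximity_penalty(values, positions, scale=1.0):
    """Large when emotions that are close on the map have different values.

    Each pair is weighted by exp(-distance^2 / scale), so distant pairs
    contribute little. This is an assumed regularizer, not the disclosed one.
    """
    names = sorted(values)
    total = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distance = math.dist(positions[a], positions[b])
            weight = math.exp(-(distance ** 2) / scale)
            total += weight * (values[a] - values[b]) ** 2
    return total


if __name__ == "__main__":
    # Hypothetical map positions: "reassured" and "calm" are neighbors.
    positions = {
        "reassured": (1.0, 0.2),
        "calm": (1.1, 0.1),
        "anxiety": (1.0, -1.5),
    }
    smooth = {"reassured": 0.80, "calm": 0.75, "anxiety": 0.10}
    rough = {"reassured": 0.80, "calm": 0.10, "anxiety": 0.10}
    print(dominant_emotion(smooth))
    print(proximity_penalty(smooth, positions) < proximity_penalty(rough, positions))
```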
[0767] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0768] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0769] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0770] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0771] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0772] The following types of processors can be used as hardware resources to perform specific processing. One example is a CPU, which is a general-purpose processor that functions as a hardware resource performing specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0773] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0774] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0775] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0776] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements may be added, or elements may be replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0777] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0778] The following is further disclosed regarding the embodiments described above.
[0779] (Claim 1)
[0780] A means of acquiring audio data,
[0781] A means of sending the acquired audio data to a server for analysis,
[0782] A means of analyzing transmitted audio data and evaluating singing ability,
[0783] A means for generating individual voice training plans based on analysis results,
[0784] A means of providing the generated voice training plan to the user,
[0785] A system that includes this.
[0786] (Claim 2)
[0787] The system according to claim 1, comprising means for generating an evaluation score based on features extracted from audio data.
[0788] (Claim 3)
[0789] The system according to claim 1, comprising means for optimizing individual voice training plans by referring to a past database.
[0790] "Example 1"
[0791] (Claim 1)
[0792] A means of a device that acquires audio information,
[0793] Means for transmitting acquired audio information to an analysis device,
[0794] A means for analyzing transmitted audio information and evaluating singing ability,
[0795] A means for generating individual voice training plans based on evaluation results,
[0796] A means of providing the generated voice training plan to the user,
[0797] A system that includes this.
[0798] (Claim 2)
[0799] The system according to claim 1, comprising means for generating an evaluation index based on characteristic values extracted from audio information.
[0800] (Claim 3)
[0801] The system according to claim 1, comprising means for optimizing individual voice training plans by referring to existing information storage devices.
[0802] "Application Example 1"
[0803] (Claim 1)
[0804] A means of a device that acquires audio information,
[0805] A means for transmitting acquired audio information to an information processing device for analysis,
[0806] A means for analyzing transmitted audio information and evaluating singing ability,
[0807] A means for generating individual singing training plans based on analysis results,
[0808] A means of providing the generated singing training plan to the user,
[0809] A means of providing feedback that allows users to improve their singing ability while interacting with a home-use electronic device,
[0810] A system that includes this.
[0811] (Claim 2)
[0812] The system according to claim 1, comprising means for generating an evaluation index based on features extracted from audio information.
[0813] (Claim 3)
[0814] The system according to claim 1, comprising means for optimizing individual singing training plans by referring to a past information storage device.
[0815] "Example 2 of combining an emotion engine"
[0816] (Claim 1)
[0817] A means of a device that acquires audio information,
[0818] A means for transmitting acquired audio information to a central processing unit for analysis,
[0819] A means for analyzing transmitted voice information and evaluating voice capabilities,
[0820] A means for generating individual voice training plans based on analysis results,
[0821] A means of providing the generated voice training plan to the user,
[0822] A means of recognizing emotional states in audio information using an emotion analysis device and reflecting them in a training plan,
[0823] A system that includes this.
[0824] (Claim 2)
[0825] The system according to claim 1, comprising means for generating an evaluation index based on features and emotional states extracted from audio information.
[0826] (Claim 3)
[0827] The system according to claim 1, comprising means for optimizing individual voice training plans by referring to past information storage systems.
[0828] "Application example 2 when combining with an emotional engine"
[0829] (Claim 1)
[0830] A means of acquiring audio data,
[0831] A means for transmitting acquired audio data to a central processing unit for analysis,
[0832] A means for analyzing transmitted audio data and evaluating speech ability,
[0833] A means for generating individual voice training plans based on analysis results,
[0834] A means of providing the generated voice training plan to the user,
[0835] A means of analyzing the user's emotional state from acquired audio data,
[0836] A means for generating emotion-responsive feedback based on the user's emotional state,
[0837] A system that includes this.
[0838] (Claim 2)
[0839] The system according to claim 1, comprising means for generating an evaluation value based on features extracted from audio data.
[0840] (Claim 3)
[0841] The system according to claim 1, comprising means for optimizing individual voice training plans by referring to a collection of past data.
[Explanation of symbols]
[0842]
10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots
Claims
1. A system comprising: means, provided in a device, for acquiring audio information; means for transmitting the acquired audio information to an information processing device for analysis; means for analyzing the transmitted audio information and evaluating singing ability; means for generating an individual singing training plan based on the analysis results; means for providing the generated singing training plan to the user; and means for providing feedback that allows the user to improve their singing ability while interacting with a home-use electronic device.
2. The system according to claim 1, comprising means for generating an evaluation index based on features extracted from audio information.
3. The system according to claim 1, comprising means for optimizing individual singing training plans by referring to a past information storage device.