Artificial intelligence human-machine collaboration synthesis microphone
Through a cloud-edge-device structured intelligent microphone system, it is possible to extract human voice features, train a dedicated acoustic model, and synthesize singing voices using AI. This solves the problem of real-time collaboration of traditional microphones in group singing scenarios and improves the convenience and accuracy of singing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- FOSHAN KARAYI SINGING MUSICAL INSTRUMENT CO LTD
- Filing Date
- 2025-11-26
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional microphones have limited functionality and cannot meet users' needs for multi-part ensemble and real-time collaborative singing in personalized creation and intelligent interaction, especially in group singing scenarios where there is a lack of intelligent microphone devices that enable real-time human-computer collaboration.
A smart microphone system based on a cloud-edge-device architecture was designed, including a microphone, a mobile phone, and a cloud backend. It enables human voice feature extraction, dedicated acoustic model training, and AI-synthesized singing voice. It supports real-time human-machine collaborative singing. Through communication between the microphone and the mobile phone, it enables the download and control of track files, and combines a pitch detection module to ensure accurate singing pitch.
It enables users to select songs and vocal parts on their mobile phones, supports real-time human-computer collaboration, improves the convenience and accuracy of singing, is especially suitable for group singing scenarios, reduces errors in group performances caused by off-key singing, and improves singing performance and error tolerance.
Figure CN122248319A_ABST
Description
Technical Field
[0001] This invention relates to the fields of intelligent audio equipment and artificial intelligence technology, and in particular to an intelligent microphone system that integrates AI voice synthesis, dedicated acoustic model training and real-time human-machine collaborative singing.
Background Technology
[0002] Traditional microphones primarily acquire, amplify, denoise, and transmit audio signals, following a linear "acquisition-processing-output" process for raw audio. Advances in artificial intelligence (AI) have led to smart microphones, which integrate AI with audio processing and significantly improve functionality and ease of use. For example, they employ AI algorithms for deep noise reduction, support omnidirectional sound pickup, and offer a wide pickup range. However, users increasingly demand personalized sound creation and intelligent interaction. In personal entertainment, live performances call for real-time complementary singing; in large-scale amateur music social events, multi-part, multi-instrument ensembles are needed, and when certain singers are absent or unfamiliar with the score, a similar voice must stand in during rehearsals and performances. The functionality of smart microphones therefore needs further expansion: a smart microphone device integrating acquisition, modeling, synthesis, and real-time collaborative singing is required to meet users' personalized needs and broaden application scenarios.
Summary of the Invention
[0003] The purpose of this invention is to provide an AI human-machine collaborative synthesis microphone that can extract human voice features, train a dedicated acoustic model, and synthesize AI singing voices. It also supports real-time human-machine collaboration during live performances and is particularly suitable for group singing scenarios.
[0004] To achieve the above objectives, the present invention adopts the following technical solution: This microphone system is based on a cloud-edge-device architecture, including: a microphone terminal (device), a mobile phone (edge), and a cloud backend (cloud). The microphone terminal includes a shell, power module, control module, control panel, acquisition module, AI-synthesized singing module, playback module, storage module, communication module, and pitch detection module.
[0005] The mobile phone is responsible for establishing a communication bridge between the cloud backend and the microphone end, including wireless communication with the cloud backend and wired / wireless communication with the microphone end: uploading vocals, downloading sheet music, controlling the microphone end, human-computer interaction with the microphone end, and displaying sheet music.
[0006] The cloud backend is mainly responsible for voiceprint extraction, training the dedicated acoustic model, processing musical score files, and generating track files. A track file comprises several audio unit files and a track information file. Each audio unit file is the audio of a specific word at a specific pitch, while the track information file contains the pitch, duration, intensity, and start time of each word in the lyrics.
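The track file layout described in [0006] can be pictured as a small data model. The following is only a sketch under stated assumptions: the class names `WordEvent` and `TrackFile`, the field names, the use of hertz for pitch and beats for time, and the `.wav` file names are all illustrative and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class WordEvent:
    """One entry of the track information file: a single word of the lyrics."""
    word: str
    pitch_hz: float        # standard pitch (unit is an assumption)
    duration_beats: float  # standard duration
    intensity: float       # 0.0 .. 1.0, hypothetical scale
    start_beats: float     # start time on the song timeline


@dataclass
class TrackFile:
    """Track file = audio unit files + track information file."""
    # Audio units are keyed by (word, pitch): "a specific word at a specific pitch".
    audio_units: Dict[Tuple[str, float], str]  # value: hypothetical file name
    events: List[WordEvent]

    def unit_for(self, event: WordEvent) -> str:
        """Look up the audio unit file that voices this word at its pitch."""
        return self.audio_units[(event.word, event.pitch_hz)]

    def missing_units(self) -> List[WordEvent]:
        """Events that have no matching audio unit (should be empty)."""
        return [e for e in self.events
                if (e.word, e.pitch_hz) not in self.audio_units]


track = TrackFile(
    audio_units={("la", 440.0): "la_440.wav", ("la", 494.0): "la_494.wav"},
    events=[
        WordEvent("la", 440.0, 1.0, 0.8, 0.0),
        WordEvent("la", 494.0, 1.0, 0.8, 1.0),
    ],
)
print(track.unit_for(track.events[1]))  # la_494.wav
```

Keying audio units by word and pitch means a word repeated at the same pitch can reuse one audio file, which is one plausible reading of "several audio unit files".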
[0007] The microphone end structure includes the following components: The outer casing is used to house and protect all internal components.
[0008] The power module is used to supply power to the entire device.
[0009] The control module, as the core control unit of the system, coordinates the collaborative work of various components at the microphone end and realizes key functions such as track file synthesis and processing, working mode switching, and real-time pitch calibration through interaction with the cloud backend.
[0010] The communication module is used for data interaction with the mobile phone, which can be achieved directly via wired USB or wireless means, such as WiFi or Bluetooth.
[0011] The control panel, located on the casing, includes two buttons for adjusting the microphone volume, a power button (long press to turn the device on or off), and buttons for switching between human singing mode and machine singing mode.
[0012] The acquisition module is used to collect the user's raw vocal data, as well as the live vocal data during performances. The vocal data is transmitted to the cloud backend for extraction of voiceprint features.
[0013] The AI-generated singing voice module is used to synthesize a singing voice with user voiceprint characteristics based on the song's timeline and the audio units, note pitch, duration, intensity, and start time information contained in the song file.
[0014] The playback module includes the microphone's built-in speaker, which plays audio directly. The audio stream to the playback module has two channels: a vocal channel and a background channel. The vocal channel contains synthesized vocals or captured live vocals. The background channel contains an audio stream directly provided by the phone, which can be accompaniment, background music, etc.
[0015] The storage module is used to store the data collected by the acquisition module, the downloaded track files, the data of the AI-synthesized vocals, and the background audio stream files.
[0016] The pitch detection module analyzes the live vocal data captured by the acquisition module to obtain the pitch and duration of the user's singing, and compares it with the standard pitch and duration in the song file. When a mismatch is detected, the control module switches the data in the vocal channel of the playback module to synthesized vocal data to ensure accurate pitch during the live performance. When the deviation between the live vocal data and the standard value returns to within the threshold range, the control module automatically switches the data in the vocal channel back to the captured live vocal data.
[0017] Microphone system workflow: 1. The control module directs the acquisition module to collect the user's raw voice data, which is transmitted to the mobile phone via the communication module and uploaded to the cloud backend by the mobile phone. The mobile phone then sends a modeling command to the cloud backend, triggering it to extract voiceprint features such as spectrum, cepstrum, and pitch using AI algorithms, train a user-specific acoustic model, bind the model parameters to the user ID, and store them in the cloud backend database.
[0018] 2. Users select songs and vocal parts on their mobile phones.
[0019] 3. The mobile phone sends the song title, vocal part, and user ID to the cloud backend.
[0020] 4. Based on the user's dedicated acoustic model and the lyrics of the selected vocal part of the song, the cloud backend generates a unique track file for that user ID and pushes it to the mobile phone. The track file includes several audio unit files and a track information file.
[0021] The above process specifically includes: First, the cloud backend generates a vocal stream for the selected vocal part of the song, including the words, their pitch, and their start time. The stream is then input into the dedicated acoustic model corresponding to the user ID to obtain audio unit files with the user's voiceprint characteristics, that is, an audio file for each word in the lyrics.
[0022] In addition, the cloud backend generates the track information file for the selected vocal part of the song. The track information file contains the pitch, duration, intensity, and start time of each word in the lyrics.
[0023] 5. The mobile phone receives the track files generated by the cloud backend and downloads them to the microphone, storing them in the microphone's storage module.
[0024] 6. Users select whether background sound effects are needed on their phones. If background sound effects are needed, users can first select the corresponding background sound effects for the track on their phones, then generate a background audio stream file and send it to the microphone. Users can also choose to create a custom background audio stream file on their phones, based on the audio units of the track file, the start time and duration of each word, and sequentially overlay drum beats, ambient sounds, and harmony layers along the timeline to generate a personalized background audio stream file. The background audio stream file is downloaded to the microphone and stored in the microphone's storage module.
[0025] 7. Users can select either human singing mode or machine singing mode via the control panel.
[0026] 8. During the singing phase, the mobile phone displays the corresponding sheet music synchronously, and the microphone performs synthesis and playback. In machine singing mode, the microphone performs real-time vocal synthesis; in human singing mode, the microphone performs real-time vocal synthesis and pitch correction, achieving human-machine collaborative singing.
[0027] The microphone playback module uses dual channels: one for vocals and one for background. The vocal channel contains either synthesized vocals or captured live vocals. In machine singing mode, the AI-synthesized singing module synthesizes vocals based on the track file and loads them into the vocal channel. In human singing mode, the microphone captures live vocals in real time while the AI-synthesized singing module also synthesizes vocals in real time, but the data loaded into the vocal channel is the live vocals.
[0028] The specific process includes: In machine singing mode: The AI-synthesized singing module at the microphone reads the track file from the storage module, extracts the audio unit file of the corresponding word according to the start time of each word in the lyrics of the track information file, and synthesizes it into a singing voice according to the start time and duration.
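The timeline assembly in [0028] can be sketched as placing each word's audio unit at its start time and trimming it to its duration. This is an illustration only: the function name `assemble_vocals`, the in-memory sample lists, the use of seconds for time, and the tiny sample rate are assumptions made for readability, not details from the patent.

```python
from typing import Dict, List, Tuple

SAMPLE_RATE = 8  # samples per second; tiny on purpose so the result is easy to read


def assemble_vocals(events: List[Tuple[str, float, float]],
                    units: Dict[str, List[float]]) -> List[float]:
    """Place each word's audio unit on the timeline.

    events: (word, start_seconds, duration_seconds), as read from the
            track information file.
    units:  word -> samples of its audio unit (hypothetical in-memory form).
    """
    total = max(start + dur for _, start, dur in events)
    out = [0.0] * round(total * SAMPLE_RATE)
    for word, start, dur in events:
        first = round(start * SAMPLE_RATE)
        length = round(dur * SAMPLE_RATE)
        samples = units[word][:length]       # trim a unit that is too long
        for i, sample in enumerate(samples):  # a short unit leaves silence
            if first + i < len(out):
                out[first + i] += sample
    return out


units = {"do": [1.0, 1.0, 1.0, 1.0], "re": [2.0, 2.0]}
events = [("do", 0.0, 0.25), ("re", 0.5, 0.5)]
vocals = assemble_vocals(events, units)
print(vocals)  # [1.0, 1.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0]
```

A real device would also apply the intensity value and smooth the joins between units; both are left out here to keep the placement logic visible.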
[0029] Meanwhile, if background sound effects are needed, the microphone control module reads the background audio stream file from the storage module and plays it in a dual-channel manner along the timeline and the synthesized vocals to form a complete music: that is, the vocal channel plays the synthesized vocals, and the background channel plays the background audio stream file along the timeline, with background sound effects superimposed.
[0030] In human singing mode: The playback module plays the live vocals captured by the acquisition module in real time. Simultaneously, if background sound effects are needed, the microphone control module reads the background audio stream file from the storage module and plays it along the timeline with the live vocals in a dual-channel format to create a complete soundtrack: the vocal channel plays the captured vocals, while the background channel plays the background audio stream file along the timeline, with background sound effects superimposed.
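Dual-channel playback along a shared timeline can be illustrated by pairing vocal and background samples frame by frame. The sketch below assumes both streams are already at the same sample rate; the function names and the silence padding are illustrative choices, not part of the patent.

```python
from itertools import zip_longest
from typing import Iterable, List, Tuple


def dual_channel_frames(vocal: Iterable[float],
                        background: Iterable[float]) -> List[Tuple[float, float]]:
    """Pair vocal and background samples along one shared timeline.

    Each frame is (vocal_sample, background_sample). The shorter stream is
    padded with silence so both channels cover the whole timeline.
    """
    return list(zip_longest(vocal, background, fillvalue=0.0))


def frames_without_background(vocal: Iterable[float]) -> List[Tuple[float, float]]:
    """No background selected: the background channel stays silent."""
    return dual_channel_frames(vocal, [])


frames = dual_channel_frames([0.5, 0.4], [0.1, 0.1, 0.1])
print(frames)  # [(0.5, 0.1), (0.4, 0.1), (0.0, 0.1)]
```

Keeping the two channels separate, rather than summing them, is what lets the control module swap the vocal source without touching the background stream.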
[0031] In human singing mode, the microphone performs real-time vocal synthesis while simultaneously activating the pitch detection module for pitch monitoring.
[0032] Pitch monitoring works as follows: In human singing mode, the acquisition module captures live vocals while the pitch detection module checks their pitch and rhythm. If the pitch and duration do not match those of the corresponding lyrics in the track file, the singing is considered off-key and a pitch correction switching prompt is issued immediately. If they match, a normal pitch prompt is issued immediately.
[0033] When the pitch detection module prompts for pitch correction switching, the synthesized vocals are loaded into the vocal channel, and the singing is marked as machine-generated. When the pitch detection module prompts for normal pitch, live vocals are loaded into the vocal channel, and the singing is marked as human-generated. The recorded markings are transmitted to the mobile phone via the communication module for display.
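The switching and marking behaviour in [0033] can be sketched as a mapping from the pitch detector's verdicts to vocal channel sources. The per-word granularity, the function names, and the string marks "human" and "machine" are assumptions for illustration; the patent does not specify how often the verdict is evaluated.

```python
from typing import List


def select_vocal_sources(off_key_flags: List[bool]) -> List[str]:
    """Map the pitch detector's per-word verdicts to vocal channel sources.

    True  (off-key) -> "machine": synthesized vocals are loaded.
    False (in tune) -> "human":   live vocals are loaded.
    """
    return ["machine" if off_key else "human" for off_key in off_key_flags]


def substitution_count(marks: List[str]) -> int:
    """Number of words the machine sang in place of the user."""
    return marks.count("machine")


# Hypothetical verdicts for a five-word phrase: the user slips on words 3 and 4.
marks = select_vocal_sources([False, False, True, True, False])
print(marks)                      # ['human', 'human', 'machine', 'machine', 'human']
print(substitution_count(marks))  # 2
```

The list of marks corresponds to what the microphone sends to the phone for display, and a substitution count of zero over a whole song would mean the user sang every word within tolerance.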
[0034] The off-key judgment can be set as follows: if the pitch or duration of the collected live vocals deviates from the standard value in the track file by more than the error threshold, the singing is judged off-key. The pitch error threshold can be set as |measured pitch - standard pitch| / standard pitch < 5‰ (0.5%); if this is not met, the pitch is off-key. The duration threshold can be set as |measured duration - standard duration| < 0.25 beats; if this is not met, the duration is off-key.
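The threshold test in [0034] translates directly into a small predicate. In this sketch the deviations are taken as absolute values, so that singing flat or short is treated the same as singing sharp or long, and either deviation alone is enough to count as off-key; both readings are assumptions. The function name and the hertz unit are also illustrative.

```python
PITCH_TOLERANCE = 0.005     # 5 per mille relative pitch error
DURATION_TOLERANCE = 0.25   # beats


def is_off_key(measured_pitch: float, standard_pitch: float,
               measured_beats: float, standard_beats: float) -> bool:
    """Return True when pitch or duration falls outside its threshold."""
    pitch_error = abs(measured_pitch - standard_pitch) / standard_pitch
    duration_error = abs(measured_beats - standard_beats)
    pitch_ok = pitch_error < PITCH_TOLERANCE
    duration_ok = duration_error < DURATION_TOLERANCE
    return not (pitch_ok and duration_ok)


print(is_off_key(441.0, 440.0, 1.0, 1.0))  # False: about 0.23% sharp
print(is_off_key(445.0, 440.0, 1.0, 1.0))  # True: about 1.1% sharp
print(is_off_key(440.0, 440.0, 1.5, 1.0))  # True: half a beat too long
```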
[0035] The AI human-machine collaborative synthesis microphone designed in this invention has the following advantages: 1. Convenient selection of suitable songs and vocal parts for singers. The system can extract human voice features, train a dedicated acoustic model, and synthesize AI-generated vocals. Singers can select songs and vocal parts on their mobile phones and listen to the AI-synthesized personalized voice with their own vocal characteristics singing the song. This allows them to easily preview whether they are suitable for singing a certain type of song or a certain vocal part. They can also play the AI-synthesized personalized voice with their own vocal characteristics a cappella or backing vocals with different background sound effects at the microphone end to preview the singing effect.
[0036] 2. Convenient for singing practice and individual rehearsals. When practicing singing, the singer selects the song and vocal part on their phone, and the corresponding sheet music for that part is displayed on the phone. This allows users to sight-sing while looking at the sheet music on their phone. If the microphone is set to machine-sing mode, the user learns to sing while listening to their own unique voice. Once proficient, the user can switch to human singing mode, which plays a mix of human and machine singing. If the human singer goes off-key, it automatically switches back to machine singing, displaying a machine singing substitution indicator on the corresponding sheet music on the phone. When the machine singing substitution indicator reaches 0 for the entire song, it means the user can sing the song perfectly. This is also particularly suitable for individual rehearsals during group performances, allowing the user to look at their own vocal part's sheet music, listen to their own singing, and even receive feedback while singing.
[0037] 3. Supports real-time human-machine collaborative singing for live performances, especially suitable for group singing scenarios. The microphone has a built-in speaker, eliminating the need for separate sound equipment. When a singer goes off-key, it automatically switches to machine singing, producing a synthesized version of the singer's voice. This eliminates concerns about errors during group performances, improving vocal expressiveness and error tolerance.
Attached Figure Description
[0038] Figure 1 is a system architecture block diagram showing how the modules of the AI human-machine collaborative microphone system work together in this embodiment.
Figure 2 is a schematic diagram of the workflow for training a dedicated acoustic model in the AI human-machine collaborative microphone system, based on the cloud-edge-device architecture, in this embodiment.
Figure 3 is a schematic diagram of the workflow for human-machine collaborative singing in the AI human-machine collaborative microphone system in this embodiment.
Detailed Implementation
[0039] Figure 1 shows the system architecture block diagram of the AI human-machine collaborative microphone system in this embodiment. The modules work together as follows: after the microphone is powered on, the power module supplies power to all components, and the control module automatically initializes the acquisition module, communication module, and storage module.
[0040] Control Module: As the core control unit of the system, the control module coordinates the collaborative work of various components at the microphone end. Through interaction with the cloud backend, it realizes key functions such as track file synthesis and processing, working mode switching, and real-time pitch calibration.
[0041] Communication module: This module enables data interaction with the mobile phone via wired USB or wireless methods, such as WiFi or Bluetooth. In this embodiment, the mobile phone and microphone establish a data connection via wired USB.
[0042] The functions of each button on the control panel are responded to by the control module: Press and hold the power button for 3 seconds to trigger the power-on / power-off command; press the power button briefly to switch the play / pause state; press the human singing mode button or machine singing mode button to immediately switch the corresponding working mode and start or stop the acquisition module and pitch detection module.
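The button responses in [0042] amount to a small state machine. The sketch below is a hypothetical model: the class name `ControlPanel`, the default mode at power-on, and the choice to ignore short presses and mode buttons while powered off are assumptions not stated in the patent.

```python
LONG_PRESS_SECONDS = 3.0


class ControlPanel:
    """Hypothetical model of the control module's button responses."""

    def __init__(self) -> None:
        self.powered = False
        self.playing = False
        self.mode = "machine"  # default mode is an assumption

    def power_button(self, held_seconds: float) -> None:
        if held_seconds >= LONG_PRESS_SECONDS:
            self.powered = not self.powered   # long press: power on / off
            self.playing = False
        elif self.powered:
            self.playing = not self.playing   # short press: play / pause

    def mode_button(self, mode: str) -> None:
        if mode not in ("human", "machine"):
            raise ValueError(f"unknown mode: {mode}")
        if self.powered:
            self.mode = mode

    @property
    def detection_active(self) -> bool:
        """Acquisition and pitch detection run only in human singing mode."""
        return self.powered and self.mode == "human"


panel = ControlPanel()
panel.power_button(3.2)      # long press: power on
panel.mode_button("human")   # switch to human singing mode
panel.power_button(0.2)      # short press: start playback
print(panel.powered, panel.playing, panel.detection_active)  # True True True
```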
[0043] Acquisition module: Used to collect the user's raw vocal data, as well as the live vocal data collected during performances. The vocal data is then used to extract voiceprint features in the cloud.
[0044] AI-generated singing voice module: Based on the timeline, it synthesizes the audio units, note pitch, duration, intensity, and start time information contained in the track file to create a singing voice with the user's voiceprint characteristics, which is called a synthesized singing voice.
[0045] Playback Module: Includes the microphone's built-in speakers for direct audio playback. The audio stream to the playback module has two channels: a vocal channel and a background channel. The vocal channel contains synthesized vocals or captured live vocals. The background channel contains an audio stream directly provided by the phone, which can be accompaniment, background music, etc.
[0046] Storage module: Used to store the collected data, downloaded track files, AI-synthesized vocal data, and background audio stream files.
[0047] The pitch detection module is responsible for analyzing the pitch and duration data of the live vocals captured by the acquisition module in real time and comparing it with the standard pitch and duration data in the song file. When a mismatch is detected, the control module is triggered to switch the data in the vocal channel of the playback module to synthesized vocals to ensure accurate pitch during the live performance. When the deviation between the live vocals and the standard values returns to within the threshold range, the control module automatically switches the data in the vocal channel back to the captured live vocals.
[0048] Figure 2 illustrates the workflow for training a dedicated acoustic model under the cloud-edge-device architecture in this embodiment. Taking a user training a dedicated model with this system as an example, the specific steps are as follows: The user presses and holds the microphone power button to turn on the device, and a connection is established between the microphone and the phone. The user then presses the start button for the acquisition module on the phone, and the acquisition module begins collecting 1-2 minutes of the user's raw vocal data, such as a specified melody sung a cappella.
[0049] The control module receives the collected raw human voice data and transmits the data to the bound mobile phone through the communication module. The mobile phone then uploads the data to the cloud backend via wireless communication.
[0050] Users send modeling instructions to the cloud backend from their mobile devices. After receiving the instructions, the cloud backend uses AI algorithms to extract voiceprint features such as the frequency spectrum, cepstrum, and pitch.
[0051] The cloud backend trains and generates a unique acoustic model based on the extracted voiceprint features, binds the model to the user ID, and stores it in the cloud database to complete the model training. After training is completed, the mobile phone receives a "model training successful" notification from the cloud backend.
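The patent does not say which algorithms extract the spectrum, cepstrum, and pitch features. As one illustration of the pitch feature only, the sketch below uses plain autocorrelation, a classical technique chosen here for simplicity and not attributed to the patent. The function name, the 80-500 Hz search range, and the synthetic test tone are all assumptions.

```python
import math
from typing import List


def estimate_pitch(samples: List[float], sample_rate: int,
                   fmin: float = 80.0, fmax: float = 500.0) -> float:
    """Estimate the fundamental frequency with plain autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the signal and a copy of itself shifted by `lag`.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic test tone standing in for a recorded voice: 200 Hz at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]
print(estimate_pitch(tone, SAMPLE_RATE))  # 200.0
```

Real singing is far less regular than a sine tone, so a production system would work frame by frame and handle unvoiced segments; this sketch only shows the core idea of finding the lag at which the waveform repeats.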
[0052] Figure 3 illustrates the workflow of human-machine collaborative singing in this embodiment. Taking a user singing a specific part of a song with this system as an example, the specific implementation process is as follows:
(1) Music preparation stage
Users open the corresponding application on their mobile phones, search for the target song, select the vocal part to be sung (such as the alto part), and click "Generate Custom Track".
[0053] The mobile phone sends the song title, selected vocal part information, and user ID to the cloud backend.
[0054] After receiving the information, the cloud backend calls the exclusive acoustic model bound to the user ID, and combines it with the lyrics of the target song to first generate the vocal syllable stream for that part, including the words, their pitch, and start and end times. Then, the vocal syllable stream is input into the exclusive acoustic model to obtain the vocal syllable stream with the user's voiceprint characteristics. Finally, it combines the musical score for that part to generate a track file containing audio units, pitch, and duration.
[0055] The cloud backend pushes the generated track file to the mobile phone. After the mobile phone receives the file, it downloads the track file to the storage module on the microphone via wired USB. Once the storage is complete, the indicator light on the microphone illuminates, indicating that the track is ready.
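The request that the phone sends to the cloud backend in the preparation stage carries three pieces of information: song title, vocal part, and user ID. A minimal sketch of such a message follows; the JSON encoding, the field names, and the function name `build_track_request` are hypothetical, since the patent does not define a wire format.

```python
import json


def build_track_request(song_title: str, vocal_part: str, user_id: str) -> str:
    """Serialize the three fields the phone sends to the cloud backend."""
    if not (song_title and vocal_part and user_id):
        raise ValueError("song title, vocal part and user ID are all required")
    payload = {
        "song_title": song_title,   # hypothetical field names
        "vocal_part": vocal_part,
        "user_id": user_id,
    }
    return json.dumps(payload, sort_keys=True)


request = build_track_request("Example Song", "alto", "user-001")
print(request)
```

The user ID is what lets the backend find the dedicated acoustic model bound to that user, so the sketch rejects a request that omits it.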
[0056] (2) Implementation of machine singing mode
When the user presses the "Machine Singing Mode" button on the control panel, the control module receives the command, starts the AI-synthesized singing module, and shuts down the acquisition module and pitch detection module. The AI-synthesized singing module reads the track file from the storage module, extracts the corresponding audio unit files according to the start time of each word in the lyrics of the track information file, and synthesizes them into a vocal file based on the start time and duration.
[0057] Meanwhile, if background sound effects are needed, the microphone control module reads the background audio stream file from the storage module, and the playback module plays the complete music through its built-in speakers in a dual-channel manner according to the timeline and the synthesized vocals: that is, the vocal channel plays the synthesized vocals, and the background channel plays the background audio stream file according to the timeline, with background sound effects superimposed.
[0058] Users can sing along or rehearse along with the song.
[0059] (3) Implementation of human singing mode
When the user presses the "Human Singing Mode" button on the control panel, the control module receives the command and simultaneously activates the acquisition module, the AI-synthesized singing module, and the pitch detection module. The playback module plays the live audio captured by the acquisition module in real time. Simultaneously, if background sound effects are needed, the microphone control module reads the background audio stream file from the storage module and plays it along the timeline with the live audio in a dual-channel format to create a complete soundtrack: the vocal channel plays the captured vocals, while the background channel plays the background audio stream file along the timeline, overlaid with background sound effects.
[0060] In human singing mode, while the microphone synthesizes the vocals in real time, it also activates the pitch detection module to monitor pitch accuracy. The pitch detection module analyzes the vocal data collected by the acquisition module in real time, obtains the pitch and duration of the user's singing, and compares them with the standard pitch and duration in the track file. When the relative pitch error of the live vocals exceeds 0.5% or the duration deviation exceeds 0.25 beats compared to the standard value, the singing is determined to be out of tune, and the pitch detection module immediately sends a switching command to the control module.
[0061] After receiving the switching command, the control module switches the data in the vocal channel of the playback module to synthesized vocals to ensure accurate pitch during live performances.
[0062] When the pitch detection module detects that the relative pitch error between the live vocals and the standard value has recovered to within 0.5% and the duration deviation is ≤0.25 beats, it sends a recovery command to the control module, which then automatically switches the vocal channel data back to the collected live vocals.
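To put the 0.5% pitch threshold in musical terms, a relative frequency error can be converted to cents, where one equal-tempered semitone is 100 cents. The conversion is standard; the function name below is illustrative.

```python
import math


def relative_error_to_cents(relative_error: float) -> float:
    """Convert a relative frequency error (0.005 = 0.5%) to cents."""
    return 1200.0 * math.log2(1.0 + relative_error)


cents = relative_error_to_cents(0.005)
print(round(cents, 2))  # 8.63
```

A 0.5% tolerance therefore allows the singer to drift by less than a tenth of a semitone before the vocal channel switches to the synthesized voice.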
[0063] (4) End of performance
After the user finishes singing, a short press of the power button switches the control module to pause mode, and the acquisition module, AI-synthesized singing module, and pitch detection module stop working; a long press of the power button triggers the shutdown command, and the system shuts down.
Claims
1. An AI human-machine collaborative synthesis microphone system, characterized in that it is based on a cloud-edge-device architecture and includes a microphone end (device), a mobile phone (edge), and a cloud backend (cloud); the microphone end includes a shell, a power module, a control module, a control panel, an acquisition module, an AI-synthesized singing module, a playback module, a storage module, a communication module, and a pitch detection module; the mobile phone is used for wireless communication with the cloud backend: downloading music scores, transmitting data, and human-computer interaction; the cloud backend is used for voiceprint extraction and dedicated acoustic model training, music score file processing, and generating a track file that includes a plurality of audio unit files and a track information file; wherein each audio unit file is an audio file of a certain word at a specific pitch, and the track information file records the pitch, duration, intensity, and start time of each word in the lyrics.
2. The AI human-machine collaborative synthesis microphone system of claim 1, wherein the control module serves as a core control unit, coordinates the cooperative work of each component of the microphone end, and realizes track file synthesis processing, working mode switching, and real-time pitch calibration through interaction with the cloud backend; the control panel includes two volume adjustment keys, a power button, and working mode switching keys; the power button is long-pressed to turn the device on or off and short-pressed to play or pause; the working mode switching keys include a human singing mode key and a machine singing mode key, which are used to switch the working mode of the microphone.
3. The AI human-machine collaborative synthesis microphone system of claim 1, wherein the acquisition module is used to collect the original voice data of the user and the live vocals during singing; the voice data is transmitted to the cloud backend for extracting voiceprint features.
4. The AI human-machine collaborative synthesis microphone system of claim 1, wherein the AI-synthesized singing module synthesizes, along the timeline, the audio units, note pitch, duration, intensity, and start time information contained in the track file to obtain a singing voice with the user's voiceprint features, which is called the synthesized singing voice.
5. The AI human-machine collaborative synthesis microphone system of claim 1, wherein the playback module includes the microphone's built-in speaker, which is used to play audio directly; the audio stream to the playback module has two channels, one being a vocal channel and the other a background channel; the data of the vocal channel is the synthesized singing voice or the captured live vocals; the data of the background channel is a background audio stream file directly provided by the mobile phone, which can be accompaniment or background music.
6. The AI human-machine collaborative synthesis microphone system of claim 1, wherein the storage module is used to store the data collected by the acquisition module, the downloaded track file, the data of the AI-synthesized singing voice, and the background audio stream file; the communication module realizes data interaction with the mobile phone through wired USB or wireless means.
7. The AI human-machine collaborative synthesis microphone system of claim 1, wherein the pitch detection module is used to analyze the voice data collected by the acquisition module in real time to obtain the pitch and duration of the user's singing, and to compare them with the standard pitch and duration in the track file; when the pitch and duration of the live vocals deviate from the standard value by more than a certain threshold, the singing is determined to be out of tune, and the control module switches the data in the vocal channel of the playback module to the synthesized singing voice; when the deviation of the live vocals from the standard value returns to within the threshold range, the control module automatically switches the data in the vocal channel back to the captured live vocals.
8. The AI human-machine collaborative synthesis microphone system of claim 1, wherein the process by which the cloud backend trains the dedicated acoustic model is as follows: the control module receives the original voice data of the user collected by the acquisition module, which is transmitted to the cloud backend through the communication module and the mobile phone; the cloud backend uses AI algorithms to extract voiceprint features such as the frequency spectrum, cepstrum, and pitch of the human voice; the cloud backend generates a dedicated acoustic model, binds it to the user ID, and stores it in a cloud database.
9. The AI human-machine collaborative synthesis microphone system of claim 1, wherein the microphone system's workflow includes: the control module controls the acquisition module to collect the user's original voice data, which is then transmitted to the mobile phone via the communication module and uploaded to the cloud backend by the mobile phone; the mobile phone sends a modeling command to the cloud backend, triggering the cloud backend to extract voiceprint features based on AI algorithms, train and generate a user-specific acoustic model, bind the model parameters to the user ID, and store them in the cloud backend database; users select songs and vocal parts to sing on their mobile phones; the mobile phone sends the song title, vocal part, and user ID information to the cloud backend; the cloud backend generates a unique track file for the user ID based on the user's dedicated acoustic model and the lyrics of the selected vocal part of the song, and pushes it to the mobile phone, the track file including several audio unit files and a track information file; the mobile phone receives the track file generated by the cloud backend and downloads it to the microphone, where it is stored in the storage module; users select background sound effects on their mobile phones, generate background audio stream files, and send them to the microphone for storage; users select either human singing mode or machine singing mode through the control panel; during the performance, the corresponding sheet music is displayed on the mobile phone, and the microphone performs synthesis and playback.
10. The AI human-machine collaborative synthesis microphone system of claim 9, wherein synthesis and playback in the described workflow proceed as follows: in machine singing mode, the AI-synthesized singing module reads the track file in the storage module, extracts the audio unit file of the corresponding word according to the start time of each word in the lyrics of the track information file, and synthesizes it into a singing voice according to the start time and duration; if background sound effects are needed, the control module reads the background audio stream file from the storage module and plays it in a dual-channel manner along the timeline together with the synthesized vocals; in human singing mode, the playback module plays the live vocals captured by the acquisition module in real time; at the same time, if background sound effects are needed, the control module reads the background audio stream file from the storage module and plays it in a dual-channel manner along the timeline together with the live vocals; in human singing mode, the pitch detection module checks the pitch and rhythm of the live vocals captured in real time; when off-key singing is detected, a pitch correction switching prompt is issued, and the control module loads the synthesized singing voice into the vocal channel; when the pitch is detected to be back to normal, a normal pitch prompt is issued, and the control module loads the live vocals into the vocal channel; the off-key judgment can be set as follows: if the pitch and duration of the collected live vocals exceed the error threshold compared with the standard value in the track file, the singing is judged off-key.