Method, system and related device for AI online song cover
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN MAIFENG TECH CO LTD
- Filing Date
- 2026-02-25
- Publication Date
- 2026-06-19
Figure CN122245337A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of audio processing technology, and in particular to a method, system and related equipment for AI-powered online song cover singing.
Background Technology
[0002] With the rise of short video, live streaming, and music sharing applications, users' demand for personalized and interactive music content is growing rapidly. In particular, online cover song functions allow users to transform original songs into their own vocal styles, fulfilling diverse needs such as entertainment, social interaction, and creation.
[0003] Traditional song cover production processes are complex. Users first need to find the instrumental version of the target song, record their a cappella vocals, and then use professional audio editing software to manually align, mix, and post-process the vocals and instrumental version. The whole process requires high levels of professional skills from users and is time-consuming and labor-intensive. Current technology lacks a solution that can automate the process of song covers.
[0004] Therefore, existing technologies still need to be improved and developed.
Summary of the Invention
[0005] This invention provides a method, system, and related equipment for AI-powered online song cover singing. Its main objective is to solve the technical problems described in the background section above.
[0006] The first aspect of this invention provides a method for AI-powered online song cover singing, comprising: obtaining an original song audio file that has been authorized for cover; invoking a sound source separation model to separate the original song audio file into an accompaniment track and an original vocal track; inputting the original vocal track into a pre-trained retrieval-based speech conversion model to convert the timbre of the original vocal track into a target timbre, obtaining a voice-changing vocal track; and mixing and synthesizing the voice-changing vocal track with the accompaniment track to generate a cover song audio file.
[0007] In an optional embodiment of the first aspect of the invention, before the step of inputting the original vocal track into the pre-trained retrieval-based speech conversion model, the method further includes: performing sound quality enhancement processing on the original vocal track, and evaluating the signal-to-noise ratio quality of the enhanced original vocal track to determine whether it meets the preset clarity standard.
In an optional embodiment of the first aspect of the present invention, the step of inputting the original vocal track into a pre-trained retrieval-based speech conversion model to convert the timbre of the original vocal track into a target timbre to obtain a voice-changing vocal track includes: extracting speech content features from the original vocal track, the speech content features including phonemes and prosodic information of the speech; retrieving and matching, in the retrieval-based speech conversion model, the corresponding target timbre features based on the speech content features; and fusing the speech content features with the target timbre features and synthesizing the voice-changing vocal track using a decoder.
[0008] In an optional embodiment of the first aspect of the present invention, the pre-training method of the retrieval-based speech conversion model includes: Acquire the target human voice audio data for training; The target human voice audio data is preprocessed, including unifying the sampling rate, audio segmentation, and filtering invalid audio segments; The retrieval-based speech conversion model is iteratively trained using the preprocessed target human voice audio data until the model converges or reaches a preset evaluation criterion.
[0009] In an optional embodiment of the first aspect of the present invention, the step of mixing and synthesizing the voice-changing vocal track with the accompaniment track to generate a cover song audio file includes: time-aligning the voice-changing vocal track and the accompaniment track to ensure rhythmic synchronization between the vocals and the accompaniment; performing automatic volume balancing based on the root mean square value of the audio energy of the voice-changing vocal track and the accompaniment track, so that the loudness of the synthesized vocals and the accompaniment is balanced; and generating a digital audio file of the cover song that users can play, download, or share online via web pages or mobile applications.
[0010] In an optional embodiment of the first aspect of the present invention, obtaining the authorized original song audio file to be covered includes: Receive authorized original song audio files uploaded by users and ready for cover performance via web or mobile application interfaces.
[0011] In an optional embodiment of the first aspect of the present invention, after the step of mixing and synthesizing the voice-changing vocal track with the accompaniment track to generate a cover song audio file, the method further includes: mastering the cover song audio file, the mastering including loudness and dynamic range control as well as spectrum equalization optimization, to improve the overall listening experience and quality of the audio.
[0012] A second aspect of the present invention provides an AI online song cover system, the AI online song cover system comprising: The audio acquisition module is used to acquire the original audio files of authorized songs to be covered. The sound source separation module is used to call the sound source separation model to separate the original song audio file into an accompaniment track and an original vocal track. The timbre conversion module is used to input the original human voice track into a pre-trained retrieval-based speech conversion model, convert the timbre of the original human voice track into the target timbre, and obtain a voice-changing human voice track; The mixing and synthesis module is used to mix and synthesize the voice-changing vocal track with the accompaniment track to generate an audio file of a cover song.
[0013] A third aspect of the present invention provides an AI online song cover singing device, the AI online song cover singing device comprising: a memory and at least one processor, the memory storing instructions, and the memory and the at least one processor being interconnected via a circuit; The at least one processor invokes the instructions in the memory to cause the AI online song cover device to perform the AI online song cover method as described in any one of the first aspects of the present invention.
[0014] A fourth aspect of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the AI online song cover method as described in any one of the first aspects of the present invention.
[0015] Beneficial Effects: This invention provides an AI-powered online song cover method, system, and related equipment. The method includes acquiring an authorized original song audio file to be covered; calling a sound source separation model to separate the original song audio file into an accompaniment track and an original vocal track; inputting the original vocal track into a pre-trained retrieval-based speech conversion model to convert the timbre of the original vocal track into a target timbre, obtaining a voice-changing vocal track; and mixing and synthesizing the voice-changing vocal track with the accompaniment track to generate a cover song audio file. This invention's AI-powered online song cover method integrates sound source separation, timbre conversion, and automatic mixing into a unified AI model processing flow, solving the technical problems of cumbersome operation and fragmented processes in existing song cover techniques, and lowering the creative threshold for users.
Attached Figure Description
[0016] Figure 1 is a schematic diagram of an embodiment of the AI online song cover method of the present invention; Figure 2 is a schematic diagram of an embodiment of the system interaction architecture of the AI online song cover method of the present invention; Figure 3 is a schematic diagram of an embodiment of the AI online song cover system of the present invention; Figure 4 is a schematic diagram of an embodiment of the AI online song cover device of the present invention.
Detailed Implementation
[0017] The terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" or "having" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0018] The first aspect of this invention provides an AI-powered online song cover method that can be deployed in a client / server architecture. The client can be a web browser or a mobile application, responsible for user interaction, file uploading, and result display. The server acts as the backend, responsible for executing all core audio processing computation tasks, including audio separation, model training, speech conversion, mixing and synthesis, and mastering.
[0019] Referring to Figure 1, the AI-powered online song cover method includes: S100. Obtain the authorized original song audio file for cover performance. For example, a user can access the online service of this invention through a client (e.g., a webpage or application). On the interactive interface, the user clicks the upload song button and selects, from their local device, an original song audio file that is authorized for cover performance (that is, with permission from the song's copyright holder). The format of the original song audio file may include MP3, WAV, and M4A. That is, in an optional embodiment of the first aspect of this invention, obtaining the authorized original song audio file for cover performance includes: receiving the authorized original song audio file uploaded by the user through a webpage or mobile application interface.
[0020] After receiving the original song audio file, the server can perform some basic preprocessing to ensure the quality of the song file. The basic preprocessing includes format verification to ensure that the file is a supported audio format; parameter unification to unify the audio sampling rate to 48kHz for easy subsequent model processing; and volume normalization to standardize the overall audio volume to prevent the processing effect from being affected by the input volume being too high or too low.
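The basic preprocessing above can be illustrated with a minimal sketch. It assumes the audio has already been decoded into mono float samples in the range [-1, 1]; the function names, the extension set, and the 0.9 target peak are hypothetical choices for illustration and are not taken from the specification. Resampling to 48 kHz is omitted because it would normally rely on an external audio library.

```python
from pathlib import Path

# Formats named in paragraph [0019]; the set itself is an illustrative assumption.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a"}
TARGET_SAMPLE_RATE = 48_000  # Hz, the unified rate stated in the description


def is_supported_format(filename: str) -> bool:
    """Check the extension only; a real service would also inspect the file header."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def peak_normalize(samples: list[float], target_peak: float = 0.9) -> list[float]:
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Peak normalization is only one possible reading of "volume normalization"; a loudness-based target would serve the same purpose.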
[0021] S200: The sound source separation model is invoked to separate the original song audio file into an accompaniment track and an original vocal track. After step S100, the system automatically invokes the sound source separation model to separate the accompaniment track and vocal track of the song file. The sound source separation model can adopt UVR (Ultimate Vocal Remover) technology based on deep learning, and selects the HP5 vocal separation model, which is specially optimized for karaoke scenarios. The model parameter configuration can be as follows: Segment Size is set to 256 to achieve high-precision analysis; Overlap is set to 0.25 to prevent popping or unnaturalness at the audio block splicing points; Aggression is set to 8 to achieve a balance between thoroughness and naturalness of separation; Enable TTA is set to the on state to improve the stability of the separation result; Post Process is set to the on state to further reduce the residue of accompaniment in the vocal track. After processing, the system outputs two independent audio tracks.
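The parameter set listed above could be captured in a small configuration object, sketched below. The class name, field names, and range checks are hypothetical; they mirror the values in the description and do not correspond to any actual UVR programming interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeparationConfig:
    """Hypothetical container for the separation settings named in the description."""
    model_name: str = "HP5"
    segment_size: int = 256
    overlap: float = 0.25
    aggression: int = 8
    enable_tta: bool = True
    post_process: bool = True

    def validate(self) -> None:
        # These range checks are illustrative assumptions, not rules imposed by UVR.
        if self.segment_size <= 0:
            raise ValueError("segment_size must be positive")
        if not 0.0 <= self.overlap < 1.0:
            raise ValueError("overlap must be in [0, 1)")
```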
[0022] In this invention, to ensure the quality of subsequent timbre conversion, the system can further perform sound quality enhancement processing on the separated vocal track. For example, it can call third-party APIs such as Auphonic or self-developed sound quality enhancement algorithms to optimize the vocals by denoising and de-reverberating, making the vocals cleaner and clearer. A signal-to-noise ratio (SNR) quality assessment mechanism is also introduced, which quantifies the purity of the vocals by calculating the energy ratio of the vocal signal to the background noise. For example, if the SNR is lower than a preset threshold (e.g., 20 dB), the system can prompt the user to switch to a higher quality audio source to ensure the final effect. Specifically, in an optional embodiment of the first aspect of this invention, before proceeding to the subsequent step S300, the method further includes: performing sound quality enhancement processing on the original vocal track, and performing a signal-to-noise ratio quality assessment on the enhanced original vocal track to determine whether the enhanced original vocal track meets a preset clarity standard.
[0023] S300: Input the original vocal track into a pre-trained retrieval-based speech conversion model to convert the timbre of the original vocal track into the target timbre, thereby obtaining a voice-changing vocal track. In this invention, before a cover can be generated, the target timbre must be available; it is provided by a pre-trained retrieval-based speech conversion model. After a user trains such a model, it can be saved to a public timbre library for users to choose from. The pre-training method of the retrieval-based speech conversion model may include: obtaining target human voice audio data for training; this data can come from dry (unaccompanied) vocal recordings uploaded with the user's authorization, existing legally authorized timbre material libraries, or publicly permitted datasets.
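The signal-to-noise check can be sketched as follows. The specification does not say how the background noise is estimated, so this example simply assumes that a separate noise estimate (for instance, samples taken from non-vocal passages) is already available; the function names are hypothetical.

```python
import math


def mean_power(samples: list[float]) -> float:
    """Average energy per sample."""
    return sum(s * s for s in samples) / len(samples) if samples else 0.0


def snr_db(signal: list[float], noise: list[float]) -> float:
    """Energy ratio of the vocal signal to the noise estimate, in decibels."""
    p_signal, p_noise = mean_power(signal), mean_power(noise)
    if p_noise == 0.0:
        return math.inf
    if p_signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(p_signal / p_noise)


def meets_clarity_standard(signal: list[float], noise: list[float],
                           threshold_db: float = 20.0) -> bool:
    """True when the SNR reaches the preset threshold (20 dB in the example above)."""
    return snr_db(signal, noise) >= threshold_db
```

When the check fails, the service would prompt the user for a cleaner source, as the paragraph above describes.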
[0024] The target human voice audio data is preprocessed, including unifying the sampling rate, audio segmentation, and filtering invalid audio segments. In this step, the system will perform strict preprocessing on the uploaded target human voice audio data, including unifying it to PCM format, automatically segmenting it into several short segments according to the silence segment, filtering out segments that are too long or too short, and detecting and removing invalid segments with pops or long silences (filtered out through energy and silence detection algorithms).
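A minimal sketch of the segment filter described above, assuming the audio has already been cut into segments of float samples in [-1, 1]. The duration limits, the RMS floor, and the peak ceiling used to flag pops are all hypothetical values; the specification only states that such rules exist.

```python
import math


def rms(samples: list[float]) -> float:
    """Root mean square level of a segment."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def keep_segment(samples: list[float], sample_rate: int,
                 min_seconds: float = 1.0, max_seconds: float = 15.0,
                 min_rms: float = 0.01, max_peak: float = 0.999) -> bool:
    """Return True when a segment is usable as training material."""
    duration = len(samples) / sample_rate
    if not min_seconds <= duration <= max_seconds:
        return False  # too short or too long
    if rms(samples) < min_rms:
        return False  # mostly silence
    if max((abs(s) for s in samples), default=0.0) >= max_peak:
        return False  # full-scale peak, treated here as a likely pop or clip
    return True
```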
[0025] The retrieval-based speech conversion model is iteratively trained using the preprocessed target human voice audio data until the model converges or reaches a preset evaluation criterion. In this step, the system divides the preprocessed human voice audio data into a training set and a validation set at a preset ratio (e.g., 9:1). The retrieval-based speech conversion (RVC) model is then iteratively trained. During training, the system monitors the loss value on the validation set. Training terminates when the loss value no longer decreases significantly or reaches the preset upper limit of training epochs, and the trained retrieval-based speech conversion model is then saved.
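The data split and the stopping rule above can be sketched as below. The patience window and the minimum improvement `min_delta` are hypothetical parameters introduced to make "no longer decreases significantly" concrete; the actual RVC training loop is not shown.

```python
import random


def split_dataset(items, train_ratio: float = 0.9, seed: int = 0):
    """Shuffle reproducibly and split into training and validation sets (9:1 by default)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def should_stop(val_losses: list[float], max_epochs: int,
                patience: int = 3, min_delta: float = 1e-3) -> bool:
    """val_losses holds one validation loss per completed epoch."""
    if len(val_losses) >= max_epochs:
        return True  # preset upper limit of training epochs reached
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    # Stop when the last `patience` epochs improved on the earlier best by less than min_delta.
    return best_before - recent_best < min_delta
```

Fixing the seed keeps the split reproducible between training runs.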
[0026] Timbre conversion is one of the core aspects of this invention. In an optional embodiment of step S300 of this invention, inputting the original vocal track into a pre-trained retrieval-based speech conversion model to convert the timbre of the original vocal track into the target timbre to obtain a voice-changing vocal track includes: S301. Extract speech content features from the original vocal track. The speech content features include phonemes and prosodic information. In this step, the system first performs a short-time Fourier transform (STFT) on the input original vocal track to convert it from the time domain to the frequency domain. Then, it extracts acoustic features such as the Mel spectrum (which approximates human hearing characteristics, using 128 Mel filters over a frequency range of 40 Hz to 16000 Hz) and the pitch curve (F0 curve, representing the melody being sung, with an extraction range of 50 Hz to 1100 Hz). Next, it maps these extracted acoustic features to an abstract speech content representation space through a pre-trained content encoder (such as a CNN+Transformer structure). The encoder retains singing content information such as phonemes, rhythm, and pitch while stripping away the original singer's timbre information; that is, it keeps only "what was sung" and "how it was sung" (pitch, rhythm, etc.).
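To make the Mel filter settings concrete, the sketch below computes the centre frequencies of 128 filters spaced evenly on the Mel scale between 40 Hz and 16000 Hz. It assumes the common HTK-style formula mel = 2595 log10(1 + f / 700); the specification does not state which Mel convention is used, and audio libraries differ on this point.

```python
import math


def hz_to_mel(hz: float) -> float:
    """HTK-style Hz to Mel conversion (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int = 128, f_min: float = 40.0,
                           f_max: float = 16000.0) -> list[float]:
    """Centre frequencies in Hz of n_mels filters evenly spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    # n_mels + 2 equally spaced Mel points span the range; the interior ones are filter centres.
    return [mel_to_hz(lo + step * i) for i in range(1, n_mels + 1)]
```

The spacing grows with frequency, which is how the Mel scale gives finer resolution to the low and mid range where most vocal detail lies.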
[0027] S302. In the retrieval-based speech conversion model, the corresponding target timbre features are retrieved and matched based on the speech content features. In this step, the system uses the extracted speech content features to perform efficient retrieval in the target timbre model (i.e., the retrieval-based speech conversion model) to find the target timbre feature that best matches the current content. The retrieval mechanism can be to find the target timbre feature that best matches the current content by calculating similarity. The specific retrieval parameters can be set to Top-K=8, similarity threshold=0.20, and the fusion method is to use the Top-K results as a weighted average based on similarity.
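The retrieval and fusion rule above (Top-K = 8, similarity threshold = 0.20, similarity-weighted average) can be sketched as follows. Cosine similarity and the flat list used as a timbre feature bank are assumptions made for illustration; the specification does not name the similarity measure or the index structure.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_timbre(query: list[float], bank: list[list[float]],
                    top_k: int = 8, threshold: float = 0.20):
    """Similarity-weighted average of the top_k bank vectors at or above the threshold.

    Returns None when no bank vector reaches the threshold.
    """
    scored = [(cosine_similarity(query, vec), vec) for vec in bank]
    kept = sorted((p for p in scored if p[0] >= threshold),
                  key=lambda p: p[0], reverse=True)[:top_k]
    if not kept:
        return None
    total = sum(sim for sim, _ in kept)  # positive because threshold > 0
    dim = len(kept[0][1])
    return [sum(sim * vec[i] for sim, vec in kept) / total for i in range(dim)]
```

Dividing by the sum of similarities normalizes the weights so that they add up to one.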
[0028] S303. The speech content features are fused with the target timbre features, and a voice-changing track is synthesized using a decoder. In this step, the system fuses the speech content features of the original audio file with the retrieved and matched target timbre features, and then sends the fused features to a decoder (e.g., a HiFi-GAN type generative adversarial network decoder). The decoder resynthesizes the fused features into a high-quality speech waveform (i.e., the converted voice-changing track). Optionally, after the voice track is synthesized, the system also analyzes the pitch trajectory in the newly synthesized voice track to detect and smooth any possible local abnormal jitter or abrupt changes, in order to ensure the smoothness and naturalness of the subsequently synthesized singing voice.
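The optional pitch smoothing step could take many forms; one simple possibility, shown here as an assumption rather than the method of the specification, is a median filter over the F0 trajectory. Frames with a value of 0.0 are treated as unvoiced, which is a convention chosen for this sketch.

```python
from statistics import median


def smooth_f0(f0: list[float], window: int = 5) -> list[float]:
    """Median-filter voiced frames; 0.0 marks unvoiced frames and is left untouched."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i, value in enumerate(f0):
        if value == 0.0:
            smoothed.append(0.0)
            continue
        # Only voiced neighbours take part, so gaps do not drag the pitch towards zero.
        neighbours = [v for v in f0[max(0, i - half): i + half + 1] if v > 0.0]
        smoothed.append(median(neighbours))
    return smoothed
```

A median filter removes isolated spikes such as octave jumps while leaving steady notes unchanged.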
[0029] S400: Mix and synthesize the altered vocal track with the accompaniment track to generate a cover song audio file. After generating the altered vocal track, the system enters the final synthesis stage. The system calls a mixing module (e.g., using the PyDub library) to merge the altered vocal track with the accompaniment track separated in step S200. In an optional embodiment of the present invention, step S400 may specifically include: S401. Time alignment processing is performed on the voice-changing track and the accompaniment track to ensure that the voice and accompaniment are synchronized in rhythm. In this step, in order to prevent misalignment, the system will perform precise alignment based on the start timestamps of the two tracks and perform precise trimming or filling in silence to avoid misalignment, delay or rushing of the voice and accompaniment.
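The trimming and silence padding of step S401 can be sketched on plain sample lists. The `vocal_offset` parameter (in samples) is hypothetical; it stands in for the difference between the two tracks' start timestamps, and the accompaniment length is taken as the reference.

```python
def align_tracks(vocal: list[float], accompaniment: list[float],
                 vocal_offset: int = 0) -> tuple[list[float], list[float]]:
    """Shift the vocal by vocal_offset samples, then pad or trim it to the accompaniment length.

    A positive offset delays the vocal by inserting leading silence;
    a negative offset advances it by dropping leading samples.
    """
    if vocal_offset >= 0:
        shifted = [0.0] * vocal_offset + list(vocal)
    else:
        shifted = list(vocal)[-vocal_offset:]
    target = len(accompaniment)
    if len(shifted) < target:
        shifted += [0.0] * (target - len(shifted))  # fill the tail with silence
    return shifted[:target], list(accompaniment)
```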
[0030] S402. Based on the root mean square (RMS) values of the audio energy of the vocal track and the accompaniment track, automatic volume balancing is performed to ensure the harmony of loudness between the synthesized vocals and accompaniment. In this step, the system calculates the RMS values of the two tracks separately and automatically adjusts their respective gains according to the target loudness ratio (e.g., the vocals are 1-3 dB higher than the accompaniment) to make the vocals clear and prominent and blend naturally with the accompaniment.
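The RMS-based balancing of step S402 can be illustrated with a short sketch. As a simplification, it adjusts only the vocal gain and leaves the accompaniment untouched, whereas the description allows both gains to be adjusted; the 2 dB default is one value within the 1 to 3 dB example range, and the function names are hypothetical.

```python
import math


def rms(samples: list[float]) -> float:
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def vocal_gain_for_ratio(vocal: list[float], accompaniment: list[float],
                         target_db: float = 2.0) -> float:
    """Linear gain that places the vocal RMS target_db above the accompaniment RMS."""
    v, a = rms(vocal), rms(accompaniment)
    if v == 0.0 or a == 0.0:
        return 1.0  # nothing to balance against
    current_db = 20.0 * math.log10(v / a)
    return 10.0 ** ((target_db - current_db) / 20.0)


def mix(vocal: list[float], accompaniment: list[float], gain: float) -> list[float]:
    """Sum the two aligned tracks sample by sample, clamping to [-1, 1]."""
    return [max(-1.0, min(1.0, gain * v + a)) for v, a in zip(vocal, accompaniment)]
```

For example, a vocal at half the accompaniment's RMS (about 6 dB lower) needs roughly 8 dB of gain, a factor of about 2.5, to sit 2 dB above it.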
[0031] S403. Generate a digital audio file of the cover song that users can play, download, or share online via a webpage or mobile application. In this invention, after the cover song is generated, the system encodes it into a common digital audio format, such as high-bitrate MP3 and lossless WAV, and returns the address of the generated audio file to the client. Users can then play it online directly on a webpage or application, download it locally, or generate a shareable link to send to friends.
[0032] In an optional embodiment of the first aspect of the present invention, in order to improve the listening experience of the final product, after mixing, the system will also perform automated mastering processing. That is, after mixing and synthesizing the vocal track and the accompaniment track to generate the cover song audio file, the system will perform mastering processing on the cover song audio file. The mastering processing includes loudness and dynamic range control (suppressing abnormal peaks in the audio and increasing the overall loudness to a standard level through compressors and limiters) and spectrum equalization optimization (automatically adjusting the low-frequency, mid-frequency and high-frequency distribution of the audio through a multi-band equalizer (EQ) to make the listening experience more balanced and reduce the muddiness that may be caused by mixing), thereby improving the overall listening experience and quality of the audio.
[0033] To facilitate understanding of the technical solution for AI-powered online song cover singing in this invention, the main points of the solution, based on the interaction architecture shown in Figure 2, can be summarized as follows: (1) User audio input and separation: 1. Users upload the audio file to be covered via a web page. The audio file can be in formats such as MP3, WAV, FLAC, or M4A. 2. The system performs basic preprocessing on the input audio, including format verification, file size and duration checks, sampling rate standardization, and volume normalization; 3. The system calls the open-source UVR (Ultimate Vocal Remover) technology and uses the HP5 (karaoke) vocal separation model to separate the input song audio into a vocal track and a background instrumental track. The basic parameters of the model are configured as follows: Segment Size is 256 (high precision), Overlap is 0.25 (to prevent splicing pops), Aggression is 8 (naturalness), Enable TTA is enabled (to improve stability), and Post Process is enabled (to reduce accompaniment residue).
[0034] 4. After obtaining the vocal track, the system calls the audio post-processing and sound quality enhancement module, uses Auphonic (sound quality enhancement) to further optimize the audio quality of the vocal track, and outputs the enhanced vocal track.
[0035] 5. After the human voice separation process is completed, the system introduces a signal-to-noise ratio audio quality assessment mechanism to calculate the energy ratio of the target speech signal to the background noise in the human voice track, and measure the purity and recognizability of the human voice.
[0036] (2) RVC timbre model training process: 1. Model Training and Selection: a) Training data sources: The system uses user-authorized uploaded target human voice data, existing legally authorized timbre material libraries, and publicly available and permitted datasets as model training input datasets. b) Data Construction and Cleaning: A unified PCM dataset format and sampling rate are used; audio segmentation (silence segmentation) is performed, filtering excessively short / long segments; energy and silence detection are conducted (filtering long silences and popping sounds); the proportions, rules, and random seeds for dividing the training and validation sets are recorded and persisted. The above cleaning and filtering rules are configured parametrically and recorded in the training task log.
[0037] c) Training Process: The system iteratively trains the RVC model based on the training samples, enabling the model to learn the acoustic features and timbre distribution of the target timbre. The training process is capped at a preset number of training rounds or iterations, and the performance on the validation set is used as the termination condition.
[0038] d) During or after training, the system evaluates the model performance based on the validation set and selects the model parameters that meet the preset quality requirements as the target timbre model for subsequent speech conversion inference.
[0039] (3) Voice conversion using the RVC target timbre model: 1. The system inputs the separated vocal track into the RVC (Retrieval-based Voice Conversion) model, a deep-learning, retrieval-based speech conversion method; 2. Feature Extraction: Perform a Short Time Fourier Transform (STFT, window function = Hann, window length = 2048, n_fft = 2048, hop_length = 512) on the input human voice to extract features such as the Mel spectrum (Mel filter bank, mel_bins = 128, f_min = 40 Hz, f_max = 16000 Hz), pitch curve (F0 range 50 Hz to 1100 Hz), and energy. 3. Feature Encoding: The acoustic features are mapped to the speech content representation space using a pre-trained encoder (Content Encoder, implemented by a CNN+Transformer temporal network with 256 input and output dimensions) to preserve phoneme and speech content information. 4. Timbre Retrieval and Matching: Timbres are stored in the user-trained RVC timbre model. Through the retrieval mechanism of the RVC model, the target timbre selected by the user is matched with the corresponding timbre features (retrieval and fusion use fixed parameters: Top-K = 8, similarity threshold = 0.20, and fusion by Top-K weighted average with similarity normalization). 5. Decoding and Synthesis: The content features are combined with the target timbre features, and the decoder (HiFi-GAN-like structure) outputs a new speech waveform, thereby achieving timbre conversion. After synthesis, the pitch trajectory of the speech is analyzed to detect local abnormal jitter, abrupt changes, or discontinuities.
[0040] (4) Audio synthesis and output: 1. The system uses Pydub (overlay mixing) technology to remix the altered vocal track with the previously saved accompaniment track. During the mixing process, the system performs volume balancing and gain control on the vocal and accompaniment tracks respectively to ensure overall harmony in sound. Volume balancing is calculated based on the root mean square (RMS) value, and track alignment is performed, including but not limited to start time alignment, duration trimming, or padding, to avoid misalignment, delay, or advance of the vocals and accompaniment, thus ensuring rhythmic and structural consistency and forming a complete song audio file. 2. The system uses loudness and dynamic range control technology to perform overall loudness analysis and dynamic range control on the synthesized audio, suppressing abnormal peaks and balancing intensity variations. It also employs spectrum equalization optimization technology to adjust the distribution of low, mid, and high frequencies, reducing potential muddiness or harshness after mixing and further improving the quality and listening experience of the final audio.
[0041] 3. Output in common formats such as MP3 (CBR encoding mode, 320kbps bit rate, 48000Hz sampling rate, compression strategy prioritizes high-quality encoding and reduces distortion) and WAV (PCM encoding format, 16-bit bit depth, 48000Hz sampling rate, uncompressed) for users to download or share.
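The WAV output settings above (PCM, 16-bit, 48000 Hz) can be reproduced with Python's standard `wave` module, as sketched below. The function name and the mono channel layout are assumptions; MP3 encoding is not shown because it requires an external encoder.

```python
import struct
import wave


def write_wav_pcm16(samples: list[float], destination, sample_rate: int = 48_000) -> None:
    """Write mono float samples in [-1, 1] as an uncompressed 16-bit PCM WAV file.

    destination may be a file path or a writable, seekable binary file object.
    """
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        frames += struct.pack("<h", int(round(clipped * 32767)))  # little-endian int16
    with wave.open(destination, "wb") as wav:
        wav.setnchannels(1)   # mono, an assumption for this sketch
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit depth
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

Passing an in-memory buffer instead of a path makes it easy to hand the encoded bytes to a web response or to object storage.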
[0042] Referring to Figure 3, the second aspect of the present invention provides an AI online song cover system, the AI online song cover system comprising: The audio acquisition module 10 is used to acquire the original audio files of the songs that have been authorized for cover. The sound source separation module 20 is used to call the sound source separation model to separate the original song audio file into an accompaniment track and an original vocal track; The timbre conversion module 30 is used to input the original human voice track into a pre-trained retrieval-based speech conversion model, convert the timbre of the original human voice track into the target timbre, and obtain a voice-changing vocal track; The mixing and synthesis module 40 is used to mix and synthesize the voice-changing vocal track with the accompaniment track to generate an audio file of a cover song.
[0043] In an optional embodiment of the second aspect of the present invention, the AI online song cover system further includes: The audio quality enhancement module is used to enhance the audio quality of the original vocal track and to evaluate the signal-to-noise ratio of the enhanced original vocal track to determine whether the enhanced original vocal track meets the preset clarity standard.
In an optional embodiment of the second aspect of the present invention, the timbre conversion module includes: The speech feature extraction unit is used to extract speech content features from the original human voice track, wherein the speech content features include phonemes and prosodic information of the speech. The target timbre retrieval unit is used to retrieve and match the corresponding target timbre features based on the speech content features in the retrieval-based speech conversion model. The timbre feature fusion unit is used to fuse the speech content features with the target timbre features and synthesize a voice-changing vocal track through a decoder.
[0044] In an optional embodiment of the second aspect of the present invention, the pre-training method of the retrieval-based speech conversion model includes: Acquire the target human voice audio data for training; The target human voice audio data is preprocessed, including unifying the sampling rate, audio segmentation, and filtering invalid audio segments; The retrieval-based speech conversion model is iteratively trained using the preprocessed target human voice audio data until the model converges or reaches a preset evaluation criterion.
[0045] In an optional embodiment of the second aspect of the present invention, the mixing and synthesis module includes: The track alignment unit is used to perform time alignment processing on the voice-changing track and the accompaniment track to ensure rhythmic synchronization between the voice and the accompaniment. The volume balancing unit is used to perform automatic volume balancing based on the root mean square value of the audio energy of the voice-changing vocal track and the accompaniment track to ensure the harmony of loudness between the synthesized vocals and accompaniment. The file generation unit is used to generate digital audio files of cover songs that users can play, download, or share online via web pages or mobile applications.
[0046] In an optional embodiment of the second aspect of the present invention, the audio acquisition module includes: The file upload unit is used to receive authorized original song audio files for cover performance uploaded by users via a web page or mobile application interface.
[0047] In an optional embodiment of the second aspect of the present invention, the AI online song cover system further includes: The mastering module is used to perform mastering processing on the cover song audio file. The mastering processing includes loudness and dynamic range control, as well as spectrum equalization optimization, to improve the overall listening experience and quality of the audio.
[0048] Figure 4 This is a schematic diagram of the structure of an AI online song cover device provided in an embodiment of the present invention. This AI online song cover device can vary significantly due to differences in configuration or performance, and may include one or more processors 50 (central processing units, CPUs) (e.g., one or more processors) and memory 60, and one or more storage media 70 (e.g., one or more mass storage devices) for storing applications or data. The memory and storage media can be short-term or persistent storage. The program stored in the storage media may include one or more modules (not shown in the diagram), each module may include a series of instruction operations on the AI online song cover device. Furthermore, the processor may be configured to communicate with the storage media and execute the series of instruction operations stored in the storage media on the AI online song cover device.
[0049] The AI online song cover device of this invention may also include one or more power supplies 80, one or more wired or wireless network interfaces 90, one or more input/output interfaces 100, and/or one or more operating systems, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. Those skilled in the art will understand that the structure of the AI online song cover device shown in Figure 4 does not constitute a limitation on the AI online song cover device. The device may include more or fewer components than shown, combine certain components, or arrange the components differently.
[0050] The present invention also provides a computer-readable storage medium, which can be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, wherein the computer-readable storage medium stores instructions that, when the instructions are executed on a computer, cause the computer to perform the steps of the AI online song cover method.
[0051] Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system, modules, and units described above may be understood by reference to the corresponding processes in the foregoing method embodiments, and are not repeated here.
[0052] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0053] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for AI-powered online song cover singing, characterized in that the method comprises: obtaining an original song audio file that has been authorized for cover; invoking a sound source separation model to separate the original song audio file into an accompaniment track and an original vocal track; inputting the original vocal track into a pre-trained retrieval-based speech conversion model to convert the timbre of the original vocal track into a target timbre, thereby obtaining a voice-changing vocal track; and mixing and synthesizing the voice-changing vocal track with the accompaniment track to generate a cover song audio file.
2. The AI-powered online song cover method according to claim 1, characterized in that the step of inputting the original vocal track into the pre-trained retrieval-based speech conversion model includes: performing sound quality enhancement processing on the original vocal track, and evaluating the signal-to-noise ratio of the enhanced original vocal track to determine whether the enhanced original vocal track meets a preset clarity standard.
3. The AI-powered online song cover method according to claim 1, characterized in that the step of inputting the original vocal track into the pre-trained retrieval-based speech conversion model to convert the timbre of the original vocal track into the target timbre, thereby obtaining the voice-changing vocal track, includes: extracting speech content features from the original vocal track, the speech content features including phoneme and prosodic information of the speech; retrieving and matching, in the retrieval-based speech conversion model, the corresponding target timbre features based on the speech content features; and fusing the speech content features with the target timbre features and synthesizing the voice-changing vocal track using a decoder.
4. The AI-powered online song cover method according to claim 1, characterized in that the pre-training of the retrieval-based speech conversion model includes: acquiring target human voice audio data for training; preprocessing the target human voice audio data, the preprocessing including unifying the sampling rate, segmenting the audio, and filtering out invalid audio segments; and iteratively training the retrieval-based speech conversion model using the preprocessed target human voice audio data until the model converges or reaches a preset evaluation criterion.
5. The AI-powered online song cover method according to claim 1, characterized in that the step of mixing and synthesizing the voice-changing vocal track with the accompaniment track to generate the cover song audio file includes: performing time alignment processing on the voice-changing vocal track and the accompaniment track to ensure rhythmic synchronization between the vocals and the accompaniment; performing automatic volume balancing based on the root mean square values of the audio energy of the voice-changing vocal track and the accompaniment track, so that the loudness of the synthesized vocals and accompaniment is balanced; and generating a cover song digital audio file that users can play, download, or share online via a web page or mobile application.
6. The AI-powered online song cover method according to claim 1, characterized in that the step of obtaining the original song audio file that has been authorized for cover includes: receiving, via a web page or mobile application interface, an authorized original song audio file to be covered that is uploaded by a user.
7. The AI-powered online song cover method according to claim 1, characterized in that the step of mixing and synthesizing the voice-changing vocal track with the accompaniment track to generate the cover song audio file includes: performing mastering processing on the cover song audio file, the mastering processing including loudness and dynamic range control, as well as spectrum equalization optimization, to improve the overall listening experience and quality of the audio.
8. An AI-powered online song cover system, characterized in that the AI-powered online song cover system includes: an audio acquisition module, used to obtain an original song audio file that has been authorized for cover; a sound source separation module, used to invoke a sound source separation model to separate the original song audio file into an accompaniment track and an original vocal track; a timbre conversion module, used to input the original vocal track into a pre-trained retrieval-based speech conversion model to convert the timbre of the original vocal track into a target timbre, thereby obtaining a voice-changing vocal track; and a mixing and synthesis module, used to mix and synthesize the voice-changing vocal track with the accompaniment track to generate a cover song audio file.
9. An AI-powered online song cover device, characterized in that the AI online song cover device includes: a memory and at least one processor, wherein the memory stores instructions, and the memory and the at least one processor are interconnected via a circuit; and the at least one processor invokes the instructions in the memory to cause the AI online song cover device to perform the AI online song cover method as described in any one of claims 1-7.
10. A computer-readable storage medium storing a computer program thereon, characterized in that, when the computer program is executed by a processor, it implements the AI online song cover method as described in any one of claims 1-7.