Speech generation model training method, speech generation method, device, computer equipment, storage medium and computer program product

By using a language processing model fine-tuned for podcast scenarios to remove target semantic words during speech generation model training, and combining sample podcast text and audio for iterative training, the problem of insufficient customization for podcast scenarios in existing technologies is solved, thereby improving the training quality and adaptability of speech generation models.

CN122245285A · Pending · Publication Date: 2026-06-19 · TENCENT MUSIC ENTERTAINMENT TECH (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TENCENT MUSIC ENTERTAINMENT TECH (SHENZHEN) CO LTD
Filing Date
2026-03-19
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to customize speech generation models for the colloquial expressions and music-related terminology used in podcast scenarios, resulting in low training quality.

Method used

The initial podcast text corresponding to the sample podcast audio in the podcast scenario is obtained, and a language processing model fine-tuned for the podcast scenario removes the target semantic words from it to produce the sample podcast text. The podcast speech generation model is then iteratively trained on the sample podcast text and the sample podcast audio to obtain the trained podcast speech generation model.

Benefits of technology

It improves the training quality of the speech generation model, makes the synthesized speech more conversational, improves the pronunciation accuracy of musical terminology and the scene adaptability of the synthesized speech, and strengthens the customization of the TTS language model.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to a speech generation model training method, speech generation method, apparatus, computer device, storage medium, and computer program product. The method includes: obtaining initial podcast text corresponding to sample podcast audio in a podcast scenario; deleting target semantic words from the initial podcast text using a language processing model fine-tuned for the podcast scenario to obtain sample podcast text for a podcast speech generation model to be trained; the target semantic words are used to represent words in the initial podcast text that lack actual semantic information; and iteratively training the podcast speech generation model to be trained based on the sample podcast text and the corresponding sample podcast audio to obtain a trained podcast speech generation model. This method can improve the training quality of the speech generation model.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a speech generation model training method, speech generation method, apparatus, computer device, computer-readable storage medium, and computer program product.

Background Technology

[0002] With the widespread adoption of the internet and mobile devices, the demand for audio content consumption has exploded. To ensure the quality of generated audio content, it is crucial to accurately train the audio generation model.

[0003] In traditional techniques, speech generation models are typically trained using general text corpora (such as news articles and novels). However, this approach is difficult to customize for the colloquial expressions, music-related terminology, and contextualized dialogues used in podcasts, resulting in low training quality of the speech generation models.

Summary of the Invention

[0004] Therefore, it is necessary to provide a speech generation model training method, speech generation method, device, computer equipment, computer-readable storage medium, and computer program product that can improve the training quality of speech generation models in response to the above-mentioned technical problems.

[0005] In a first aspect, this application provides a method for training a speech generation model, including:

[0006] Obtain the initial podcast text corresponding to the sample podcast audio in a podcast scenario;

[0007] Using a language processing model fine-tuned for the podcast scenario, remove target semantic words from the initial podcast text to obtain sample podcast text for the podcast speech generation model to be trained; the target semantic words represent words in the initial podcast text that lack actual semantic information.

[0008] Based on the sample podcast text and the corresponding sample podcast audio, the podcast speech generation model to be trained is iteratively trained to obtain the trained podcast speech generation model.

[0009] In one embodiment, obtaining the initial podcast text corresponding to the sample podcast audio in a podcast scenario includes:

[0010] The sample podcast audio in the podcast scenario is segmented to obtain multiple sub-sample podcast audio;

[0011] Each subsample podcast audio is input into the speech recognition model to obtain the target recognition text corresponding to each subsample podcast audio.

[0012] The initial podcast text is obtained based on the target recognition text corresponding to each subsample podcast audio.

[0013] In one embodiment, the step of inputting each subsample podcast audio into a speech recognition model to obtain the target recognition text corresponding to each subsample podcast audio includes:

[0014] For any one of the multiple subsample podcast audios, the subsample podcast audio is input into the speech recognition model to obtain the initial recognition text corresponding to the subsample podcast audio;

[0015] Obtain the text length of the initial recognized text corresponding to the subsample podcast audio, and the audio duration corresponding to each subsample podcast audio;

[0016] If the ratio of the text length of the initial recognition text corresponding to the subsample podcast audio to the audio duration is greater than or equal to the preset ratio required by the podcast scenario, the initial recognition text is used as the target recognition text corresponding to the subsample podcast audio.

[0017] If the ratio is less than the preset ratio required by the podcast scenario, the process returns to the step of inputting the subsample podcast audio into the speech recognition model to obtain the initial recognition text, and repeats until the ratio of the text length of the initial recognition text to the audio duration is greater than or equal to the preset ratio.
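
The re-recognition rule in this embodiment can be sketched roughly as follows. This is only an illustrative sketch, not the applicant's implementation: the function name `recognize_with_ratio_check`, the `asr` callable, the characters-per-second unit for the ratio, and the `max_attempts` cap are all assumptions (the source describes repeating until the ratio is met and does not mention a cap).

```python
from typing import Callable


def recognize_with_ratio_check(
    audio: bytes,
    duration_s: float,
    asr: Callable[[bytes], str],
    min_chars_per_second: float,
    max_attempts: int = 3,
) -> str:
    """Re-run ASR until text length / audio duration reaches the preset ratio.

    `max_attempts` is a hypothetical safety cap so the loop always terminates.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    for _ in range(max_attempts):
        text = asr(audio)  # initial recognition text for this subsample
        if len(text) / duration_s >= min_chars_per_second:
            return text  # accepted as the target recognition text
    raise RuntimeError("ASR output stayed below the preset ratio")
```

A ratio far below what conversational speech normally produces suggests that the recognizer dropped content, which is presumably why the source treats it as a trigger for re-recognition.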

[0018] In one embodiment, the step of iteratively training the podcast speech generation model to be trained based on the sample podcast text and the sample podcast audio corresponding to the sample podcast text to obtain the trained podcast speech generation model includes:

[0019] The text feature vector corresponding to the sample podcast text and the scene feature vector corresponding to the podcast scene are fused to obtain the fused feature vector corresponding to the sample podcast text.

[0020] The fused feature vector is input into the podcast voice generation model to be trained to obtain the predicted probability of the sample podcast text under each candidate podcast audio.

[0021] Based on the difference between the candidate podcast audio with the highest predicted probability and the sample audio, the podcast speech generation model to be trained is iteratively trained to obtain the trained podcast speech generation model.
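
As a toy illustration of the fusion and candidate-scoring steps in this embodiment, the sketch below fuses two vectors and scores candidates with a linear layer followed by a softmax. Everything here is an assumption made for illustration: the source does not specify the fusion operator (concatenation is used here), the scoring function, or the names `fuse`, `score_candidates`, and `best_candidate`.

```python
import math
from typing import List, Sequence


def fuse(text_vec: Sequence[float], scene_vec: Sequence[float]) -> List[float]:
    """Fuse by concatenation (assumed; the source does not fix the operator)."""
    return list(text_vec) + list(scene_vec)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_candidates(fused: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """One linear score per candidate audio (one weight row each), then softmax."""
    logits = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return softmax(logits)


def best_candidate(probs: Sequence[float]) -> int:
    """Index of the candidate with the highest predicted probability."""
    return max(range(len(probs)), key=probs.__getitem__)
```

In the described method, the difference between the selected candidate and the sample podcast audio would then drive the parameter update; that step is not shown here.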

[0022] In one embodiment, the language processing model fine-tuned for the podcast scenario is obtained in the following manner:

[0023] Obtain the first sample podcast text in the podcast scenario, and the first target podcast text corresponding to the first sample podcast text;

[0024] The first sample podcast text is input into the language processing model to be adjusted to obtain the first processed podcast text corresponding to the first sample podcast text.

[0025] Based on the difference between the first processed podcast text and the first target podcast text, the language processing model to be adjusted is fine-tuned to obtain the language processing model fine-tuned for the podcast scenario.

[0026] In one embodiment, before removing target semantic words from the initial podcast text using a language processing model fine-tuned for the podcast scenario to obtain sample podcast text for the podcast speech generation model to be trained, the method further includes:

[0027] Obtain the second sample podcast text in the podcast scenario, and the second target podcast text corresponding to the second sample podcast text;

[0028] The second sample podcast text is input into the fine-tuned language processing model to obtain the second processed podcast text corresponding to the second sample podcast text.

[0029] Based on the second target podcast text and the second processed podcast text, the verification results of the fine-tuned language processing model are obtained;

[0030] The step of removing target semantic words from the initial podcast text using a language processing model fine-tuned for the podcast scenario, to obtain sample podcast text for the podcast speech generation model to be trained, includes:

[0031] If the verification result indicates that the fine-tuned language processing model has passed the verification, the target semantic words in the initial podcast text are deleted by using the fine-tuned language processing model for the podcast scenario to obtain the sample podcast text for the podcast speech generation model to be trained.
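
The verification gate described in this embodiment might look like the following sketch. The source does not say how the second processed text is compared with the second target text, so the exact-match rate and the `min_match_rate` threshold used here are assumptions, as is the function name `passes_validation`.

```python
from typing import Callable, Iterable, Tuple


def passes_validation(
    model: Callable[[str], str],
    pairs: Iterable[Tuple[str, str]],
    min_match_rate: float = 0.9,
) -> bool:
    """Return True if the fine-tuned model reproduces enough target texts.

    `pairs` holds (second sample podcast text, second target podcast text).
    Exact string match after stripping whitespace is an assumed metric.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("validation set is empty")
    matches = sum(1 for src, target in pairs if model(src).strip() == target.strip())
    return matches / len(pairs) >= min_match_rate
```

Only when this check passes would the fine-tuned model be applied to the initial podcast text; a softer metric such as edit distance could replace exact match without changing the structure of the gate.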

[0032] In one embodiment, after removing target semantic words from the initial podcast text using a language processing model fine-tuned for the podcast scenario to obtain sample podcast text for the podcast speech generation model to be trained, the method further includes:

[0033] The sample podcast text is verified to obtain the verification result corresponding to the sample podcast text;

[0034] If the verification result indicates that the sample podcast text fails the verification, the fine-tuned language processing model is fine-tuned again to obtain a re-fine-tuned language processing model.

[0035] In a second aspect, this application also provides a speech generation method, including:

[0036] Obtain the podcast text to be analyzed in a podcast context;

[0037] The podcast text to be analyzed is input into the trained podcast speech generation model corresponding to the podcast scene to obtain the target podcast audio corresponding to the podcast text to be analyzed; the trained podcast speech generation model is trained by the speech generation model training method.

[0038] In a third aspect, this application also provides a speech generation model training device, comprising:

[0039] The sample acquisition module is used to acquire the initial podcast text corresponding to sample podcast audio in a podcast scenario.

[0040] The sample processing module is used to remove target semantic words from the initial podcast text using a language processing model fine-tuned for the podcast scenario, thereby obtaining sample podcast text for the podcast speech generation model to be trained; the target semantic words are used to represent words in the initial podcast text that lack actual semantic information.

[0041] The model training module is used to iteratively train the podcast speech generation model to be trained based on the sample podcast text and the sample podcast audio corresponding to the sample podcast text, so as to obtain the trained speech generation model.

[0042] In a fourth aspect, this application also provides a speech generation apparatus, comprising:

[0043] The text acquisition module is used to acquire podcast text to be analyzed in podcast scenarios.

[0044] The speech generation module is used to input the podcast text to be analyzed into the trained podcast speech generation model to obtain the target podcast audio corresponding to the podcast text to be analyzed; the trained podcast speech generation model is trained by the speech generation model training method.

[0045] In a fifth aspect, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:

[0046] Obtain the initial podcast text corresponding to the sample podcast audio in a podcast scenario;

[0047] Using a language processing model fine-tuned for the podcast scenario, remove target semantic words from the initial podcast text to obtain sample podcast text for the podcast speech generation model to be trained; the target semantic words represent words in the initial podcast text that lack actual semantic information.

[0048] Based on the sample podcast text and the corresponding sample podcast audio, the podcast speech generation model to be trained is iteratively trained to obtain the trained podcast speech generation model.

[0049] In a sixth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the following steps:

[0050] Obtain the initial podcast text corresponding to the sample podcast audio in a podcast scenario;

[0051] Using a language processing model fine-tuned for the podcast scenario, remove target semantic words from the initial podcast text to obtain sample podcast text for the podcast speech generation model to be trained; the target semantic words represent words in the initial podcast text that lack actual semantic information.

[0052] Based on the sample podcast text and the corresponding sample podcast audio, the podcast speech generation model to be trained is iteratively trained to obtain the trained podcast speech generation model.

[0053] In a seventh aspect, this application also provides a computer program product, including a computer program that, when executed by a processor, performs the following steps:

[0054] Obtain the initial podcast text corresponding to the sample podcast audio in a podcast scenario;

[0055] Using a language processing model fine-tuned for the podcast scenario, remove target semantic words from the initial podcast text to obtain sample podcast text for the podcast speech generation model to be trained; the target semantic words represent words in the initial podcast text that lack actual semantic information.

[0056] Based on the sample podcast text and the corresponding sample podcast audio, the podcast speech generation model to be trained is iteratively trained to obtain the trained podcast speech generation model.

[0057] The aforementioned speech generation model training method, apparatus, computer equipment, storage medium, and computer program product first obtain the initial podcast text corresponding to sample podcast audio in a podcast scenario. Then, using a language processing model fine-tuned for the podcast scenario, target semantic words are removed from the initial podcast text to obtain sample podcast text for the podcast speech generation model to be trained. Target semantic words are used to represent words in the initial podcast text that lack actual semantic information. Finally, based on the sample podcast text and the corresponding sample podcast audio, the podcast speech generation model to be trained is iteratively trained to obtain the trained podcast speech generation model corresponding to the podcast scenario.

[0058] The beneficial effects of this application are as follows: By obtaining the initial podcast text corresponding to sample podcast audio in a podcast scenario, and using a language processing model fine-tuned for the podcast scenario, words lacking actual semantic information in the initial podcast text can be accurately removed, thereby obtaining sample podcast text that retains the core semantics of the text. Furthermore, based on the sample podcast text and the corresponding sample podcast audio, the podcast speech generation model to be trained can be iteratively trained more accurately, enabling the trained podcast speech generation model to learn more targeted and customized colloquial expressions, which is beneficial to improving the training quality of the speech generation model.

Attached Figure Description

[0059] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0060] Figure 1 is a flowchart illustrating an existing podcast generation scheme in one embodiment;

[0061] Figure 2 is a flowchart illustrating a speech generation model training method in one embodiment;

[0062] Figure 3 is a flowchart illustrating the speech generation model training method in another embodiment;

[0063] Figure 4 is a flowchart illustrating a speech generation method in one embodiment;

[0064] Figure 5 is a flowchart illustrating the speech generation model training method in yet another embodiment;

[0065] Figure 6 is a structural block diagram of a speech generation model training device in one embodiment;

[0066] Figure 7 is a structural block diagram of a speech generation device in one embodiment;

[0067] Figure 8 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0068] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0069] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0070] As shown in Figure 1, an existing podcast generation solution is implemented through the following steps. Step 1, music-related information collection: using a playlist as the initial input, an AI model collects music information and, by combining music reviews and song background information, organizes it into "Clean Text" (clean, structured text material). Step 2, script generation: based on this material, the AI generates an initial text script according to the requirements of a "conversational podcast script", i.e., the timestamped text fragment shown on the right of the figure. Step 3, script polishing: the AI model further polishes the script to make it more vivid and everyday, adapting it to the conversational expression of a podcast. Step 4, podcast audio generation: the "TTS (Text-to-Speech) - QinYu v4.2" speech synthesis model converts the polished script into speech, which is then combined with music from the music library to synthesize the final podcast audio with a conversational podcast effect. The entire process achieves end-to-end AI (Artificial Intelligence) automated production from music information to finished podcast audio. However, all personalization for conversational style happens during the text generation stage, and the TTS model synthesizes the audio strictly according to the input text, so the result depends heavily on the content of the generated text. When the prompts given to the large language model are the same, the generated text tends to be highly homogeneous from run to run. In addition, if highly colloquial words such as "um," "ah," and "oh" are not generated, the quality of the TTS synthesis drops considerably.

[0071] In addition, high-quality paired text-audio data is a core factor determining the performance of the language model in a TTS system. During training, the language model needs to learn the correspondence between text sequences and speech sequences in order to model the "text-to-speech generation logic," and its output quality directly affects the fluency, semantic coherence, and scene adaptability of the synthesized TTS speech.

[0072] Existing TTS language model training solutions suffer from the following technical challenges:

[0073] 1. Text data quality is difficult to guarantee: Podcast audio often contains meaningless semantic words (such as interjections like "um" and "ah", repeated phrases like "where were we just talking about?", and invalid text corresponding to blurred speech caused by background interference). If used directly for training, the TTS language model will learn "redundant semantic mapping relationships", thereby generating synthesized speech with redundant information and incoherent semantics.

[0074] 2. Poor adaptability of language models to business scenarios: Existing solutions mostly use general text corpora (such as news and novels) to train language models, without customizing and optimizing them for the colloquial expressions, music-related terms (such as "chords" and "melody"), and scenario-based dialogue (such as "Next, we will interpret this folk song for you") of podcast scenarios, resulting in low adaptability of TTS synthesized speech to podcast business scenarios.

[0075] Therefore, this application provides a method for training a speech generation model, which can solve the problems mentioned above and improve the training quality of the speech generation model.

[0076] In one exemplary embodiment, as shown in Figure 2, a method for training a speech generation model is provided. This embodiment is described with the method applied to a server; it is understood that the method can also be applied to a terminal, or to a system including a terminal and a server, where it is implemented through interaction between the terminal and the server. The terminal can be, but is not limited to, a personal computer, laptop, smartphone, or tablet; the server can be a standalone physical server, a server cluster or distributed system consisting of multiple physical servers, or a cloud server providing cloud computing services. In this embodiment, the method includes the following steps:

[0077] Step S201: Obtain the initial podcast text corresponding to the sample podcast audio in the podcast scenario.

[0078] The podcast scenario, also known as the podcast business scenario, refers to content scenarios centered on conversational audio programs (such as music interpretation and casual audio programs). Its characteristics are conversational expression, the inclusion of interjections and filler words, and adaptation to the listener's auditory experience.

[0079] Sample podcast audio refers to existing real podcast audio material in the podcast scenario (such as the recording of a certain music podcast episode); it is the podcast audio used for iterative training of the podcast speech generation model to be trained.

[0080] The initial podcast text refers to the text content obtained after performing Automatic Speech Recognition (ASR) on the sample podcast audio.

[0081] For example, in response to a model training instruction for a podcast audio generation model to be trained, the server obtains sample podcast audio in a podcast scenario; then, it performs noise suppression and format normalization on the sample podcast audio to obtain preprocessed sample podcast audio; then, it inputs the preprocessed sample podcast audio into the ASR model to obtain the text corresponding to the preprocessed sample podcast audio, which serves as the initial podcast text corresponding to the sample podcast audio in the podcast scenario.

[0082] Step S202: Using a language processing model fine-tuned for the podcast scenario, target semantic words are removed from the initial podcast text to obtain sample podcast text for the podcast speech generation model to be trained.

[0083] The fine-tuned language processing model is also called the fine-tuned large language model.

[0084] Target semantic words represent words in the initial podcast text that lack actual semantic information. They are defined according to the characteristics of the podcast scenario and serve as the basis for optimizing the large language model. They are also called meaningless semantic words and include the words in Table 1.

[0085] Table 1. Meaningless semantic words

[0086]

[0087] The podcast speech generation model refers to the network model that converts podcast text into podcast speech, also known as the TTS model or Speech Language Model.

[0088] The sample podcast text refers to the text obtained by processing the initial podcast text with the fine-tuned language processing model (that is, by removing the target semantic words).

[0089] For example, the server inputs the initial podcast text into a language processing model fine-tuned for the podcast scenario. The model performs podcast-scenario-adapted word segmentation on the initial podcast text to obtain multiple word segments, then applies semantic encoding to obtain a word vector for each segment. Next, it retrieves the preset semantic words from a predefined target semantic word list and computes their word vectors using the same semantic encoding logic. Based on the word vectors of the segments and of the preset semantic words, it determines the similarity between each segment and the preset semantic words, and selects the segments whose similarity exceeds a preset similarity (e.g., 0.8) as candidate words. The candidate words are then semantically verified: non-target words that merely sound alike or look alike but differ in meaning are discarded, and the remaining candidate words are taken as the target semantic words in the initial podcast text. Finally, the language processing model fine-tuned for the podcast scenario deletes the target semantic words from the initial podcast text, yielding the sample podcast text for the podcast speech generation model to be trained.
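
The similarity-based selection described above can be sketched in a few lines. This is a hedged illustration only: cosine similarity is an assumed choice of similarity measure, the `embed` lookup table stands in for the model's semantic encoding, the names `candidate_fillers` and `remove_words` are hypothetical, and the semantic verification of homophones and look-alike words is omitted.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def candidate_fillers(
    segments: List[str],
    embed: Dict[str, Sequence[float]],
    preset: List[str],
    threshold: float = 0.8,  # example value given in the source
) -> List[str]:
    """Segments whose similarity to any preset semantic word exceeds the threshold."""
    preset_vecs = [embed[w] for w in preset]
    return [s for s in segments if any(cosine(embed[s], p) > threshold for p in preset_vecs)]


def remove_words(segments: List[str], to_remove: List[str]) -> List[str]:
    """Drop every occurrence of the selected words (a simplification)."""
    drop = set(to_remove)
    return [s for s in segments if s not in drop]
```

With a toy table such as `{"um": [1.0, 0.0], "uh": [0.95, 0.1], "guitar": [0.0, 1.0], "solo": [0.1, 0.9]}` and `preset=["um"]`, only "um" and "uh" clear the 0.8 threshold, so "guitar" and "solo" survive removal.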

[0090] For example, referring to Figure 3, the language processing model fine-tuned for the podcast scenario removes the target semantic words from the initial podcast text "Actually, when a certain singer recorded 'a certain song,' he actually went to the beach for three days and practiced his voice by listening to the sound of the waves every day" to obtain the sample podcast text "Actually, when a certain singer recorded 'a certain song,' he actually went to the beach for three days and practiced his voice by listening to the sound of the waves every day" for the podcast speech generation model to be trained.

[0091] It should be noted that the language processing model fine-tuned for the podcast scenario automatically identifies and removes meaningless semantic words, resulting in optimized podcast text. For example, initial podcast text: "Um, so, next, I'll be explaining this rock song to you... <delete>Well</delete> guitar solo part"; optimized podcast text: "Next, we will explain the guitar solo part of this rock song."

[0092] Step S203: Based on the sample podcast text and the corresponding sample podcast audio, iteratively train the podcast speech generation model to be trained to obtain the trained podcast speech generation model.

[0093] For example, the server inputs sample podcast text into the podcast speech generation model to be trained, and obtains the predicted podcast audio corresponding to the sample podcast text output by the podcast speech generation model to be trained; then, based on the difference between the predicted podcast audio corresponding to the sample podcast text and the sample podcast audio corresponding to the sample podcast text, a loss value is obtained; then, based on the loss value, the podcast speech generation model to be trained is iteratively trained to obtain the trained podcast speech generation model.
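
The predict, compute loss, update cycle described above can be illustrated with a deliberately tiny stand-in. The sketch below fits a two-parameter linear map by gradient descent on mean squared error; a real podcast speech generation model would be a neural network trained on acoustic features, and the names `mse` and `train`, the loss choice, and the hyperparameters are all assumptions rather than details from the source.

```python
from typing import List, Sequence, Tuple


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and reference features."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def train(
    pairs: Sequence[Tuple[List[float], List[float]]],
    lr: float = 0.1,
    epochs: int = 500,
) -> Tuple[float, float]:
    """Fit y = w * x + b elementwise as a toy stand-in for a TTS model.

    Each pair is (text features, audio features) of equal length.
    """
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for text_feats, audio_feats in pairs:
            n = len(audio_feats)
            preds = [w * x + b for x in text_feats]  # "predicted podcast audio"
            # Gradients of the MSE loss with respect to w and b.
            grad_w = sum(2 * (p - t) * x for p, t, x in zip(preds, audio_feats, text_feats)) / n
            grad_b = sum(2 * (p - t) for p, t in zip(preds, audio_feats)) / n
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b
```

The point of the sketch is the loop structure: the loss is computed from the difference between predicted and sample audio, and the parameters are updated repeatedly until the fit is good enough.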

[0094] In the above-described speech generation model training method, the initial podcast text corresponding to sample podcast audio in a podcast scenario is first obtained. Then, using a language processing model fine-tuned for the podcast scenario, target semantic words are removed from the initial podcast text to obtain sample podcast text for the podcast speech generation model to be trained. Target semantic words are used to represent words in the initial podcast text that lack actual semantic information. Finally, based on the sample podcast text and the corresponding sample podcast audio, the podcast speech generation model to be trained is iteratively trained to obtain the trained podcast speech generation model corresponding to the podcast scenario. In this way, when training the speech generation model, by obtaining the initial podcast text corresponding to the sample podcast audio in a podcast scenario and using a language processing model fine-tuned for the podcast scenario, words lacking actual semantic information in the initial podcast text can be accurately removed, thus obtaining sample podcast text that retains the core semantics of the text. Furthermore, based on the sample podcast text and the corresponding sample podcast audio, the podcast speech generation model to be trained can be iteratively trained more accurately, enabling the trained podcast speech generation model to learn more targeted and customized colloquial expressions, which is beneficial to improving the training quality of the speech generation model.

[0095] It should be noted that the above-mentioned speech generation model training method also has the following technical effects:

[0096] 1. Ensure the ability to generate conversational TTS: By accurately removing meaningless semantic words through a large language model, the core semantics of the text are preserved, and the TTS language model learns conversational expressions. After professional testing by the evaluation team, the conversational level of the synthesized podcast voice has been improved by 68%.

[0097] 2. Optimize the scene adaptability of the TTS language model: The language model is trained based on optimized text-audio data for podcast scenarios, which improves the accuracy of the pronunciation of music terms in synthesized speech by 7-10 percentage points and the adaptability of scene-based expressions (such as "interpretation" and "introduction") by 25%, making it more in line with the speech synthesis needs of podcast business.

[0098] 3. Highly scalable technical process: The "ASR-large language model-TTS language model" technical chain of this solution can be migrated to other conversational speech scenarios such as audiobooks and voice navigation. Only the scenario-fine-tuning data of ASR and large language model need to be adjusted to quickly adapt to new scenarios, which has good versatility.

[0099] In an exemplary embodiment, step S201 above, obtaining the initial podcast text corresponding to the sample podcast audio in the podcast scenario, specifically includes the following: segmenting the sample podcast audio in the podcast scenario to obtain multiple sub-sample podcast audio; inputting each sub-sample podcast audio into a speech recognition model to obtain the target recognition text corresponding to each sub-sample podcast audio; and obtaining the initial podcast text based on the target recognition text corresponding to each sub-sample podcast audio.

[0100] Here, the sub-sample podcast audio refers to the independent audio segments obtained by segmenting the sample podcast audio.

[0101] The target recognition text refers to the final text content obtained by performing automatic speech recognition (ASR) on the sub-sample podcast audio.

[0102] For example, the server acquires sample podcast audio in a podcast scenario in WAV (Waveform Audio File Format) or MP3 (MPEG-1 Audio Layer 3, a lossy audio compression format) format with a sampling rate of 44.1 kHz or 48 kHz. Spectral subtraction is then used to remove background noise (such as ambient noise or slight electrical noise) from the sample podcast audio while preserving the voice signal, yielding noise-suppressed sample podcast audio. Next, if the duration of the noise-suppressed sample podcast audio exceeds a preset audio duration (e.g., 10 minutes), the audio is segmented at its semantic pause points, resulting in multiple sub-sample podcast audios of 5-10 minutes each. The pause points are obtained by applying an energy threshold to the noise-suppressed audio; for example, segments with energy below -40 dBFS are treated as pause points. This avoids the semantic disconnections that excessively long sequences cause during ASR recognition. Each sub-sample podcast audio is then uniformly converted to the 16-bit mono PCM (Pulse Code Modulation) format required by the ASR model, yielding sub-sample podcast audio in the preset format and ensuring a consistent input format. Then, each preset-format sub-sample podcast audio is input into the speech recognition model, and the recognized text output for it is used as the target recognition text of the corresponding sub-sample podcast audio. Finally, the target recognition texts corresponding to the sub-sample podcast audios are used as the initial podcast text.
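
The energy-threshold pause detection described above can be illustrated with a minimal sketch. It assumes the audio is already decoded into floating-point samples in the range [-1.0, 1.0]; the function names, the frame length, and the minimum-segment parameter are hypothetical and are not specified by the embodiment.

```python
import math

def frame_dbfs(frame):
    """RMS level of one frame in dBFS; samples are floats in [-1.0, 1.0]."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")

def find_pause_frames(samples, frame_len, threshold_dbfs=-40.0):
    """Return indices of frames whose level falls below the threshold."""
    pauses = []
    for i in range(0, len(samples), frame_len):
        if frame_dbfs(samples[i:i + frame_len]) < threshold_dbfs:
            pauses.append(i // frame_len)
    return pauses

def split_at_pauses(samples, frame_len, min_segment_frames, threshold_dbfs=-40.0):
    """Cut at a pause frame once the running segment is long enough.

    min_segment_frames is a hypothetical knob standing in for the
    5-10 minute target length mentioned in the text.
    """
    segments, start = [], 0
    for p in find_pause_frames(samples, frame_len, threshold_dbfs):
        cut = p * frame_len
        if cut - start >= min_segment_frames * frame_len:
            segments.append(samples[start:cut])
            start = cut
    if start < len(samples):
        segments.append(samples[start:])
    return segments
```

In practice the frame length would be chosen from the sampling rate (for instance a few tens of milliseconds), and the minimum segment length would be set so that the resulting segments fall in the desired duration range.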

[0103] It should be noted that this embodiment uses an end-to-end ASR model adapted to spoken language scenarios, such as the Conformer (Convolution-augmented Transformer, a hybrid model architecture that combines the advantages of convolutional neural networks and Transformers) model based on Transformer. This model is pre-trained on podcast domain corpus (containing 100,000 hours of podcast audio-text pairing data) and has the following optimization capabilities:

[0104] 1. Supports the recognition of colloquial expressions: It can accurately identify colloquial sentence breaks in podcasts (such as the pauses after "right" or "you know what") and elliptical expressions (such as a shortened spoken form of "especially" in "this song is especially good").

[0105] 2. Supports music-related terminology recognition: By adding a "music terminology dictionary" (containing 2000+ professional terms such as "arpeggio", "mode", and "mixing") to the pre-training corpus, the terminology recognition accuracy is ≥98%.

[0106] 3. Perform ASR recognition: Input the preprocessed sub-audio into the ASR model and output the initial text file corresponding to each sub-audio. The text file is named "audio ID_sub-segment number.txt" and records the "text-audio timestamp mapping relationship" (e.g., sentences 1-5 in the text correspond to audio segments 00:00-00:20) for subsequent text-audio pairing verification.
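
The file naming and timestamp mapping in item 3 could be represented roughly as follows. This is only a sketch: the helper names and the dictionary layout are assumptions, and whole-second timestamps are assumed for simplicity.

```python
def segment_filename(audio_id: str, sub_index: int) -> str:
    """Build a name of the form 'audio ID_sub-segment number.txt'."""
    return f"{audio_id}_{sub_index}.txt"

def format_span(start_s: int, end_s: int) -> str:
    """Render a span of whole seconds as 'MM:SS-MM:SS'."""
    def mmss(t: int) -> str:
        return f"{t // 60:02d}:{t % 60:02d}"
    return f"{mmss(start_s)}-{mmss(end_s)}"

def build_mapping(sentence_spans):
    """sentence_spans: list of (first_sentence, last_sentence, start_s, end_s).

    Returns one record per span linking sentence numbers to an audio span
    (hypothetical layout, used later for text-audio pairing checks).
    """
    return [
        {"sentences": (first, last), "audio": format_span(start, end)}
        for first, last, start, end in sentence_spans
    ]
```

For example, `build_mapping([(1, 5, 0, 20)])` records that sentences 1-5 correspond to the audio span `00:00-00:20`.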

[0107] In this embodiment, by selectively collecting sample podcast audio from podcast scenarios, performing semantic-first segmentation, and integrating spoken language-adapted speech recognition and text, efficient splitting and accurate transcription of long podcast audio are achieved. This improves the accuracy of text transcription and provides high-quality and scenario-adaptable basic data for subsequent text processing of language processing models and training of podcast speech generation models.

[0108] In an exemplary embodiment, inputting each sub-sample podcast audio into a speech recognition model to obtain the target recognition text corresponding to each sub-sample podcast audio specifically includes: for any sub-sample podcast audio among the multiple sub-sample podcast audios, inputting the sub-sample podcast audio into the speech recognition model to obtain the initial recognition text corresponding to the sub-sample podcast audio; obtaining the text length of the initial recognition text and the audio duration of the sub-sample podcast audio; if the ratio of the text length to the audio duration is greater than or equal to a preset ratio required by the podcast scenario, using the initial recognition text as the target recognition text corresponding to the sub-sample podcast audio; and if the ratio is less than the preset ratio, returning to the step of inputting the sub-sample podcast audio into the speech recognition model to obtain the initial recognition text, until the ratio of the text length to the audio duration is greater than or equal to the preset ratio required by the podcast scenario.

[0109] The initial recognized text refers to the original text content output by the speech recognition model after performing speech recognition on the subsample podcast audio.

[0110] The text length refers to the number of characters (including Chinese characters, punctuation, colloquial interjections, etc.) in the recognized text corresponding to each subsample podcast audio. For example, a 5-minute subsample podcast audio is transcribed into a text of 1200 characters, and its text length is 1200.

[0111] The audio duration is used to represent the effective playback duration of the subsample podcast audio, in seconds (s).

[0112] The preset ratio refers to the threshold ratio of text length to audio duration set in advance for podcast scenarios (e.g., 0.5); it is the core criterion for judging whether an ASR transcription result is acceptable.

[0113] For example, for any one of the multiple sub-sample podcast audios, the server inputs the sub-sample podcast audio into the speech recognition model to obtain the initial recognition text output by the model; counts the characters in the initial recognition text and uses this count as the text length; and identifies the audio duration of the sub-sample podcast audio through a voice activity detection model. The server then determines the preset ratio required by the podcast scenario according to the spoken-expression characteristics of that scenario, calculates the ratio of the text length to the audio duration of the initial recognition text, and compares the calculated ratio with the preset ratio. If the calculated ratio is greater than or equal to the preset ratio, the initial recognition text is used as the target recognition text corresponding to the sub-sample podcast audio. If the calculated ratio is less than the preset ratio, the server returns to the step of inputting the sub-sample podcast audio into the speech recognition model to obtain the initial recognition text, and repeats this until the ratio is greater than or equal to the preset ratio required by the podcast scenario.

[0114] For example, a basic length check is performed on the initial recognition text that ASR generates for the sub-sample podcast audio. If the ratio of its text length (number of characters) to the audio duration (seconds) is less than 0.5 (the ratio for normal spoken expression is 1-1.5), a recognition anomaly is determined and ASR recognition of the sub-sample podcast audio is re-executed.
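
The length check and re-recognition loop can be sketched as below. The `asr_fn` callable stands in for the speech recognition model and is hypothetical; the `max_attempts` cap is also an assumption added so the sketch always terminates, whereas the embodiment simply repeats recognition until the ratio is met.

```python
def length_duration_ratio(text: str, duration_s: float) -> float:
    """Characters of recognized text per second of audio."""
    if duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return len(text) / duration_s

def transcribe_with_check(audio, duration_s, asr_fn, min_ratio=0.5, max_attempts=3):
    """Re-run ASR until the text/duration ratio reaches min_ratio.

    asr_fn is a hypothetical callable mapping audio to recognized text.
    Returns (target_recognition_text, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        text = asr_fn(audio)
        if length_duration_ratio(text, duration_s) >= min_ratio:
            return text, attempt
    raise RuntimeError("recognition anomaly persisted after retries")
```

With the default threshold, a 20-second segment transcribed as 2 characters (ratio 0.1) is rejected and re-recognized, while a transcription of 30 characters (ratio 1.5) is accepted.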

[0115] In this embodiment, an automated quality screening mechanism for speech recognition results in podcast scenarios is established by using the quantitative indicator of the ratio of text length to audio duration. This mechanism can accurately identify problems such as missing transcription content caused by excessive audio noise reduction or insufficient model parameter adaptability, thus avoiding situations where the text content is sparse or key spoken language components are missed.

[0116] In an exemplary embodiment, step S203 above, which iteratively trains the podcast speech generation model to be trained based on the sample podcast text and the corresponding sample podcast audio, to obtain a trained podcast speech generation model, specifically includes the following: fusing the text feature vector corresponding to the sample podcast text and the scene feature vector corresponding to the podcast scene to obtain a fused feature vector corresponding to the sample podcast text; inputting the fused feature vector into the podcast speech generation model to be trained to obtain the predicted probability of the sample podcast text under each candidate podcast audio; and iteratively training the podcast speech generation model to be trained based on the difference between the candidate podcast audio with the highest predicted probability and the sample audio, to obtain a trained podcast speech generation model.

[0117] The text feature vector is the representation vector corresponding to the sample podcast text.

[0118] The scene feature vector is the representation vector corresponding to the podcast scene.

[0119] The fused feature vector is the feature vector obtained by fusing the text feature vector and the scene feature vector.

[0120] Candidate podcast audio refers to podcast audio to be selected. It should be noted that different candidate podcast audio includes different broadcasters' pronunciation styles, speaking speeds, and tones (such as calm and lively styles), as well as different colloquial expression patterns (such as those with interjections and interactive phrases). It serves as a reference audio library for the podcast speech generation model to make predictions. The model will match the optimal podcast audio from the candidate podcast audio based on the fused feature vectors.

[0121] The prediction probability represents the likelihood, as determined by the podcast speech generation model, that a candidate podcast audio is the correct match for the sample podcast text.

[0122] For example, the server extracts the text feature vector corresponding to the sample podcast text using a text feature extraction network, and extracts the scene feature vector corresponding to the podcast scene using a scene feature extraction network. Next, it inputs the text feature vector and the scene feature vector into an attention mechanism model to obtain their fusion weights, and computes the weighted sum of the two vectors according to these weights to obtain the fused feature vector corresponding to the sample podcast text. It then inputs the fused feature vector into the podcast speech generation model to be trained to obtain the predicted probability of the sample podcast text under each candidate podcast audio, and selects, from the candidate podcast audios, the one with the highest predicted probability as the predicted podcast audio corresponding to the sample podcast text. Next, the audio feature vectors of the predicted podcast audio and the sample podcast audio are extracted, and a loss value is obtained from the difference between them. Based on the loss value, the model parameters of the podcast speech generation model to be trained are adjusted, and the model with adjusted parameters is trained again until the loss value it obtains is less than a loss value threshold. Training then stops, and the resulting model is taken as the trained podcast speech generation model.
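
The weighted fusion step can be illustrated with a small sketch. The embodiment does not state how the attention mechanism model computes the fusion weights; here, as an assumption, each vector is scored by its dot product with a hypothetical query vector and the scores are normalized with a softmax.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse(text_vec, scene_vec, query_vec):
    """Weighted sum of the text and scene vectors.

    query_vec is a hypothetical (in practice learned) attention query;
    the two fusion weights are non-negative and sum to 1.
    """
    weights = softmax([dot(query_vec, text_vec), dot(query_vec, scene_vec)])
    fused = [weights[0] * t + weights[1] * s for t, s in zip(text_vec, scene_vec)]
    return fused, weights
```

With a zero query vector both scores are equal, so the two vectors are averaged; a trained query would shift weight toward whichever vector is more informative for the sample.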

[0123] For example, referring to Figure 3, for a sample podcast text and its corresponding sample podcast audio (text-audio pairs), the text feature vector corresponding to the sample podcast text is extracted (Text Encoder) and the audio feature vector corresponding to the sample podcast audio is extracted (Audio Encoder). Based on the difference between the text feature vector and the audio feature vector, the podcast speech generation model to be trained is iteratively trained to obtain the trained podcast speech generation model.

[0124] It should be noted that, based on the "text-audio timestamp mapping relationship", each sentence of the optimized podcast text is bound to the corresponding audio segment (located by timestamp) to form a "sentence-level text-audio pairing unit". The proportion of audio segments that are too short (text < 5 characters, audio < 1 second) or too long (text > 100 characters, audio > 10 seconds) is calculated to ensure that the distribution is reasonable and to avoid the language model learning semantic mappings of extreme lengths.
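
The length-distribution check on sentence-level pairing units might look like the following sketch. The text lists a character bound and a duration bound for each extreme; this sketch assumes that meeting either bound is enough to count a unit as extreme, which is an interpretation rather than something the embodiment states. The function names are hypothetical.

```python
def classify_unit(text: str, audio_s: float) -> str:
    """Label one sentence-level text-audio pairing unit by length."""
    if len(text) < 5 or audio_s < 1.0:
        return "short"
    if len(text) > 100 or audio_s > 10.0:
        return "long"
    return "ok"

def length_distribution(units):
    """units: list of (sentence_text, audio_duration_seconds).

    Returns the proportion of units in each length class.
    """
    counts = {"short": 0, "long": 0, "ok": 0}
    for text, audio_s in units:
        counts[classify_unit(text, audio_s)] += 1
    total = len(units)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}
```

The resulting proportions can then be compared against whatever distribution is considered reasonable for the corpus before the pairs are used for training.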

[0125] It should be noted that the TTS system's language model adopts a three-layer architecture of "text encoder - semantic modeling layer - output layer," and is customized for podcast scenarios.

[0126] 1. Text Encoder: Byte-level BPE (byte-pair encoding) is used as the tokenization method. The vocabulary contains 20,000+ core words for podcast scenarios (including music terms and colloquial expressions), and the encoder converts the optimized text into a token sequence that the model can process.

[0127] 2. Semantic Modeling Layer: Based on the Transformer decoder, it learns the semantic dependencies between text tokens through the "self-attention mechanism" (such as the association between "guitar" and "solo", and the association between "folk song" and "guitar accompaniment"), and introduces "podcast scene feature embedding" (such as scene label embedding such as "music interpretation" and "singer introduction"), so that the model can adjust the semantic modeling strategy according to the scene.

[0128] 3. Output layer: Outputs the "predicted probability distribution of voice tokens" corresponding to the text tokens (voice tokens are generated from podcast audio through vector quantization), realizing the semantic mapping model of "text to voice tokens".

[0129] In this embodiment, by extracting the semantic feature vector of the sample podcast text and fusing it with the podcast scene-specific feature vector, the podcast speech generation model learns both the semantics of the text content and the conversational scene attributes of the podcast. At the same time, by using the candidate audio matching and prediction probability screening mechanism, the model accurately locates the speech sample with the highest fit to the sample podcast text, which significantly improves the model's adaptability to podcast scenes.

[0130] In an exemplary embodiment, the language processing model fine-tuned for the podcast scenario is obtained as follows: a first sample podcast text in the podcast scenario and a first target podcast text corresponding to the first sample podcast text are obtained; the first sample podcast text is input into the language processing model to be adjusted to obtain a first processed podcast text corresponding to the first sample podcast text; based on the difference between the first processed podcast text and the first target podcast text, the language processing model to be adjusted is fine-tuned to obtain the language processing model fine-tuned for the podcast scenario.

[0131] The first sample podcast text refers to the podcast text used to fine-tune the language processing model.

[0132] The first target podcast text refers to the reference podcast text that corresponds to the first sample podcast text and meets the expected processing standards.

[0133] Here, the first processed podcast text refers to the processing result corresponding to the first sample podcast text output by the language processing model to be adjusted.

[0134] For example, in response to a model fine-tuning instruction for the language processing model to be adjusted, the server obtains a first sample podcast text in the podcast scenario and a first target podcast text corresponding to the first sample podcast text. Then, the first sample podcast text is input into the language processing model to be adjusted to obtain a first processed podcast text corresponding to the first sample podcast text. Then, the text feature vectors of the first processed podcast text and the first target podcast text are extracted respectively, and a loss value is obtained based on the difference between the text feature vectors of the first processed podcast text and the first target podcast text. Then, based on the loss value, the model fine-tuning parameters of the language processing model to be adjusted are determined, and based on the model fine-tuning parameters, the language processing model to be adjusted is fine-tuned to obtain a language processing model fine-tuned for the podcast scenario.

[0135] For example, a large language model with strong semantic understanding capabilities is selected, and fine-tuned for podcast scenarios in the following ways to enable it to "identify and remove meaningless semantic words":

[0136] 1. Construct a fine-tuning dataset: Collect a certain amount of paired data consisting of initial podcast text and optimized text, where the optimized text is manually labeled with the meaningless semantic words to be removed. The labeling format is "<delete>meaningless semantic word</delete>", for example: "<delete>Um</delete> the melody of this folk song is very soothing."

[0137] 2. Fine-tuning task design: The “Instruction Tuning” mode is adopted. The instruction “Please remove meaningless semantic words from the following podcast text and retain the core semantic content: [Initial text]” is input into the large language model. The optimized text with manual annotation is used as the target output. The model is trained to learn the mapping relationship between “meaningless semantic word recognition and text optimization”.
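
As an illustration of the two items above, an annotated sentence can be turned into an instruction-tuning pair roughly as follows. This is a sketch that assumes well-formed `<delete>...</delete>` tags; the function name and the `prompt`/`target` record layout are hypothetical, while the instruction wording is taken from the text.

```python
import re

# Matches one annotated span; assumes tags are well-formed and not nested.
DELETE_TAG = re.compile(r"<delete>(.*?)</delete>", re.DOTALL)

INSTRUCTION = (
    "Please remove meaningless semantic words from the following podcast "
    "text and retain the core semantic content: "
)

def _tidy(text: str) -> str:
    """Collapse runs of whitespace left behind by tag removal."""
    return re.sub(r"\s+", " ", text).strip()

def build_training_pair(annotated: str) -> dict:
    """Derive (prompt, target) from one manually annotated sentence."""
    initial = _tidy(DELETE_TAG.sub(r"\1", annotated))  # keep the filler words
    target = _tidy(DELETE_TAG.sub("", annotated))      # drop the filler words
    return {"prompt": INSTRUCTION + initial, "target": target}
```

Keeping the annotation as the single source for both the input and the target avoids the two texts drifting apart when annotations are revised.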

[0138] In this embodiment, by selecting real text samples from podcast scenarios to construct precise training pairs, the language processing model to be adjusted learns the processing logic and standards of podcast conversational text during the targeted fine-tuning process. This results in the final fine-tuned language processing model having a stronger podcast scenario adaptability, providing high-quality and highly matched sample podcast texts for the subsequent training of the speech generation model.

[0139] In an exemplary embodiment, before step S202, in which target semantic words are deleted from the initial podcast text using a language processing model fine-tuned for the podcast scenario to obtain sample podcast text for the podcast speech generation model to be trained, the method further includes: obtaining a second sample podcast text in the podcast scenario and a second target podcast text corresponding to the second sample podcast text; inputting the second sample podcast text into the fine-tuned language processing model to obtain a second processed podcast text corresponding to the second sample podcast text; and obtaining the verification result of the fine-tuned language processing model based on the second target podcast text and the second processed podcast text.

[0140] Therefore, step S202 above, which involves removing target semantic words from the initial podcast text using a language processing model fine-tuned for the podcast scenario, to obtain sample podcast text for the podcast speech generation model to be trained, specifically includes the following: if the verification results indicate that the fine-tuned language processing model has passed verification, the target semantic words are removed from the initial podcast text using a language processing model fine-tuned for the podcast scenario, to obtain sample podcast text for the podcast speech generation model to be trained.

[0141] The second sample podcast text refers to the text used for model validation of the language processing model.

[0142] The second target podcast text refers to the reference podcast text that corresponds to the second sample podcast text and meets the expected processing standards.

[0143] Here, the second processed podcast text refers to the processing result corresponding to the second sample podcast text output by the fine-tuned language processing model.

[0144] The validation results are used to represent the target semantic word removal accuracy and core semantic retention rate of the fine-tuned language processing model.

[0145] For example, in response to a model validation instruction for the fine-tuned language processing model, the server obtains a second sample podcast text in the podcast scenario and a second target podcast text corresponding to the second sample podcast text. The second sample podcast text is then input into the fine-tuned language processing model to obtain the corresponding second processed podcast text. Next, based on the second target podcast text, the total number of labeled target semantic words in the second sample podcast text is obtained; based on the second processed podcast text, the number of target semantic words successfully deleted is obtained; and the ratio of the number successfully deleted to the total number labeled is used as the target semantic word removal accuracy of the fine-tuned language processing model. The text feature vector of the second processed podcast text and the text feature vector of the second target podcast text are then extracted, and the cosine similarity between them is used as the core semantic retention rate of the fine-tuned language processing model. The removal accuracy and the core semantic retention rate together serve as the verification result. If the target semantic word removal accuracy is greater than or equal to a preset accuracy and the core semantic retention rate is greater than or equal to a preset semantic retention rate, the verification result is that the fine-tuned language processing model has passed verification. In that case, the target semantic words in the initial podcast text are deleted by the language processing model fine-tuned for the podcast scenario to obtain the sample podcast text of the podcast speech generation model to be trained.

[0146] For example, the model's performance is verified on a test set (containing 10,000 podcast texts not used in fine-tuning), requiring a removal accuracy of ≥95% for meaningless semantic words and a core semantic retention rate of ≥99% (i.e., core musical terms such as "chords," "melody," and "mixing" are not deleted).
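
The two verification metrics and the pass condition can be sketched as follows. The feature vectors are assumed to come from some text feature extractor that is not shown here; the function names are hypothetical, and the default thresholds follow the 95% and 99% figures in the example.

```python
import math

def removal_accuracy(num_deleted: int, num_labeled: int) -> float:
    """Share of labeled target semantic words that were actually deleted."""
    return num_deleted / num_labeled if num_labeled else 1.0

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def passes_verification(num_deleted, num_labeled, processed_vec, target_vec,
                        min_accuracy=0.95, min_retention=0.99):
    """Return (passed, accuracy, retention) for one validation run."""
    accuracy = removal_accuracy(num_deleted, num_labeled)
    retention = cosine_similarity(processed_vec, target_vec)
    passed = accuracy >= min_accuracy and retention >= min_retention
    return passed, accuracy, retention
```

Both conditions must hold for the model to pass, which mirrors the requirement that filler removal must not come at the cost of deleting core terms.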

[0147] In this embodiment, by selecting a second sample podcast text and a corresponding second target podcast text independent of the training set, an objective model validation benchmark is constructed. This benchmark can accurately evaluate the generalization ability of the fine-tuned language processing model on new samples in the podcast scenario, ensuring that only models that meet the performance standards can be used in the subsequent initial text processing process, which is beneficial to ensuring the quality of text generation.

[0148] In an exemplary embodiment, after step S202, in which target semantic words are deleted from the initial podcast text using a language processing model fine-tuned for the podcast scenario to obtain sample podcast text for the podcast speech generation model to be trained, the method further includes: verifying the sample podcast text to obtain the verification result corresponding to the sample podcast text; and if the verification result indicates that the sample podcast text fails verification, fine-tuning the language processing model again to obtain a further fine-tuned language processing model.

[0149] The verification result refers to the judgment conclusion obtained by verifying the sample podcast text, including verification passing and verification failing.

[0150] The term "further fine-tuning" refers to the targeted iterative optimization operation performed on the fine-tuned language processing model.

[0151] For example, the server validates the sample podcast text using regular expressions to obtain the validation result corresponding to the sample podcast text. If the validation result indicates that the sample podcast text fails validation, the server obtains the historical model fine-tuning coefficients of the fine-tuned language processing model and uses these historical model fine-tuning coefficients as the current model fine-tuning coefficients of the fine-tuned language processing model. Based on the current model fine-tuning coefficients, the server further fine-tunes the fine-tuned language processing model to obtain the further fine-tuned language processing model.

[0152] For example, a combination of rule-based validation and manual sampling can be used to verify text quality:

[0153] 1. Rule Validation: Use regular expressions to detect whether high-frequency meaningless semantic words such as "um" and "ah" still exist in the optimized podcast text. If they do, re-enter the model for optimization.

[0154] 2. Manual sampling: Randomly select 1% of the optimized podcast text and have the annotation personnel verify the integrity of the core semantics and the effect of removing meaningless semantic words. The pass rate must be ≥98%; otherwise, adjust the parameters of the large language model and re-optimize.
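
The rule-based check and the 1% sampling step can be sketched together. The filler pattern below only covers "um" and "ah" (with repeated final letters) as an illustrative assumption; a real word list would depend on the language and corpus. The function names and the `seed` parameter are hypothetical.

```python
import random
import re

# Whole-word match, so words such as "drummer" or "ahead" are not flagged.
FILLER_PATTERN = re.compile(r"\b(?:um+|ah+)\b", re.IGNORECASE)

def find_fillers(text: str):
    """Return the high-frequency filler words still present in the text."""
    return [m.group(0) for m in FILLER_PATTERN.finditer(text)]

def rule_check(text: str) -> bool:
    """True when no listed filler word remains in the optimized text."""
    return not find_fillers(text)

def sample_for_review(texts, rate=0.01, seed=None):
    """Randomly pick about `rate` of the texts (at least one) for manual review."""
    if not texts:
        return []
    k = min(len(texts), max(1, round(len(texts) * rate)))
    return random.Random(seed).sample(texts, k)
```

Texts for which `rule_check` returns `False` would be sent back through the model, and the sampled texts would go to annotators for the pass-rate check.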

[0155] In this embodiment, by performing quality checks on the sample podcast text, problems such as missing target semantic words, missing core semantics, and accidental deletion of spoken language components in the output text of the fine-tuned language processing model can be detected in a timely manner. At the same time, when the check fails, the model is triggered to fine-tune again, which prevents low-quality sample text from flowing into the training stage of the speech generation model, and is conducive to improving the training effect and stability of the subsequent speech generation model.

[0156] In one exemplary embodiment, as shown in Figure 4, a speech generation method is provided. Taking the application of this method to a server as an example, the specific steps include:

[0157] Step S401: Obtain the podcast text to be analyzed in the podcast scenario.

[0158] Step S402: Input the podcast text to be analyzed into the trained podcast speech generation model to obtain the target podcast audio corresponding to the podcast text to be analyzed; the trained podcast speech generation model is trained by the speech generation model training method.

[0159] Among them, the podcast text to be analyzed refers to text that meets the requirements of the podcast scenario (such as conforming to the requirements of the podcast's conversational style).

[0160] The target podcast audio refers to the podcast audio output by the trained podcast speech generation model that matches the text to be analyzed.

[0161] For example, the server obtains the podcast text to be analyzed in a podcast context and extracts the text feature vector of the podcast text to be analyzed; then, it obtains the scene feature vector corresponding to the podcast context and fuses the text feature vector of the podcast text to be analyzed and the scene feature vector corresponding to the podcast context to obtain the fused feature vector corresponding to the podcast text to be analyzed; then, it inputs the fused feature vector corresponding to the podcast text to be analyzed into the trained podcast speech generation model to obtain the prediction probability of the podcast text to be analyzed under each preset podcast audio; then, from each preset podcast audio, the preset podcast audio with the highest prediction probability is selected as the target podcast audio corresponding to the podcast text to be analyzed.
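
The final selection step, choosing the preset podcast audio with the highest prediction probability, reduces to an argmax. A minimal sketch, assuming the model's output has been collected into a dictionary keyed by hypothetical preset audio identifiers:

```python
def select_target_audio(probabilities: dict) -> str:
    """Pick the preset podcast audio with the highest predicted probability.

    probabilities maps a preset audio identifier (hypothetical) to the
    probability predicted for the podcast text to be analyzed.
    """
    if not probabilities:
        raise ValueError("no preset podcast audio to choose from")
    return max(probabilities, key=probabilities.get)
```

For instance, given probabilities of 0.2, 0.7, and 0.1 for three presets, the preset scored 0.7 is returned as the target podcast audio.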

[0162] In this embodiment, by inputting pre-optimized podcast text that fits the conversational style of podcasts, and combining it with a podcast speech generation model trained specifically for podcast scenarios, it is possible to directly generate target podcast audio with accurate semantics and a speech style highly adapted to the podcast scenario. This helps to improve the generation efficiency and quality stability of podcast audio content, and provides efficient and reliable technical support for the large-scale production of podcast content.

[0163] In one exemplary embodiment, as shown in Figure 5, another method for training a speech generation model is provided. Taking the application of this method to a server as an example, the specific steps include:

[0164] Step S501: The sample podcast audio in the podcast scenario is segmented to obtain multiple sub-sample podcast audio.

[0165] Step S502: Input each subsample podcast audio into the speech recognition model to obtain the target recognition text corresponding to each subsample podcast audio.

[0166] Step S503: Based on the target recognition text corresponding to each subsample podcast audio, obtain the initial podcast text.

[0167] Step S504: Using a language processing model fine-tuned for the podcast scenario, target semantic words are removed from the initial podcast text to obtain sample podcast text for the podcast speech generation model to be trained; target semantic words are used to represent words in the initial podcast text that lack actual semantic information.

[0168] Step S505: The text feature vector corresponding to the sample podcast text and the scene feature vector corresponding to the podcast scene are fused to obtain the fused feature vector corresponding to the sample podcast text.

[0169] Step S506: Input the fused feature vector into the podcast speech generation model to be trained to obtain the predicted probability of the sample podcast text under each candidate podcast audio.

[0170] Step S507: Based on the difference between the sample podcast audio and the candidate podcast audio with the highest predicted probability among the candidate podcast audios, iteratively train the podcast speech generation model to be trained to obtain the trained podcast speech generation model.
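Steps S505 to S507 can be illustrated with a small training loop. The sketch below is written under stated assumptions: the model is reduced to a linear softmax classifier over candidate podcast audios, and the "difference" is measured with cross-entropy against the index of the sample podcast audio. The application does not name a loss function or model architecture, so `train_step`, `weights`, and `lr` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(
    weights: List[List[float]],
    fused: Sequence[float],
    target_index: int,
    lr: float = 0.1,
) -> float:
    """Run one gradient-descent step and return the loss before the update.

    weights:      one row of coefficients per candidate podcast audio (updated in place)
    fused:        fused feature vector of the sample podcast text
    target_index: index of the candidate that matches the sample podcast audio
    """
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    probs = softmax(scores)
    loss = -math.log(probs[target_index])  # cross-entropy (assumed loss)
    for k, row in enumerate(weights):
        # Gradient of cross-entropy w.r.t. score k is (p_k - 1) for the target, p_k otherwise.
        grad_scale = probs[k] - (1.0 if k == target_index else 0.0)
        for j in range(len(row)):
            row[j] -= lr * grad_scale * fused[j]
    return loss


def predict(weights: Sequence[Sequence[float]], fused: Sequence[float]) -> int:
    """Return the index of the candidate with the highest predicted probability."""
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)
```

Calling `train_step` repeatedly on the same sample drives the loss down, which corresponds to the iterative training of Step S507 in this simplified setting.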

[0171] In the above speech generation model training method, the initial podcast text corresponding to the sample podcast audio in the podcast scenario is obtained, and the language processing model fine-tuned for the podcast scenario is used to accurately remove words lacking actual semantic information from the initial podcast text, thereby obtaining sample podcast text that retains the core semantics of the text. Then, based on the sample podcast text and the sample podcast audio corresponding to the sample podcast text, the podcast speech generation model to be trained can be iteratively trained more accurately, so that the trained podcast speech generation model learns more targeted and customized colloquial expressions, which helps improve the training quality of the speech generation model.
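To make the idea of removing words that lack actual semantic information concrete, the following sketch uses a fixed filler-word list and a regular expression. This is a deliberately simple rule-based stand-in, not the fine-tuned language processing model described in the application; the `FILLERS` list and the function name `remove_fillers` are hypothetical.

```python
import re

# Hypothetical filler words; a fine-tuned language model would decide this from context.
FILLERS = ("um", "uh", "er")

# Match a filler as a whole word, plus any whitespace that follows it.
_FILLER_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in FILLERS) + r")\b\s*",
    re.IGNORECASE,
)


def remove_fillers(text: str) -> str:
    """Delete filler words and collapse the leftover whitespace."""
    cleaned = _FILLER_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

A word list cannot tell when a filler-like word carries meaning, which is one reason a context-aware model fine-tuned for the scenario may be preferred over rules of this kind.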

[0172] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0173] Based on the same inventive concept, this application also provides a speech generation model training apparatus for implementing the speech generation model training method described above. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, the specific limitations in one or more speech generation model training apparatus embodiments provided below can be found in the limitations of the speech generation model training method described above, and will not be repeated here.

[0174] In one exemplary embodiment, as shown in Figure 6, a speech generation model training device is provided, including: a sample acquisition module 601, a sample processing module 602, and a model training module 603, wherein:

[0175] The sample acquisition module 601 is used to acquire the initial podcast text corresponding to the sample podcast audio in the podcast scenario.

[0176] The sample processing module 602 is used to remove target semantic words from the initial podcast text by using a language processing model fine-tuned for the podcast scenario, so as to obtain sample podcast text for the podcast speech generation model to be trained; the target semantic words are used to represent words in the initial podcast text that lack actual semantic information.

[0177] The model training module 603 is used to iteratively train the podcast speech generation model to be trained based on the sample podcast text and the sample podcast audio corresponding to the sample podcast text, so as to obtain the trained podcast speech generation model.

[0178] In an exemplary embodiment, the sample acquisition module 601 is further configured to segment the sample podcast audio in the podcast scenario to obtain multiple subsample podcast audios; input each subsample podcast audio into a speech recognition model to obtain the target recognition text corresponding to each subsample podcast audio; and obtain the initial podcast text based on the target recognition text corresponding to each subsample podcast audio.

[0179] In an exemplary embodiment, the sample acquisition module 601 is further configured to: for any subsample podcast audio among the multiple subsample podcast audios, input the subsample podcast audio into the speech recognition model to obtain the initial recognition text corresponding to the subsample podcast audio; obtain the text length of the initial recognition text and the audio duration of the subsample podcast audio; if the ratio of the text length to the audio duration is greater than or equal to a preset ratio required by the podcast scenario, use the initial recognition text as the target recognition text corresponding to the subsample podcast audio; and if the ratio is less than the preset ratio, return to the step of inputting the subsample podcast audio into the speech recognition model to obtain the initial recognition text, repeating until the ratio of the text length of the initial recognition text to the audio duration is greater than or equal to the preset ratio required by the podcast scenario.
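The ratio check described above can be sketched as a retry loop. The example below makes several assumptions that are not in the application: text length is measured in characters, the speech recognition model is passed in as a callable `recognize`, and a `max_attempts` cap is added so the loop cannot run forever when the recognizer never reaches the preset ratio.

```python
from typing import Callable, Optional


def recognize_with_ratio_check(
    audio: bytes,
    duration_s: float,
    recognize: Callable[[bytes], str],
    min_chars_per_second: float,
    max_attempts: int = 3,
) -> Optional[str]:
    """Re-run recognition until the text-length/duration ratio reaches the preset ratio.

    Returns the accepted recognition text, or None if every attempt falls short.
    `max_attempts` is a safety cap added for this sketch.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    for _ in range(max_attempts):
        text = recognize(audio)  # initial recognition text for this attempt
        if len(text) / duration_s >= min_chars_per_second:
            return text  # accepted as the target recognition text
    return None
```

A low ratio suggests that the recognizer dropped part of the speech, so rejecting such output keeps truncated transcripts out of the initial podcast text.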

[0180] In an exemplary embodiment, the model training module 603 is further configured to fuse the text feature vector corresponding to the sample podcast text and the scene feature vector corresponding to the podcast scenario to obtain a fused feature vector corresponding to the sample podcast text; input the fused feature vector into the podcast speech generation model to be trained to obtain the predicted probability of the sample podcast text under each candidate podcast audio; and iteratively train the podcast speech generation model to be trained based on the difference between the sample podcast audio and the candidate podcast audio with the highest predicted probability, to obtain the trained podcast speech generation model.

[0181] In an exemplary embodiment, the speech generation model training device further includes a model fine-tuning module, which is used to obtain a first sample podcast text in a podcast scenario and a first target podcast text corresponding to the first sample podcast text; input the first sample podcast text into the language processing model to be adjusted to obtain a first processed podcast text corresponding to the first sample podcast text; and fine-tune the language processing model to be adjusted according to the difference between the first processed podcast text and the first target podcast text to obtain a language processing model fine-tuned for the podcast scenario.

[0182] In an exemplary embodiment, the speech generation model training device further includes a model verification module, configured to acquire a second sample podcast text in a podcast scenario and a second target podcast text corresponding to the second sample podcast text; input the second sample podcast text into the fine-tuned language processing model to obtain a second processed podcast text corresponding to the second sample podcast text; and obtain a verification result of the fine-tuned language processing model based on the second target podcast text and the second processed podcast text. The sample processing module 602 is further configured to, if the verification result indicates that the fine-tuned language processing model has passed verification, remove target semantic words from the initial podcast text using the language processing model fine-tuned for the podcast scenario to obtain sample podcast text for the podcast speech generation model to be trained.
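One possible way to obtain such a verification result is to compare each second processed podcast text with its second target podcast text and require a minimum average similarity. The sketch below assumes a token-level similarity computed with Python's standard `difflib.SequenceMatcher` and a hypothetical `threshold` of 0.9; the application does not specify the comparison metric or the pass criterion.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence, Tuple


def verify_model(
    process: Callable[[str], str],
    pairs: Sequence[Tuple[str, str]],
    threshold: float = 0.9,
) -> Tuple[bool, float]:
    """Check a text-processing model against (sample text, target text) pairs.

    process:   the fine-tuned language processing model, wrapped as a callable
    pairs:     second sample podcast texts with their second target podcast texts
    threshold: assumed minimum mean similarity needed to pass verification
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    scores = [
        SequenceMatcher(None, process(sample).split(), target.split()).ratio()
        for sample, target in pairs
    ]
    mean_score = sum(scores) / len(scores)
    return mean_score >= threshold, mean_score
```

Only when the first element of the returned tuple is true would the model be used to clean the initial podcast text in this sketch.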

[0183] In an exemplary embodiment, the speech generation model training device further includes a model processing module, configured to verify the sample podcast text to obtain a verification result corresponding to the sample podcast text; and, if the verification result indicates that the sample podcast text fails verification, fine-tune the fine-tuned language processing model again to obtain a re-fine-tuned language processing model.

[0184] Each module in the aforementioned speech generation model training device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.

[0185] Based on the same inventive concept, this application also provides a speech generation apparatus for implementing the speech generation method described above. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, the specific limitations in one or more speech generation apparatus embodiments provided below can be found in the limitations of the speech generation method described above, and will not be repeated here.

[0186] In one exemplary embodiment, as shown in Figure 7, a speech generation device is provided, including: a text acquisition module 701 and a speech generation module 702, wherein:

[0187] The text acquisition module 701 is used to acquire the podcast text to be analyzed in a podcast scenario.

[0188] The speech generation module 702 is used to input the podcast text to be analyzed into the trained podcast speech generation model corresponding to the podcast scenario to obtain the target podcast audio corresponding to the podcast text to be analyzed; the trained podcast speech generation model is obtained by training with the speech generation model training method described above.

[0189] Each module in the aforementioned speech generation device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.

[0190] In one exemplary embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 8. This computer device includes a processor, memory, input/output (I/O) interfaces, and a communication interface. The processor, memory, and I/O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides the environment for running the operating system and computer programs stored in the non-volatile storage media. The database stores initial text, sample text, and other data. The I/O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communication with external terminals via a network connection. When executed by the processor, the computer program implements a speech generation model training method.

[0191] Those skilled in the art will understand that the structure shown in Figure 8 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0192] In one exemplary embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above-described method embodiments.

[0193] In one exemplary embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above-described method embodiments.

[0194] In one exemplary embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above-described method embodiments.

[0195] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments described above. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.

[0196] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0197] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A method for training a speech generation model, characterized in that the method includes: obtaining initial podcast text corresponding to sample podcast audio in a podcast scenario; removing target semantic words from the initial podcast text using a language processing model fine-tuned for the podcast scenario to obtain sample podcast text for a podcast speech generation model to be trained, wherein the target semantic words represent words in the initial podcast text that lack actual semantic information; and iteratively training the podcast speech generation model to be trained based on the sample podcast text and the sample podcast audio corresponding to the sample podcast text to obtain a trained podcast speech generation model.

2. The method according to claim 1, characterized in that the step of obtaining the initial podcast text corresponding to the sample podcast audio in the podcast scenario includes: segmenting the sample podcast audio in the podcast scenario to obtain multiple subsample podcast audios; inputting each subsample podcast audio into a speech recognition model to obtain the target recognition text corresponding to each subsample podcast audio; and obtaining the initial podcast text based on the target recognition text corresponding to each subsample podcast audio.

3. The method according to claim 2, characterized in that the step of inputting each subsample podcast audio into the speech recognition model to obtain the target recognition text corresponding to each subsample podcast audio includes: for any subsample podcast audio among the multiple subsample podcast audios, inputting the subsample podcast audio into the speech recognition model to obtain the initial recognition text corresponding to the subsample podcast audio; obtaining the text length of the initial recognition text corresponding to the subsample podcast audio and the audio duration corresponding to the subsample podcast audio; if the ratio of the text length of the initial recognition text to the audio duration is greater than or equal to the preset ratio required by the podcast scenario, using the initial recognition text corresponding to the subsample podcast audio as the target recognition text corresponding to the subsample podcast audio; and if the ratio of the text length of the initial recognition text to the audio duration is less than the preset ratio required by the podcast scenario, returning to the step of inputting the subsample podcast audio into the speech recognition model to obtain the initial recognition text corresponding to the subsample podcast audio, until the ratio of the text length of the initial recognition text to the audio duration is greater than or equal to the preset ratio required by the podcast scenario.

4. The method according to claim 1, characterized in that the step of iteratively training the podcast speech generation model to be trained based on the sample podcast text and the corresponding sample podcast audio to obtain the trained podcast speech generation model includes: fusing the text feature vector corresponding to the sample podcast text and the scene feature vector corresponding to the podcast scenario to obtain the fused feature vector corresponding to the sample podcast text; inputting the fused feature vector into the podcast speech generation model to be trained to obtain the predicted probability of the sample podcast text under each candidate podcast audio; and iteratively training the podcast speech generation model to be trained based on the difference between the sample podcast audio and the candidate podcast audio with the highest predicted probability among the candidate podcast audios, to obtain the trained podcast speech generation model.

5. The method according to claim 1, characterized in that the language processing model fine-tuned for the podcast scenario is obtained in the following manner: obtaining a first sample podcast text in the podcast scenario and a first target podcast text corresponding to the first sample podcast text; inputting the first sample podcast text into the language processing model to be adjusted to obtain a first processed podcast text corresponding to the first sample podcast text; and fine-tuning the language processing model to be adjusted based on the difference between the first processed podcast text and the first target podcast text to obtain the language processing model fine-tuned for the podcast scenario.

6. The method according to claim 1, characterized in that, before removing target semantic words from the initial podcast text using the language processing model fine-tuned for the podcast scenario to obtain the sample podcast text for the podcast speech generation model to be trained, the method further includes: obtaining a second sample podcast text in the podcast scenario and a second target podcast text corresponding to the second sample podcast text; inputting the second sample podcast text into the fine-tuned language processing model to obtain a second processed podcast text corresponding to the second sample podcast text; and obtaining a verification result of the fine-tuned language processing model based on the second target podcast text and the second processed podcast text; and the step of removing target semantic words from the initial podcast text using the language processing model fine-tuned for the podcast scenario to obtain the sample podcast text for the podcast speech generation model to be trained includes: if the verification result indicates that the fine-tuned language processing model has passed verification, removing the target semantic words from the initial podcast text using the language processing model fine-tuned for the podcast scenario to obtain the sample podcast text for the podcast speech generation model to be trained.

7. The method according to claim 1, characterized in that, after removing target semantic words from the initial podcast text using the language processing model fine-tuned for the podcast scenario to obtain the sample podcast text for the podcast speech generation model to be trained, the method further includes: verifying the sample podcast text to obtain a verification result corresponding to the sample podcast text; and if the verification result indicates that the sample podcast text fails verification, fine-tuning the fine-tuned language processing model again to obtain a re-fine-tuned language processing model.

8. A speech generation method, characterized in that the method includes: obtaining podcast text to be analyzed in a podcast scenario; and inputting the podcast text to be analyzed into a trained podcast speech generation model to obtain target podcast audio corresponding to the podcast text to be analyzed, wherein the trained podcast speech generation model is obtained by training with the method according to any one of claims 1 to 7.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 8.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8.

11. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8.