Method and apparatus for constructing text-to-speech model, electronic device, readable medium, and program product

By reusing open-source quantizers and autoregressive models, combined with hybrid expert models, the high cost and lack of human-likeness in multi-dialect speech generation were solved, achieving efficient and natural speech generation.

WO2026138018A1 · PCT designated stage · Publication Date: 2026-07-02 · CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD
Filing Date: 2025-09-19
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

Existing technologies require high-quality data annotation for multi-dialect speech generation and necessitate the development of dedicated G2P modules for each dialect, resulting in high development and maintenance costs and insufficient human-likeness in the generated speech.

Method used

We use the FunAudioLLM open-source quantizer to extract the semantic discrete features of the training speech, and combine it with an autoregressive speech model and an optimal transport conditional flow matching model to construct a hybrid expert model. Through a gating mechanism, we route the input of different dialects to the most suitable expert model to generate speech.

Benefits of technology

It reduces development and deployment costs, simplifies system architecture, generates more natural and human-like speech, reduces reliance on linguistic knowledge and pronunciation dictionaries, and improves the naturalness and coherence of speech generation.

✦ Generated by Eureka AI based on patent content.

Abstract

A method and apparatus for constructing a text-to-speech model, an electronic device, a readable medium, and a program product. The method comprises: inputting preset training speech into a preset vector quantizer, so as to obtain a training semantic discrete feature of the training speech, the training semantic discrete feature comprising a language style of the training speech (101); acquiring training text corresponding to the training speech, and using the training text and the training semantic discrete feature to train a preset autoregressive speech model, so as to obtain a semantic discrete feature generation model (102); acquiring a training Mel-frequency spectrogram corresponding to the training semantic discrete feature (103); using the training semantic discrete feature and the training Mel-frequency spectrogram to train a preset optimal transport conditional flow matching model, so as to obtain a Mel-frequency spectrogram generation model (104); and on the basis of the Mel-frequency spectrogram generation model and the semantic discrete feature generation model, constructing a text-to-speech model (105).

Description

Methods, devices, electronic equipment, readable media, and program products for constructing speech generation models

[0001] Related applications

[0002] This application claims priority to the Chinese patent application filed on December 23, 2024, with application number 2024119109673, entitled "Method, Apparatus, Electronic Device and Readable Medium for Constructing a Speech Generation Model", the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This application relates to the field of information technology, and in particular to a method for constructing a speech generation model, an apparatus for constructing a speech generation model, an electronic device, a computer-readable medium, and a computer program product.

Background Technology

[0004] Text-to-speech (TTS) technology converts text into speech. In a typical TTS scheme, the input text is first preprocessed, including word segmentation, part-of-speech tagging, and prosody prediction. Then, a grapheme-to-phoneme (G2P) module converts the characters or letters (graphemes) in the text into the corresponding phonemes, which serve as the input to the TTS system.

Summary of the Invention

[0005] This application provides a method, apparatus, electronic device, computer-readable storage medium, and computer program product for constructing a speech generation model.

[0006] This application discloses a method for constructing a speech generation model, the method comprising:

[0007] The preset training speech is input into a preset vector quantizer to obtain the training semantic discrete features of the training speech; the training semantic discrete features include the language style of the training speech;

[0008] Obtain the training text corresponding to the training speech, and use the training text and the training semantic discrete features to train a preset autoregressive speech model to obtain a semantic discrete feature generation model.

[0009] Obtain the training Mel spectrum corresponding to the training semantic discrete features;

[0010] A preset optimal transport conditional flow matching model is trained using the training semantic discrete features and the training Mel spectrogram to obtain a Mel spectrogram generation model;

[0011] A speech generation model is constructed based on the Mel spectrogram generation model and the semantic discrete feature generation model.

[0012] Optionally, constructing a speech generation model based on the Mel spectrogram generation model and the semantic discrete feature generation model includes:

[0013] A model to be processed is constructed using the semantic discrete feature generation model;

[0014] Obtain training text to be processed in the target language style and the semantic discrete features of the speech corresponding to the training text to be processed;

[0015] The model to be processed is fine-tuned using the training text to be processed and the semantic discrete features of the corresponding speech, so as to obtain a hybrid expert model;

[0016] The speech generation model is constructed based on the hybrid expert model and the Mel spectrogram generation model.

[0017] Optionally, the hybrid expert model includes at least one expert model.

[0018] Optionally, the hybrid expert model further includes a language router; the method includes:

[0019] The preset text is input into the hybrid expert model to obtain the probability distribution of the at least one expert model output by the language router;

[0020] The expert model is used to extract the discrete semantic features of the text;

[0021] Based on the probability distribution, the discrete semantic features of the text are fused to obtain the target discrete semantic features.

[0022] Optionally, the method includes:

[0023] The target semantic discrete features are input into the Mel spectrogram generation model to obtain the Mel spectrogram of the text;

[0024] Based on the Mel spectrogram, the audio waveform of the text is obtained using a preset vocoder;

[0025] Based on the audio waveform and the text, obtain the speech corresponding to the text.

[0026] Optionally, the step of training a preset autoregressive speech model using the training text and the training semantic discrete features to obtain a semantic discrete feature generation model includes:

[0027] During the training process of the autoregressive speech model, the training semantic discrete features are used as the inference output of the autoregressive speech model.

[0028] Optionally, the step of fusing the discrete semantic features of the text based on the probability distribution to obtain the target discrete semantic features includes:

[0029] Based on the probability distribution, the weights of the at least one expert model are determined;

[0030] Based on the weights, the discrete semantic features of the text are fused to obtain the target discrete semantic features.

[0031] Optionally, the training text to be processed in the target language style is a dialect text.

[0032] Optionally, the method further includes: generating speech based on the input text using the constructed speech generation model.

[0033] This application also discloses a speech generation model construction apparatus, the apparatus comprising:

[0034] The training semantic discrete feature acquisition module is used to input a preset training speech into a preset vector quantizer to obtain the training semantic discrete features of the training speech; the training semantic discrete features include the language style of the training speech;

[0035] The autoregressive speech model training module is used to obtain the training text corresponding to the training speech, and use the training text and the training semantic discrete features to train a preset autoregressive speech model to obtain a semantic discrete feature generation model.

[0036] The training Mel spectrogram acquisition module is used to acquire the training Mel spectrogram corresponding to the training semantic discrete features;

[0037] The optimal transport conditional flow matching model training module is used to train a preset optimal transport conditional flow matching model using the training semantic discrete features and the training Mel spectrogram, and obtain a Mel spectrogram generation model.

[0038] The speech generation model construction module is used to construct a speech generation model based on the Mel spectrogram generation model and the semantic discrete feature generation model.

[0039] Optionally, the speech generation model building module includes:

[0040] The model construction submodule is used to construct the model to be processed using the semantic discrete feature generation model.

[0041] The training text acquisition submodule is used to acquire the training text to be processed in the target language style and the semantic discrete features of the speech corresponding to the training text to be processed.

[0042] The hybrid expert model acquisition submodule is used to fine-tune the model to be processed using the training text to be processed and the semantic discrete features of the corresponding speech, so as to obtain the hybrid expert model;

[0043] The speech generation model construction submodule is used to construct the speech generation model based on the hybrid expert model and the Mel spectrogram generation model.

[0044] Optionally, the hybrid expert model includes at least one expert model.

[0045] Optionally, the hybrid expert model further includes a language router; the apparatus includes:

[0046] The probability distribution acquisition module is used to input preset text into the hybrid expert model and obtain the probability distribution of the at least one expert model output by the language router;

[0047] An extraction module is used to extract the semantic discrete features of the text using the expert model;

[0048] The fusion module is used to fuse the discrete semantic features of the text based on the probability distribution to obtain the target discrete semantic features.

[0049] Optionally, the device includes:

[0050] The Mel spectrogram acquisition module is used to input the target semantic discrete features into the Mel spectrogram generation model to obtain the Mel spectrogram of the text;

[0051] An audio waveform acquisition module is used to acquire the audio waveform of the text based on the Mel spectrogram using a preset vocoder.

[0052] The speech acquisition module is used to acquire the speech corresponding to the text based on the audio waveform and the text.

[0053] Optionally, the autoregressive speech model training module includes:

[0054] The inference output submodule is used to take the training semantic discrete features as the inference output of the autoregressive speech model during the training process of the autoregressive speech model.

[0055] Optionally, the fusion module includes:

[0056] The weight determination submodule is used to determine the weights of the at least one expert model based on the probability distribution;

[0057] The fusion submodule is used to fuse the semantic discrete features of the text based on the weights to obtain the target semantic discrete features.

[0058] Optionally, the training text to be processed in the target language style is a dialect text.

[0059] Optionally, the device further includes a speech generation module for generating speech based on the input text using the constructed speech generation model.

[0060] This application also discloses an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;

[0061] The memory is used to store computer programs;

[0062] When the processor executes a program stored in the memory, it implements the method described in the embodiments of this application.

[0063] This application also discloses one or more computer-readable media storing instructions that, when executed by one or more processors, cause the processors to perform the methods described in this application.

[0064] This application also discloses a computer program product, including a computer program, wherein when the computer program is executed by a processor, it implements the steps of the speech generation model construction method described above.

Attached Figure Description

[0065] Figure 1 is a flowchart of the steps of a speech generation model construction method provided in an embodiment of this application;

[0066] Figure 2 is a schematic diagram of a shared layer and an expert layer provided in an embodiment of this application;

[0067] Figure 3 is a schematic diagram of the relationship between the gating network and the dialect expert provided in the embodiments of this application;

[0068] Figure 4 is a flowchart of the steps of another speech generation model construction method provided in the embodiments of this application;

[0069] Figure 5 is a structural block diagram of a speech generation model construction device provided in an embodiment of this application;

[0070] Figure 6 is a block diagram of an electronic device provided in an embodiment of this application;

[0071] Figure 7 is a schematic diagram of a computer-readable medium provided in an embodiment of this application.

Detailed Implementation

[0072] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0073] In related technologies, multi-dialect speech generation requires sophisticated data annotation for each dialect. Furthermore, the G2P module needs to adapt to the pronunciation rules of different dialects, meaning a dedicated G2P module needs to be developed for each dialect. This typically implies training separate G2P and TTS models for each dialect, resulting in high development and maintenance costs. In addition, speech generated using speech generation technology often lacks a human-like quality.

[0074] This application aims to address the following issues in multi-dialect speech generation: the demanding data annotation requirements for multi-dialect speech; the need to develop a dedicated G2P module for each dialect; and the lack of human-like quality in speech produced by speech generation technology.

[0075] Referring to Figure 1, a flowchart of the steps for constructing a speech generation model according to an embodiment of this application is shown, which may specifically include the following steps:

[0076] Step 101: Input the preset training speech into the preset vector quantizer to obtain the training semantic discrete features of the training speech; the training semantic discrete features include the language style of the training speech;

[0077] In this embodiment, a supervised semantic speech quantizer is used to extract training semantic discrete features from large-scale Mandarin and English speech signals. It should be noted that the training semantic discrete features are a set of discrete symbols or values forming a discrete sequence rich in style and semantics, which carries information such as the speaker's pronunciation habits.

[0078] In this embodiment, the open-source quantizer from the FunAudioLLM project is reused. This quantizer is based on the pre-trained speech recognition model SenseVoice-Large, with a vector quantizer added after the first six layers of its encoder.

[0079] The vector quantizer is a "dictionary" (codebook) containing 4096 entries that maps input speech signals to discrete semantic features. Therefore, in this embodiment, inputting a preset training speech into the vector quantizer yields the training semantic discrete features of the training speech. These features include the language style of the training speech: speakers exhibit specific pronunciation habits when reading the text corresponding to speech in this style, and speech generated with these pronunciation habits has a human-like quality.
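The dictionary lookup described above can be illustrated with a minimal sketch. It assumes nearest-neighbour search over a codebook of 4096 entries; the feature dimension of 16, the random codebook, and the function name quantize are hypothetical choices for illustration and are not taken from FunAudioLLM.

```python
import numpy as np

def quantize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    dist = (
        (frames ** 2).sum(axis=1, keepdims=True)
        - 2.0 * frames @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return dist.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4096, 16))   # 4096 entries; dimension 16 is hypothetical
# Three encoder frames that lie very close to known codebook entries
frames = codebook[[5, 4095, 0]] + 1e-3 * rng.normal(size=(3, 16))
tokens = quantize(frames, codebook)      # one discrete token per frame
```

Each returned index plays the role of one semantic discrete feature in the sequences used by the later steps.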

[0080] Step 102: Obtain the training text corresponding to the training speech, and use the training text and the training semantic discrete features to train a preset autoregressive speech model to obtain a semantic discrete feature generation model.

[0081] In this embodiment of the application, the text corresponding to the large-scale Mandarin and English speech signals can be used as training text, and the training semantic discrete features of the training speech of the large-scale Mandarin and English speech signals are obtained by using a vector quantizer.

[0082] In this embodiment, training text corresponding to the training speech is obtained, and a preset autoregressive speech model is trained using the training text and training semantic discrete features to obtain a semantic discrete feature generation model. It should be noted that the autoregressive speech model is a large model.

[0083] In some embodiments of this application, training a preset autoregressive speech model using the training text and the training semantic discrete features to obtain a semantic discrete feature generation model includes:

[0084] During the training process of the autoregressive speech model, the training semantic discrete features are used as the inference output of the autoregressive speech model.

[0085] In the embodiments of this application, the autoregressive speech model is trained using the teacher-forcing paradigm. That is, during training, the ground-truth training semantic discrete features are used in place of the model's own inference output when predicting each subsequent token, which reduces error accumulation during training and accelerates the convergence of the model.

[0086] Specifically, the training semantic discrete features and training text are merged into a unified input sequence, designed as: [START,text_id_seq,semantic_token_seq,END]

[0087] Here, START marks the start of the sequence, END marks the end of the sequence, text_id_seq is the training text, and semantic_token_seq is the training semantic discrete features. The numerical values used for these four representations must not overlap. The loss function for training the autoregressive speech model is designed as follows:

L_{LM} = -\frac{1}{L+1} \sum_{l=1}^{L+1} \log q(\mu_l)

[0088] Here, L_{LM} is the loss of the autoregressive speech model; 1/(L+1) is a normalization factor that keeps the loss value from growing with the number of summed terms; the summation runs over all terms l from 1 to L+1; and \log q(\mu_l) is the logarithm of q(\mu_l).
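The sequence layout and the loss described above can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: the vocabulary sizes, the id offsets, and the reading of the L+1 summed terms as L semantic tokens plus END are all assumptions, and the uniform logits merely stand in for real model outputs.

```python
import numpy as np

TEXT_VOCAB = 1000                       # hypothetical text vocabulary size
SEM_VOCAB = 4096                        # one id per codebook entry
START, END = 0, 1                       # reserved special ids
TEXT_OFFSET = 2
SEM_OFFSET = TEXT_OFFSET + TEXT_VOCAB   # keeps the four id ranges disjoint
VOCAB = SEM_OFFSET + SEM_VOCAB

def build_sequence(text_ids, semantic_tokens):
    """[START, text_id_seq, semantic_token_seq, END] with non-overlapping ids."""
    return ([START]
            + [TEXT_OFFSET + t for t in text_ids]
            + [SEM_OFFSET + s for s in semantic_tokens]
            + [END])

def lm_loss(logits, targets):
    """Mean negative log-likelihood of the target tokens."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_q = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_q[np.arange(len(targets)), targets].mean()

seq = build_sequence([3, 7], [10, 4095, 2])
targets = np.array(seq[-4:])                  # 3 semantic tokens plus END: L + 1 terms
logits = np.zeros((len(targets), VOCAB))      # stand-in for model outputs (uniform)
loss = lm_loss(logits, targets)
```

With uniform logits the loss equals log(VOCAB), which is a convenient sanity check before plugging in a real model.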

[0089] During the inference phase, the training text and training semantic discrete features are merged, and the training semantic discrete features are regarded as pre-generated parts. Through this input sequence, the recursive autoregressive speech model gradually predicts the semantic discrete representation corresponding to the text to be synthesized until the "end of sequence" token is encountered.
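The token-by-token prediction described above can be sketched as a simple decoding loop. The function names generate and toy_step, the token values, and the max_len safeguard are hypothetical; toy_step is only a stand-in for the trained autoregressive speech model.

```python
from typing import Callable, List

def generate(step: Callable[[List[int]], int], prefix: List[int],
             end_id: int, max_len: int = 1000) -> List[int]:
    """Append predicted tokens to the prefix until end_id or max_len is reached."""
    seq = list(prefix)
    out: List[int] = []
    for _ in range(max_len):
        nxt = step(seq)          # the model predicts the next token from the history
        if nxt == end_id:        # "end of sequence" token stops generation
            break
        seq.append(nxt)
        out.append(nxt)
    return out

END = 1

def toy_step(history: List[int]) -> int:
    """Toy stand-in for the model: emits three tokens, then END."""
    generated = len(history) - 3          # the prefix below has length 3
    return 100 + generated if generated < 3 else END

tokens = generate(toy_step, prefix=[0, 5, 9], end_id=END)
```

The max_len bound is a defensive choice so that the loop terminates even if the model never emits END.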

[0090] Step 103: Obtain the training Mel spectrogram corresponding to the training semantic discrete features;

[0091] In this embodiment of the application, the training Mel spectrum corresponding to the training semantic discrete features can be obtained.

[0092] Step 104: Use the training semantic discrete features and the training Mel spectrogram to train a preset optimal transport conditional flow matching model to obtain a Mel spectrogram generation model;

[0093] In this embodiment, a Mel spectrogram generation model can be obtained by training a preset optimal transport conditional flow matching (OT-CFM) model using the training semantic discrete features and the training Mel spectrograms. It should be noted that audio waveforms can be synthesized from Mel spectrograms using a HiFTNet-based vocoder.

[0094] In this embodiment, an optimal transport conditional flow matching model is employed. This model uses a convolutional Transformer U-Net to determine the vector field from the prior distribution to the target distribution. The process from the prior distribution to the target distribution estimates the conditional probability P(S | X, v, S_ref), where X and v represent the speech tokens and the speaker vector, respectively, S_ref is the target distribution, and S is the prior distribution. In this embodiment, the semantic-to-speech part reuses the Flow model and the HiFt model from the open-source FunAudioLLM project. This vector field describes how to map from the prior distribution (the known probability distribution of the corresponding training Mel spectrogram obtained using training semantic discrete features) to the target distribution (the target probability distribution of the corresponding training Mel spectrogram obtained using training semantic discrete features).
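As a rough illustration of how such a vector field can be trained, the following sketch uses the commonly published OT-CFM formulation: a straight-line path between a prior sample x0 and a target x1, with a constant target velocity along that path. The Gaussian prior, the value of SIGMA_MIN, the Mel shape, and the placeholder vector_field callable (standing in for the convolutional Transformer U-Net) are assumptions, not details from this application.

```python
import numpy as np

SIGMA_MIN = 1e-4   # hypothetical small constant

def cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point on the optimal-transport path at time t and the target velocity there."""
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t

def cfm_loss(vector_field, x0, x1, t, cond):
    """Mean squared error between the predicted and the target vector field."""
    x_t, u_t = cfm_pair(x0, x1, t)
    return np.mean((vector_field(x_t, t, cond) - u_t) ** 2)

rng = np.random.default_rng(0)
mel = rng.normal(size=(80, 50))        # target Mel spectrogram: 80 bins, 50 frames
noise = rng.normal(size=mel.shape)     # sample from the assumed Gaussian prior
cond = None                            # placeholder for speech tokens / speaker vector

# An oracle field that already knows the target velocity gives zero loss.
oracle = lambda x_t, t, c: mel - (1.0 - SIGMA_MIN) * noise
loss = cfm_loss(oracle, noise, mel, t=0.3, cond=cond)
```

In training, t would be sampled per example and the placeholder replaced by a network conditioned on the semantic discrete features; at inference the learned field is integrated from t = 0 to t = 1.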

[0095] Step 105: Construct a speech generation model based on the Mel spectrogram generation model and the semantic discrete feature generation model.

[0096] In the embodiments of this application, a speech generation model can be constructed based on the Mel spectrogram generation model and the semantic discrete feature generation model.

[0097] In some embodiments of this application, constructing a speech generation model based on the Mel spectrogram generation model and the semantic discrete feature generation model includes:

[0098] A model to be processed is constructed using the semantic discrete feature generation model;

[0099] Obtain training text to be processed in the target language style and the semantic discrete features of the speech corresponding to the training text to be processed;

[0100] The model to be processed is fine-tuned using the training text to be processed and the semantic discrete features of the corresponding speech, so as to obtain a hybrid expert model;

[0101] The speech generation model is constructed based on the hybrid expert model and the Mel spectrogram generation model.

[0102] In this embodiment, a model to be processed can be constructed using the semantic discrete feature generation model. Specifically, the base GPT model consists of a stack of pre-trained Transformer layers, L layers in total. The stacked multi-layer Transformer structure in the base GPT model is divided into shared layers and expert layers. The first N layers are shared layers, and the remaining (L − N) layers are MDE (Mixture-of-Dialect-Experts) layers. For the MDE layers, the feedforward network (FFN) layer in the Transformer structure is replaced with the MDE FFN. At this point, the base GPT model is the model to be processed, which completes the construction of the model to be processed from the semantic discrete feature generation model.

[0103] Referring to Figure 2, a schematic diagram of a shared layer and an expert layer provided in an embodiment of this application is shown. The multi-layer Transformer structure stacked in the base GPT model is divided into shared layers and expert layers. The first N layers are shared layers, and the remaining (L − N) layers are MDE layers. For the MDE layers, the feedforward network (FFN) layer in the Transformer structure is replaced with the MDE FFN.
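The division into shared layers and MDE layers can be written down as a small layer plan. The function name layer_plan and the example values of 12 total layers and 8 shared layers are hypothetical; the application does not state concrete layer counts.

```python
def layer_plan(total_layers: int, shared_layers: int) -> list:
    """Label each Transformer layer as 'shared' or 'mde' (MDE FFN replaces the FFN)."""
    if not 0 <= shared_layers <= total_layers:
        raise ValueError("shared_layers must be between 0 and total_layers")
    return ["shared"] * shared_layers + ["mde"] * (total_layers - shared_layers)

plan = layer_plan(total_layers=12, shared_layers=8)   # hypothetical L = 12, N = 8
```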

[0104] It's important to note that GPT (Generative Pre-trained Transformer) is a deep learning model based on the Transformer architecture, adept at handling natural language generation tasks. Its core idea is to pre-train on large amounts of text data and predict the next word through autoregression. In the field of large-scale speech generation models, the GPT model accepts input text and generates corresponding discrete high-level semantic representations based on its contextual information.

[0105] In this embodiment of the application, training text to be processed in the target language style and the semantic discrete features of the speech corresponding to that training text are obtained.

[0106] Specifically, the training text to be processed in the target language style can be dialect text. The dialect data preparation stage is fundamental to ensuring that the model can generate high-quality, natural-sounding dialect speech. Diverse dialect speech data is obtained through a combination of web crawling and manual recording. While retaining the dialect labels, transcribers are recruited for each dialect, and detailed transcription rules are formulated to ensure the consistency and accuracy of the transcribed text. After transcription, a quality control mechanism is applied to ensure transcription quality. The dialect speech data is input into the vector quantizer to obtain the semantic discrete features of the dialect speech, i.e., the semantic discrete features of the speech corresponding to the training text to be processed.

[0107] In this embodiment, the model to be processed is fine-tuned using the training text to be processed and the semantic discrete features of the corresponding speech, resulting in a hybrid expert model. Based on the hybrid expert model and the Mel spectrogram generation model, a speech generation model can be constructed. It should be noted that MoE (Mixture of Experts) is a machine learning model architecture that decomposes a complex problem into multiple sub-problems, uses a gating mechanism as a router to partition the problem space, and directs data to specialized experts. Each sub-problem is processed by a dedicated "expert" sub-network, thereby improving the model's efficiency and performance. Hybrid expert models can be divided into two types: one implicitly partitions the problem space through a gating network optimized by a loss function, and the other explicitly partitions the space, typically using clustering techniques to identify the subspaces before training begins.

[0108] In some embodiments of this application, the hybrid expert model includes at least one expert model.

[0109] In this embodiment of the application, the hybrid expert model includes at least one expert model.

[0110] During the training of the model to be processed, a language router is placed before the MDE FFN module of the model to be processed. Through the gating mechanism, based on the input dialect label d, the training text to be processed, and the semantic discrete feature sequence x of the speech corresponding to that training text, a vector of dimension n is obtained. The dialect label indicates the type of dialect. Each value in the vector represents the probability that the input is routed to the corresponding expert model. The gating mechanism can be specifically expressed as: G(x) = Softmax(KeepTopK(x, d), k)

[0111] According to the probability distribution output by the gating network, the top k most suitable expert models (k << n, where n is the total number of expert models) are selected, and the unselected expert models will not be activated. Each activated expert model has its own independent parameters, and finally, the output results of the selected expert models are weighted and fused according to the probabilities generated by the gating network.

[0112] For example, if k expert models are activated, their outputs are combined according to their respective weights to generate the final dialect-specific inference semantic discrete representation:

y = \sum_{i=1}^{k} G_i(x, d) E_i(x)

[0113] where y is the fused representation, G_i(x, d) represents the weight of the i-th expert model, and E_i(x) represents the processing result of the i-th expert model on the input.
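The routing and fusion described above can be sketched in a few lines. The sketch follows the usual sparse top-k gating pattern: logits outside the top k are masked before the softmax, so unselected experts receive zero weight and are never evaluated. How the dialect label d enters the gate is not specified in the text; adding a learned dialect embedding to x before a linear gate is an assumption here, as are all names, sizes, and the random toy experts.

```python
import numpy as np

def keep_top_k(logits: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest logits; set the rest to -inf so softmax gives them 0."""
    masked = np.full_like(logits, -np.inf)
    top = np.argsort(logits)[-k:]
    masked[top] = logits[top]
    return masked

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def mde_ffn(x, d, gate_w, dialect_emb, experts, k):
    """Weighted fusion of the outputs of the top-k experts."""
    logits = gate_w @ (x + dialect_emb[d])   # hypothetical gate: linear on x + dialect embedding
    g = softmax(keep_top_k(logits, k))       # n-dimensional routing probabilities
    active = np.flatnonzero(g)               # unselected experts are not activated
    y = sum(g[i] * experts[i](x) for i in active)
    return y, g

rng = np.random.default_rng(0)
n, dim, k = 4, 8, 2                          # hypothetical sizes, k < n
gate_w = rng.normal(size=(n, dim))
dialect_emb = rng.normal(size=(3, dim))      # three hypothetical dialect labels
experts = [lambda x, W=rng.normal(size=(dim, dim)): np.tanh(W @ x) for _ in range(n)]
y, g = mde_ffn(rng.normal(size=dim), d=1, gate_w=gate_w,
               dialect_emb=dialect_emb, experts=experts, k=k)
```

Because only the active experts are called, the compute per token scales with k rather than with the total number of experts n.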

[0114] The expert model is also called a dialect expert. Referring to Figure 3, a schematic diagram showing the relationship between the gating network and the dialect expert provided in the embodiment of the present application is shown. The preset text is input into the mixture of experts model, and the probability distribution of at least one expert model output by the gating network is obtained. There are activated dialect experts and unactivated dialect experts in the mixture of experts model.

[0115] It should be noted that a gating mechanism is a technique used to control information flow in a neural network. The gating mechanism determines which input signals should be passed to the subsequent layer or to the output by learning one or more gating functions. Gating networks are often used in complex tasks such as multi-modal data fusion, feature selection, and model sparsification.

[0116] In some embodiments of the present application, the mixture of experts model further includes a language router; the method includes:

[0117] Input the preset text into the mixture of experts model to obtain the probability distribution of the at least one expert model output by the language router;

[0118] Use the expert model to extract the semantic discrete features of the text;

[0119] Based on the probability distribution, the discrete semantic features of the text are fused to obtain the target discrete semantic features.

[0120] In this embodiment, the hybrid expert model further includes a language router. Preset text is input into the hybrid expert model to obtain the probability distribution of at least one expert model output by the language router. Semantic discrete features of the text are extracted using the expert models. Based on the probability distribution, the semantic discrete features extracted by multiple expert models are fused to obtain the target semantic discrete features. It should be noted that the relationship between the language router and the at least one expert model is the same during model training to obtain the hybrid expert model and during its application. The language router is used to determine the probability distribution of at least one expert model through a gating mechanism, which will not be elaborated further in this application.

[0121] In some embodiments of this application, fusing the semantic discrete features of the text based on the probability distribution to obtain the target semantic discrete features includes:

[0122] Based on the probability distribution, the weights of the at least one expert model are determined;

[0123] Based on the weights, the discrete semantic features of the text are fused to obtain the target discrete semantic features.

[0124] In this embodiment, the weights of at least one expert model are determined based on a probability distribution. The semantic discrete features of the text are then fused according to these weights to obtain the target semantic discrete features.

[0125] In some embodiments of this application, the method includes:

[0126] The target semantic discrete features are input into the Mel spectrogram generation model to obtain the Mel spectrogram of the text;

[0127] Based on the Mel spectrogram, the audio waveform of the text is obtained using a preset vocoder;

[0128] Based on the audio waveform and the text, the speech corresponding to the text is obtained.

[0129] In this embodiment, the semantic discrete features extracted by the hybrid expert model are input into the Mel spectrogram generation model to obtain the Mel spectrogram of the text. Based on the Mel spectrogram, the audio waveform of the text is obtained using a preset vocoder. The speech corresponding to the text is then obtained from the audio waveform and the text.
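The three inference stages above (text to semantic discrete features, features to Mel spectrogram, Mel spectrogram to waveform) can be chained as in the following sketch. The function `synthesize` and its three callable parameters are hypothetical placeholders; this application does not specify programming interfaces for the hybrid expert model, the Mel spectrogram generation model, or the vocoder, and the toy stand-ins in the usage lines only produce arrays of plausible shape.

```python
from typing import Callable, Sequence

import numpy as np

def synthesize(
    text: str,
    text_to_semantic: Callable[[str], Sequence[int]],
    semantic_to_mel: Callable[[Sequence[int]], np.ndarray],
    vocoder: Callable[[np.ndarray], np.ndarray],
) -> np.ndarray:
    """Run the three stages in order and return the audio waveform."""
    tokens = text_to_semantic(text)   # stage 1: semantic discrete features
    mel = semantic_to_mel(tokens)     # stage 2: Mel spectrogram, (n_mels, frames)
    return vocoder(mel)               # stage 3: waveform samples

# Toy stand-ins (assumptions for illustration only, not real models).
toy_tokens = lambda text: [ord(ch) % 8 for ch in text]
toy_mel = lambda tokens: np.zeros((80, 2 * len(tokens)))
toy_vocoder = lambda mel: np.zeros(mel.shape[1] * 256)

waveform = synthesize("abc", toy_tokens, toy_mel, toy_vocoder)
```

Keeping each stage behind a callable mirrors the modular structure of the embodiment: any stage can be replaced without changing the others.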

[0130] In this embodiment, a preset training speech is input into a preset vector quantizer to obtain training semantic discrete features of the training speech, which include the language style of the training speech. Training text corresponding to the training speech is obtained, and a preset autoregressive speech model is trained using the training text and the training semantic discrete features to obtain a semantic discrete feature generation model. Training Mel spectrograms corresponding to the training semantic discrete features are obtained, and a preset optimal transport conditional flow matching model is trained using the training semantic discrete features and the training Mel spectrograms to obtain a Mel spectrogram generation model. A speech generation model is then constructed based on the Mel spectrogram generation model and the semantic discrete feature generation model. In this way, a single speech generation model can generate speech in multiple language styles or dialects, which reduces the cost and complexity of development and deployment. The approach also scales well and significantly reduces reliance on dialect-specific linguistic knowledge and pronunciation dictionaries, which lowers annotation costs. Furthermore, because the speech generation model generates speech from semantic discrete features that correspond to language styles, it effectively alleviates the mechanical feel of audio produced by traditional speech generation methods.
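This application names an optimal transport conditional flow matching model but does not state its training objective. The sketch below uses a widely used formulation of that objective, which is an assumption here: a noise sample x0 and a target Mel spectrogram x1 are linearly interpolated at time t, and the model regresses the constant velocity of that path. The function names and the default `sigma_min` value are hypothetical, and the conditioning on semantic discrete features is omitted for brevity.

```python
import numpy as np

def ot_cfm_training_pair(x1, t, sigma_min=1e-4, rng=None):
    """Build one training example for a target Mel spectrogram x1.

    Returns the interpolated sample x_t and the regression target u_t, with
      x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
      u_t = x1 - (1 - sigma_min) * x0
    where x0 is standard Gaussian noise and t lies in [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    x0 = rng.standard_normal(x1.shape)                  # noise sample
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1   # point on the path
    u_t = x1 - (1.0 - sigma_min) * x0                   # velocity along the path
    return x_t, u_t

def ot_cfm_loss(predicted_velocity, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - u_t) ** 2))
```

In a full training loop, `predicted_velocity` would come from a network that receives x_t, t, and the semantic discrete features; at inference, integrating the predicted velocity from t = 0 to t = 1 turns noise into a Mel spectrogram.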

[0131] In this embodiment of the application, a hybrid expert model is obtained by fine-tuning a pre-trained semantic discrete feature generation model, which effectively reduces the mechanical feel of audio produced by traditional speech generation methods. This application simplifies the system architecture for dialect speech generation and achieves efficient, scalable multi-dialect speech generation. By introducing the hybrid expert model, speech in multiple dialects can be generated by a single model. The pre-trained semantic discrete feature generation model is applied to dialect speech generation, and a Mel spectrogram generation model is used to obtain Mel spectrograms and audio waveforms, generating more natural and human-like dialect speech.

[0132] This application employs a hybrid expert model, which divides the speech generation of multiple dialects into different subspaces. A gating mechanism routes the inputs of different dialects to the most suitable expert model, achieving a "divide and conquer" strategy. This approach enables the generation of multiple dialects on a single model, significantly reducing development and deployment costs, simplifying the system architecture for dialect generation applications, and eliminating the need to deploy a separate speech generation system for each dialect.
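Routing an input to the single most suitable expert can be sketched as an argmax over the router's probabilities. This is a simplified illustration that assumes hard (top-1) routing, in which all of the weight is placed on one expert; the embodiments above describe a weighted fusion, of which this is the limiting case. The function `route_to_expert` and the representation of experts as callables are hypothetical.

```python
import numpy as np

def route_to_expert(router_probs, experts, inputs):
    """Send the input to the expert with the highest router probability.

    router_probs: one probability per expert (assumed language router output)
    experts:      list of callables, one per expert
    Returns the index of the chosen expert and that expert's output.
    """
    index = int(np.argmax(router_probs))
    return index, experts[index](inputs)
```

Because only the selected expert is evaluated, hard routing keeps the per-input compute close to that of a single expert, whichever dialect the input belongs to.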

[0133] This application employs a character-level modeling approach that directly uses a GPT model. Based on the contextual information of the character representations, the model's strong representational capacity is used to automatically learn the mapping between characters and high-dimensional semantic discrete representations, predicting discrete speech token sequences from which speech is generated. This eliminates the need to maintain a traditional G2P (grapheme-to-phoneme) model for each dialect and removes the reliance on extensive linguistic knowledge, rules, pronunciation dictionaries, and large-scale, precise phoneme annotation data. Only speech and the corresponding text transcriptions are required as input, which significantly reduces data annotation and model development costs. Furthermore, by modeling directly at the character level, the model can better capture global contextual information rather than being limited to local phoneme representations, resulting in more natural, coherent, and human-like generated speech.
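Character-level input means the text is mapped to token IDs one character at a time, with no phoneme conversion step. The following sketch shows one way to do this; the special tokens and the function names `build_char_vocab` and `encode_chars` are hypothetical, and the tokenizer actually used with the GPT model is not described in this application.

```python
def build_char_vocab(texts, specials=("<pad>", "<bos>", "<eos>", "<unk>")):
    """Assign an integer ID to every distinct character in the training texts."""
    vocab = {token: index for index, token in enumerate(specials)}
    for text in texts:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab

def encode_chars(text, vocab):
    """Encode text as character IDs wrapped in begin and end markers.

    Characters missing from the vocabulary map to the <unk> ID.
    """
    unk = vocab["<unk>"]
    return [vocab["<bos>"]] + [vocab.get(ch, unk) for ch in text] + [vocab["<eos>"]]
```

No pronunciation dictionary appears anywhere in this path: a new dialect needs only transcribed text so that its characters enter the vocabulary.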

[0134] This application uses, as its foundation, a semantic discrete feature generation model pre-trained on more than 100,000 hours of Mandarin and English speech data, and fine-tunes the model to be processed to obtain the hybrid expert model. The large-scale training data covers a richer and broader range of speech characteristics, including diverse intonation, speech rate, prosody, and pronunciation details. Having been trained on this data, the pre-trained model has strong contextual understanding as well as strong domain knowledge transfer and generalization abilities. This approach reduces the mechanical feel of traditional speech generation methods and improves the naturalness, coherence, and fluency of the generated speech.

[0135] Referring to Figure 4, a flowchart of the steps of another method for constructing a speech generation model provided in this embodiment is shown, which may specifically include the following steps:

[0136] Step 401: Extract discrete speech quantization representations from large-scale speech data.

[0137] Step 402: Pre-train the text-to-semantic and semantic-to-speech stages of the base speech generation model.

[0138] Step 403: Dialect data collection, manual transcription, and quality inspection.

[0139] Step 404: Construct a dialect hybrid expert model by fine-tuning the text-to-semantic stage.
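As an illustration of Step 401, the sketch below shows generic nearest-neighbor vector quantization, in which each speech frame embedding is replaced by the index of the closest codebook vector. This application does not describe the internals of its preset vector quantizer, so this is an assumed textbook form; the function name `quantize` and the array shapes are hypothetical.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook vector.

    frames:   array of shape (num_frames, dim), continuous speech embeddings
    codebook: array of shape (codebook_size, dim)
    Returns an integer array of shape (num_frames,), the discrete tokens.
    """
    frames = np.asarray(frames, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance between every frame and every codebook entry.
    diffs = frames[:, None, :] - codebook[None, :, :]
    distances = np.sum(diffs ** 2, axis=-1)
    return np.argmin(distances, axis=1)
```

The resulting index sequence is the kind of discrete representation that the text-to-semantic stage is later trained to predict.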

[0140] For example, the method for constructing a speech generation model further includes using the constructed speech generation model to generate speech based on the input text. It is readily understood that the text to be converted can be used as input to the constructed speech generation model, which processes the input text and generates speech as output, thereby converting the text to be converted into speech. Furthermore, the generated speech can be from various dialects.

[0141] It should be noted that, for simplicity, the method embodiments are each described as a series of actions. However, those skilled in the art should understand that the embodiments of this application are not limited to the described order of actions, because some steps may be performed in other orders or simultaneously according to the embodiments of this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of this application.

[0142] Referring to Figure 5, a structural block diagram of a speech generation model construction device provided in an embodiment of this application is shown, which may specifically include the following modules:

[0143] The training semantic discrete feature acquisition module 501 is used to input a preset training speech into a preset vector quantizer to obtain the training semantic discrete features of the training speech; the training semantic discrete features include the language style of the training speech.

[0144] The autoregressive speech model training module 502 is used to obtain the training text corresponding to the training speech, and use the training text and the training semantic discrete features to train a preset autoregressive speech model to obtain a semantic discrete feature generation model.

[0145] The training Mel spectrogram acquisition module 503 is used to acquire the training Mel spectrogram corresponding to the training semantic discrete features;

[0146] The optimal transport conditional flow matching model training module 504 is used to train a preset optimal transport conditional flow matching model using the training semantic discrete features and the training Mel spectrogram to obtain a Mel spectrogram generation model.

[0147] The speech generation model construction module 505 is used to construct a speech generation model based on the Mel spectrogram generation model and the semantic discrete feature generation model.

[0148] In one optional embodiment of this application, the speech generation model construction module includes:

[0149] The model construction submodule is used to construct the model to be processed using the semantic discrete feature generation model.

[0150] The training text acquisition submodule is used to acquire the training text to be processed in the target language style and the semantic discrete features of the speech corresponding to the training text to be processed.

[0151] The hybrid expert model acquisition submodule is used to fine-tune the model to be processed using the training text to be processed and the semantic discrete features of the speech corresponding to the training text to be processed, to obtain the hybrid expert model;

[0152] The speech generation model construction submodule is used to construct the speech generation model based on the hybrid expert model and the Mel spectrogram generation model.

[0153] In one optional embodiment of this application, the hybrid expert model includes at least one expert model.

[0154] In one optional embodiment of this application, the hybrid expert model further includes a language router; the apparatus includes:

[0155] The probability distribution acquisition module is used to input preset text into the hybrid expert model and obtain the probability distribution of the at least one expert model output by the language router;

[0156] An extraction module is used to extract the semantic discrete features of the text using the expert model;

[0157] The fusion module is used to fuse the semantic discrete features of the text based on the probability distribution to obtain the target semantic discrete features.

[0158] In one optional embodiment of this application, the apparatus includes:

[0159] The Mel spectrogram acquisition module is used to input the target semantic discrete features into the Mel spectrogram generation model to obtain the Mel spectrogram of the text;

[0160] An audio waveform acquisition module is used to acquire the audio waveform of the text based on the Mel spectrogram using a preset vocoder.

[0161] The speech acquisition module is used to acquire the speech corresponding to the text based on the audio waveform and the text.

[0162] In one optional embodiment of this application, the autoregressive speech model training module includes:

[0163] The inference output submodule is used to take the training semantic discrete features as the inference output of the autoregressive speech model during the training of the autoregressive speech model.
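One common way to use the training semantic discrete features as the output the autoregressive model must produce is next-token training with teacher forcing, sketched below. The sketch assumes a decoder-only model that reads text tokens followed by speech tokens in a single sequence, with the loss computed only at speech positions; these choices, the function name `build_training_example`, and the `ignore_index` value of -100 (a convention used by some training frameworks) are assumptions, not details from this application.

```python
def build_training_example(text_ids, speech_ids, ignore_index=-100):
    """Build inputs and targets for next-token training.

    inputs:  the text tokens followed by all but the last speech token
    targets: ignore_index where the next token is text (no loss there),
             then the speech tokens the model must predict
    Requires at least one text token.
    """
    sequence = list(text_ids) + list(speech_ids)
    inputs = sequence[:-1]
    targets = [ignore_index] * (len(text_ids) - 1) + list(speech_ids)
    return inputs, targets
```

For text tokens [10, 11, 12] and speech tokens [5, 6], the inputs are [10, 11, 12, 5] and the targets are [-100, -100, 5, 6], so the model is trained to emit the first speech token right after the last text token.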

[0164] In one optional embodiment of this application, the fusion module includes:

[0165] The weight determination submodule is used to determine the weights of the at least one expert model based on the probability distribution;

[0166] The fusion submodule is used to fuse the semantic discrete features of the text based on the weights to obtain the target semantic discrete features.

[0167] In one optional embodiment of this application, the training text to be processed in the target language style is a dialect text.

[0168] In one optional embodiment of this application, the apparatus further includes a speech generation module for generating speech based on the input text using the constructed speech generation model.

[0169] As the device embodiment is basically similar to the method embodiment, the description is relatively simple, and relevant parts can be found in the description of the method embodiment.

[0170] In addition, an embodiment of this application also provides an electronic device, as shown in Figure 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, wherein the processor 601, the communication interface 602, and the memory 603 communicate with each other through the communication bus 604.

[0171] Memory 603 is used to store computer programs;

[0172] When processor 601 executes a program stored in memory 603, it performs the following steps:

[0173] The preset training speech is input into a preset vector quantizer to obtain the training semantic discrete features of the training speech; the training semantic discrete features include the language style of the training speech;

[0174] Obtain the training text corresponding to the training speech, and use the training text and the training semantic discrete features to train a preset autoregressive speech model to obtain a semantic discrete feature generation model.

[0175] Obtain the training Mel spectrogram corresponding to the training semantic discrete features;

[0176] The preset optimal transport conditional flow matching model is trained using the training semantic discrete features and the training Mel spectrogram to obtain the Mel spectrogram generation model;

[0177] A speech generation model is constructed based on the Mel spectrogram generation model and the semantic discrete feature generation model.

[0178] In one optional embodiment of this application, constructing a speech generation model based on the Mel spectrogram generation model and the semantic discrete feature generation model includes:

[0179] The model to be processed is constructed using the semantic discrete feature generation model;

[0180] Obtain the training text to be processed in the target language style and the semantic discrete features of the speech corresponding to the training text to be processed;

[0181] The model to be processed is fine-tuned using the training text to be processed and the semantic discrete features of the speech corresponding to the training text to be processed, so as to obtain a hybrid expert model;

[0182] The speech generation model is constructed based on the hybrid expert model and the Mel spectrogram generation model.

[0183] In one optional embodiment of this application, the hybrid expert model includes at least one expert model.

[0184] In one optional embodiment of this application, the hybrid expert model further includes a language router; the method includes:

[0185] The preset text is input into the hybrid expert model to obtain the probability distribution of the at least one expert model output by the language router;

[0186] The at least one expert model is used to extract the semantic discrete features of the text;

[0187] Based on the probability distribution, the semantic discrete features of the text are fused to obtain the target semantic discrete features.

[0188] In one optional embodiment of this application, the method includes:

[0189] The target semantic discrete features are input into the Mel spectrogram generation model to obtain the Mel spectrogram of the text;

[0190] Based on the Mel spectrogram, the audio waveform of the text is obtained using a preset vocoder;

[0191] Based on the audio waveform and the text, the speech corresponding to the text is obtained.

[0192] In one optional embodiment of this application, training a preset autoregressive speech model using the training text and the training semantic discrete features to obtain a semantic discrete feature generation model includes:

[0193] During the training process of the autoregressive speech model, the training semantic discrete features are used as the inference output of the autoregressive speech model.

[0194] In one optional embodiment of this application, fusing the semantic discrete features of the text based on the probability distribution to obtain the target semantic discrete features includes:

[0195] Based on the probability distribution, the weights of the at least one expert model are determined;

[0196] Based on the weights, the semantic discrete features of the text are fused to obtain the target semantic discrete features.

[0197] The communication bus mentioned above can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not mean that there is only one bus or one type of bus.

[0198] The communication interface is used for communication between the aforementioned electronic device and other devices.

[0199] The memory may include random access memory (RAM) or non-volatile memory, such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0200] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0201] The features described in the embodiments of the above-mentioned speech generation model construction method are applicable to embodiments of electronic devices. Various implementation methods and beneficial effects of embodiments of electronic devices can be found in the relevant descriptions in the embodiments of the aforementioned speech generation model construction method, and will not be repeated here.

[0202] As shown in Figure 7, in another embodiment provided in this application, a computer-readable storage medium 701 is also provided, which stores instructions that, when executed on a computer, cause the computer to execute a method for constructing a speech generation model as described in the above embodiments.

[0203] The features described in the embodiments of the above-described speech generation model construction method are applicable to embodiments of computer-readable storage media. Various implementation methods and beneficial effects of the computer-readable storage media embodiments can be found in the relevant descriptions in the embodiments of the aforementioned speech generation model construction method, and will not be repeated here.

[0204] In another embodiment provided in this application, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute a method for constructing a speech generation model as described in the above embodiments.

[0205] The features described in the embodiments of the above-mentioned speech generation model construction method are applicable to the embodiments of the computer program product. Various implementation methods and beneficial effects of the embodiments of the computer program product can be found in the relevant descriptions in the embodiments of the aforementioned speech generation model construction method, and will not be repeated here.

[0206] The above embodiments can be implemented entirely or partially through software, hardware, firmware, or any combination thereof. When implemented using software, they can be implemented entirely or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are produced. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).

[0207] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0208] The embodiments in this specification are described in a related manner, and identical or similar parts of the embodiments may be cross-referenced. Each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiments are basically similar to the method embodiments, their description is relatively brief; for relevant details, refer to the descriptions of the method embodiments.

[0209] The above description is merely a preferred embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application are included within the scope of protection of this application.

Claims

1. A method for constructing a speech generation model, comprising: inputting a preset training speech into a preset vector quantizer to obtain training semantic discrete features of the training speech, wherein the training semantic discrete features include the language style of the training speech; obtaining a training text corresponding to the training speech, and training a preset autoregressive speech model using the training text and the training semantic discrete features to obtain a semantic discrete feature generation model; obtaining a training Mel spectrogram corresponding to the training semantic discrete features; training a preset optimal transport conditional flow matching model using the training semantic discrete features and the training Mel spectrogram to obtain a Mel spectrogram generation model; and constructing a speech generation model based on the Mel spectrogram generation model and the semantic discrete feature generation model.

2. The method according to claim 1, wherein constructing the speech generation model based on the Mel spectrogram generation model and the semantic discrete feature generation model comprises: constructing a model to be processed using the semantic discrete feature generation model; obtaining a training text to be processed in a target language style and semantic discrete features of the speech corresponding to the training text to be processed; fine-tuning the model to be processed using the training text to be processed and the semantic discrete features of the speech corresponding to the training text to be processed, so as to obtain a hybrid expert model; and constructing the speech generation model based on the hybrid expert model and the Mel spectrogram generation model.

3. The method according to claim 2, wherein the hybrid expert model includes at least one expert model.

4. The method according to claim 3, wherein the hybrid expert model further includes a language router, and the method comprises: inputting a preset text into the hybrid expert model to obtain a probability distribution over the at least one expert model output by the language router; extracting semantic discrete features of the text using the at least one expert model; and fusing the semantic discrete features of the text based on the probability distribution to obtain target semantic discrete features.

5. The method according to claim 4, comprising: inputting the target semantic discrete features into the Mel spectrogram generation model to obtain a Mel spectrogram of the text; obtaining an audio waveform of the text from the Mel spectrogram using a preset vocoder; and obtaining the speech corresponding to the text based on the audio waveform and the text.

6. The method according to claim 1, wherein training the preset autoregressive speech model using the training text and the training semantic discrete features to obtain the semantic discrete feature generation model comprises: during training of the autoregressive speech model, using the training semantic discrete features as the inference output of the autoregressive speech model.

7. The method according to claim 4, wherein fusing the semantic discrete features of the text based on the probability distribution to obtain the target semantic discrete features comprises: determining weights of the at least one expert model based on the probability distribution; and fusing the semantic discrete features of the text based on the weights to obtain the target semantic discrete features.

8. The method according to claim 2, wherein the training text to be processed in the target language style is a dialect text.

9. The method according to any one of claims 1-8, further comprising: generating speech based on input text using the constructed speech generation model.

10. An apparatus for constructing a speech generation model, comprising: a training semantic discrete feature acquisition module, used to input a preset training speech into a preset vector quantizer to obtain training semantic discrete features of the training speech, wherein the training semantic discrete features include the language style of the training speech; an autoregressive speech model training module, used to obtain a training text corresponding to the training speech and to train a preset autoregressive speech model using the training text and the training semantic discrete features to obtain a semantic discrete feature generation model; a training Mel spectrogram acquisition module, used to acquire a training Mel spectrogram corresponding to the training semantic discrete features; an optimal transport conditional flow matching model training module, used to train a preset optimal transport conditional flow matching model using the training semantic discrete features and the training Mel spectrogram to obtain a Mel spectrogram generation model; and a speech generation model construction module, used to construct a speech generation model based on the Mel spectrogram generation model and the semantic discrete feature generation model.

11. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store a computer program; and the processor, when executing the program stored in the memory, implements the method according to any one of claims 1-9.

12. One or more computer-readable media having instructions stored thereon that, when executed by one or more processors, cause the processors to perform the method as described in any one of claims 1-9.

13. A computer program product comprising computer instructions, wherein the computer instructions, when executed by a processor, implement the method according to any one of claims 1-9.