A multilingual speech synthesis method and related apparatus
By introducing a dual-gated network architecture into the multilingual speech synthesis model and combining text features and style features to generate expert routing results, the problem of unbalanced load in the early stage of model training is solved, and the accuracy and stability of speech synthesis are improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ANHUI IFLYTEK UNIVERSAL LANGUAGE TECH CO LTD
- Filing Date
- 2025-12-24
- Publication Date
- 2026-06-30
AI Technical Summary
In the early stages of training, existing multilingual speech synthesis models suffer from unbalanced load due to insufficient specialization of expert networks, leading to a decrease in the accuracy of synthesized speech. The existing algorithm-driven load balancing methods impair the model's representational ability and the degree of expert specialization.
A dual-gated network architecture is adopted. The first branch network generates expert routing results for the text feature dimension, the second branch network generates expert routing results for the style feature dimension, and the third expert routing result is generated by combining the language identifier. Together, they determine the target expert network to be activated, forming a natural balance that matches the language characteristics.
By leveraging the differences in text features and stylistic expression between languages, the accuracy of expert routing and speech synthesis is improved, while damage to the model's representational ability and the degree of expert specialization is avoided.
Description
Technical Field
[0001] This application relates to the field of speech synthesis technology, and in particular to a multilingual speech synthesis method and related apparatus.
Background Technology
[0002] Currently, when speech synthesis models face complex multilingual speech synthesis tasks, they often integrate Mixture of Experts (MoE) networks. For example, speech synthesis models based on large language models integrate an MoE in every Transformer layer. The gating network in the MoE determines the target expert network to be activated for the current task based on the input hidden state, and then feeds the hidden state into the target expert network for further processing. Because the gating network can distinguish the language of the current task based on the hidden state, it can route data from different languages to different expert networks.
[0003] However, in the early stages of training, speech synthesis models integrating MoE suffer from significant randomness in routing decisions due to the lack of expert specialization. This can lead to some experts being accessed excessively frequently, while others remain under-learned, resulting in an expert load imbalance problem. Current approaches typically address load balancing from a purely algorithmic and data-driven perspective. However, this purely algorithmic and data-driven approach is a forced balancing method, which can impair the model's representational capabilities and expert specialization, ultimately reducing the accuracy of synthesized speech.
[0004] Even after training, the input hidden state mostly carries information in the text feature dimension. Since some languages are very similar in their text features, it may be difficult to distinguish them accurately by relying on text features alone. This aggravates expert misrouting and further reduces the accuracy of synthesized speech.
Summary of the Invention
[0005] In view of the above problems, this application provides a multilingual speech synthesis method and related apparatus to achieve accurate expert routing by utilizing the dual differences in text features and stylistic expression between languages, thereby improving the accuracy of speech synthesis. The specific solution is as follows:
[0006] The first aspect of this application provides a multilingual speech synthesis method, including:
[0007] Obtain the target data for speech synthesis;
[0008] A speech synthesis model is invoked to generate a discrete speech token sequence based on the target data. The speech synthesis model integrates a Mixture of Experts (MoE) network. The gating network of the MoE includes a first branch network and a second branch network. The first branch network is used to generate a first expert routing result in the text feature dimension for the input hidden state. The hidden state is the feature extracted from the target data by the network layers preceding the gating network. The second branch network is used to generate a second expert routing result in the style feature dimension for the target data. The first expert routing result and the second expert routing result together determine the target expert network to be activated.
[0009] Speech synthesis is performed based on the discrete speech token sequence to obtain synthesized speech.
[0010] In one possible implementation, the target data includes target speech, and the second branch network generates a second expert routing result for the target speech in the style feature dimension, including:
[0011] The target speech is preprocessed by the input layer to obtain the Mel spectrogram of the target speech;
[0012] The target acoustic features are obtained by encoding the Mel spectrogram using a reference encoder.
[0013] Style resampling is performed on the target acoustic features by a perceptual resampler to obtain N latent style features, where N is a positive integer.
[0014] The output layer generates the second expert routing result based on the N latent style features, wherein the second expert routing result includes the routing results of each expert network corresponding to the N latent style features.
[0015] In one possible implementation, encoding the acoustic features of the Mel spectrogram through the reference encoder to obtain the target acoustic features includes:
[0016] The input acoustic features are downsampled by each of the convolutional neural networks in one or more sequentially connected convolutional neural networks to obtain the output acoustic features. The input acoustic features of any convolutional neural network are the output acoustic features of the preceding convolutional neural network, and the input acoustic features of the first convolutional neural network are the Mel spectrogram.
[0017] The target acoustic features are obtained by performing global average pooling on the output acoustic features of the last convolutional neural network through a pooling layer.
[0018] In one possible implementation, performing style resampling based on the target acoustic features through the perceptual resampler to obtain N latent style features includes:
[0019] Cross-attention is computed between the target acoustic features and N learnable query vectors through an attention layer to obtain the query attention representation output by the attention layer;
[0020] A residual calculation is performed on the target acoustic features and the query attention representation through the first residual layer to obtain the residual fusion representation;
[0021] The residual fusion representation is passed through a feedforward network layer to obtain a feedforward enhanced representation;
[0022] A residual calculation is performed on the residual fusion representation and the feedforward enhanced representation through the second residual layer to obtain the N latent style features.
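The resampling steps above can be illustrated with a minimal, dependency-free Python sketch. It is an illustration under stated assumptions, not the implementation of this application: the target acoustic features are treated as a short sequence of frame vectors, learned query/key/value projections and biases are omitted, and the first residual adds the attention output to the learnable queries (a common Perceiver-style arrangement) so that the shapes agree. The names `cross_attention`, `feed_forward`, and `resample` are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, frames):
    """Each of the N queries attends over the T acoustic frames."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in frames]
        weights = softmax(scores)
        out.append([sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)])
    return out


def feed_forward(x, w1, w2):
    """Two-layer feedforward network with ReLU (biases omitted)."""
    hidden = [max(0.0, sum(x[i] * w1[i][j] for i in range(len(x))))
              for j in range(len(w1[0]))]
    return [sum(hidden[j] * w2[j][i] for j in range(len(hidden)))
            for i in range(len(w2[0]))]


def add(a, b):
    """Element-wise sum of two [N][d] matrices (a residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def resample(frames, queries, w1, w2):
    attn = cross_attention(queries, frames)               # query attention representation
    fused = add(queries, attn)                            # first residual (assumed on the queries)
    enhanced = [feed_forward(x, w1, w2) for x in fused]   # feedforward enhanced representation
    return add(fused, enhanced)                           # second residual: N latent style features


# Toy example: T = 3 frames, N = 2 learnable queries, feature size 2.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[2.0, 0.0], [0.0, 2.0]]
w1 = [[0.1, 0.0, -0.1], [0.0, 0.1, 0.1]]    # shape [2][3], illustrative values
w2 = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]   # shape [3][2], illustrative values
latents = resample(frames, queries, w1, w2)
print(len(latents), len(latents[0]))  # prints: 2 2
```

The output has one row per learnable query, so N is fixed by the number of queries rather than by the length of the reference speech.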
[0023] In one possible implementation, the gating network of the MoE further includes a third branch network, and the target data further includes a language identifier of the language to be synthesized. The third branch network is used to generate a third expert routing result in the language feature dimension based on the language identifier.
[0024] The first expert routing result and the second expert routing result jointly determine the target expert network to be activated, including:
[0025] The first expert routing result, the second expert routing result, and the third expert routing result together determine the target expert network.
[0026] In one possible implementation, the third branch network generates a third expert routing result in the language feature dimension based on the language identifier, including:
[0027] The language identifier is converted into a language vector representation through an embedding layer;
[0028] The language vector representation is subjected to a second linear transformation through a second linear transformation layer to obtain the third expert routing result.
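As a hedged illustration of the two steps above, the following Python sketch looks up an embedding for the language identifier and applies a linear layer to it. The table contents, sizes, and the name `third_branch` are hypothetical; only the pairing of "Cantonese" with identifier 0 comes from an example in the description.

```python
def third_branch(lang_id, embedding_table, weight, bias):
    """Embedding lookup followed by a linear layer.

    Returns a [1][num_expert] routing result (sequence length 1).
    """
    lang_vec = embedding_table[lang_id]  # embedding layer
    logits = [
        sum(v * weight[d][e] for d, v in enumerate(lang_vec)) + bias[e]
        for e in range(len(bias))
    ]  # second linear transformation
    return [logits]


# Hypothetical identifier table; "Mandarin" -> 1 is an assumption.
LANG_IDS = {"Cantonese": 0, "Mandarin": 1}
embedding_table = [[1.0, 2.0], [-1.0, 0.5]]    # [num_lang][d_lang]
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]    # [d_lang][num_expert]
bias = [0.0, 0.0, 0.5]

third_routing = third_branch(LANG_IDS["Cantonese"], embedding_table, weight, bias)
print(third_routing)  # prints: [[1.0, 2.0, 3.5]]
```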
[0029] In one possible implementation, where the gating network of the MoE includes the third branch network, N is greater than 1;
[0030] The first expert routing result, the second expert routing result, and the third expert routing result jointly determine the target expert network, including:
[0031] The third expert routing result and the second expert routing result are concatenated along the sequence length dimension where N is located to obtain the concatenated routing result.
[0032] The concatenated routing result is transformed by mapping the sequence length dimension to align with the first expert routing result in the sequence length dimension, thus obtaining the transformed routing result;
[0033] The transformed routing result and the first expert routing result are weighted and fused to obtain the fused routing result;
[0034] The target expert network is determined based on the fused routing results.
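The fusion steps above can be sketched in plain Python as follows. This is a minimal sketch under assumptions: the batch dimension is dropped, the sequence-length mapping is modeled as a mixing matrix of shape [L][N + 1] (uniform averaging here, whereas a trained model would presumably learn it), and `alpha` is a hypothetical fusion weight. All function names are illustrative.

```python
def concat_along_sequence(third, second):
    """[1][E] and [N][E] -> [N + 1][E]."""
    return third + second


def map_sequence_length(routing, mixing):
    """Project [N + 1][E] to [L][E] with a mixing matrix of shape [L][N + 1]."""
    num_expert = len(routing[0])
    return [
        [sum(row[j] * routing[j][e] for j in range(len(routing)))
         for e in range(num_expert)]
        for row in mixing
    ]


def weighted_fusion(first, transformed, alpha):
    """alpha * first + (1 - alpha) * transformed, element-wise over [L][E]."""
    return [
        [alpha * f + (1.0 - alpha) * t for f, t in zip(f_row, t_row)]
        for f_row, t_row in zip(first, transformed)
    ]


# Toy sizes: N = 2 latent style features, E = 3 experts, L = 4 tokens.
second = [[0.0, 3.0, 0.0], [0.0, 1.0, 2.0]]     # second expert routing result
third = [[0.0, 2.0, 1.0]]                       # third expert routing result
first = [[1.0, 0.0, 0.0] for _ in range(4)]     # first expert routing result

concatenated = concat_along_sequence(third, second)               # [3][3]
mixing = [[1.0 / len(concatenated)] * len(concatenated) for _ in first]
transformed = map_sequence_length(concatenated, mixing)           # [4][3]
fused = weighted_fusion(first, transformed, alpha=0.5)            # [4][3]

# Top-1 expert per token, chosen from the fused routing result.
target_experts = [max(range(len(row)), key=lambda e: row[e]) for row in fused]
print(target_experts)  # prints: [1, 1, 1, 1]
```

In this toy case the text branch alone would pick expert 0 for every token, while the style and language branches shift the decision to expert 1.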
[0035] In one possible implementation, the language identifier includes a dialect identifier and / or a language identifier.
[0036] In one possible implementation, the first branch network generates a first expert routing result in the text feature dimension for the input hidden state, including:
[0037] The first expert routing result is obtained by performing a first linear transformation on the hidden state through a first linear transformation layer.
[0038] A second aspect of this application provides a computer program product including computer-readable instructions that, when executed on an electronic device, cause the electronic device to implement the multilingual speech synthesis method described in the first aspect or any implementation thereof.
[0039] A third aspect of this application provides an electronic device, comprising at least one processor and a memory connected to the processor, wherein:
[0040] The memory is used to store computer programs;
[0041] The processor is used to execute the computer program so that the electronic device can implement the multilingual speech synthesis method of the first aspect or any implementation thereof.
[0042] A fourth aspect of this application provides a computer storage medium carrying one or more computer programs, which, when executed by an electronic device, enable the electronic device to implement the multilingual speech synthesis method described in the first aspect or any implementation thereof.
[0043] With the above technical solution, the multilingual speech synthesis method provided in this application obtains the target data to be synthesized and invokes the speech synthesis model to generate a discrete speech token sequence based on the target data. Languages with similar text features usually differ significantly in style expression. To make full use of these differences, this application integrates a Mixture of Experts (MoE) network into the speech synthesis model and adds a language-style-related gating network alongside the original gating network of the MoE, forming a dual-gating network that consists of a first branch network and a second branch network. The first branch network retains the original function, that is, generating a first expert routing result in the text feature dimension for the input hidden state. The second branch network generates a second expert routing result in the style feature dimension for the target data. The first expert routing result and the second expert routing result jointly determine the target expert network to be activated. Finally, speech synthesis is performed based on the discrete speech token sequence to obtain synthesized speech. This application therefore exploits both the differences in text features and the differences in style expression between languages. Because both are inherent dimensions along which languages differ, routing on them yields a natural expert balance that matches the characteristics of each language. This avoids damaging the model's representational ability and the degree of expert specialization, and it improves the accuracy of expert routing and thereby of speech synthesis.
Attached Figure Description
[0044] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and that components and elements are not necessarily drawn to scale.
[0045] Figure 1 is a schematic diagram of a system architecture provided in this application;
[0046] Figure 2 is a flowchart of a multilingual speech synthesis method provided in this application;
[0047] Figure 3 is a schematic diagram of the structure of a second branch network provided in this application;
[0048] Figure 4 is a schematic diagram of the structure of a multilingual speech synthesis device provided in this application;
[0049] Figure 5 is a schematic diagram of the structure of an electronic device provided in this application.
Detailed Implementation
[0050] The embodiments of this application are described below with reference to the accompanying drawings. The terminology used in the implementation section of this application is for explaining specific embodiments only and is not intended to limit the scope of this application.
[0051] The embodiments of this application will now be described with reference to the accompanying drawings. Those skilled in the art will recognize that, with technological advancements and the emergence of new scenarios, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
[0052] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms are interchangeable where appropriate; this is merely a way of distinguishing objects with the same attributes in the embodiments of this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or apparatus that comprises a series of elements is not necessarily limited to those elements, but may include other elements not explicitly listed or inherent to those processes, methods, products, or apparatuses.
[0053] As mentioned in the background, integrating MoE may lead to load imbalance. To solve this problem, current solutions are mainly provided from the perspective of pure algorithms and data-driven approaches.
[0054] For example, load balancing schemes based on auxiliary loss functions are the most direct and common methods for handling MoE load imbalance. The core of this approach is to introduce an auxiliary loss term, such as importance loss or load loss, specifically for balancing expert load, in addition to the model's standard task loss function. Importance loss aims to balance the sum of the "importance" of the samples processed by each expert in a batch of data as much as possible; load loss, on the other hand, more directly pursues a balanced distribution of the number of samples assigned to each expert.
[0055] While load balancing schemes based on auxiliary loss functions can theoretically force a balanced expert load, their side effects are particularly significant in speech synthesis tasks. Forced balancing may contradict the original intent of the MoE design: in multilingual speech synthesis, the desired outcome is not absolute mathematical balance, but rather a natural balance that matches the characteristics of the language. For example, a dialect with higher data complexity or larger data volume should be allocated more computational resources (i.e., more frequent use of certain experts). The auxiliary loss function, in pursuit of quantitative balance, may incorrectly route samples that should be processed by the target expert to other experts, interfering with the expert specialization process, leading to a decrease in model representation ability and a loss of naturalness in the synthesized speech.
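For readers unfamiliar with such auxiliary terms, the sketch below shows one common formulation, the squared coefficient of variation, applied to per-expert importance and to per-expert token counts. It is an assumption for illustration: the source does not specify the exact loss, the count-based load term shown here is not differentiable (practical systems use a smooth estimator), and `aux_weight` and the placeholder `task_loss` are hypothetical.

```python
def cv_squared(values):
    """Squared coefficient of variation: variance / mean^2."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance / (mean ** 2 + 1e-10)


def importance_loss(gates):
    """gates: per-token routing weights, shape [num_tokens][num_expert]."""
    num_expert = len(gates[0])
    importance = [sum(g[e] for g in gates) for e in range(num_expert)]
    return cv_squared(importance)


def load_loss(gates):
    """Count-based variant: number of tokens whose top-1 expert is e."""
    num_expert = len(gates[0])
    counts = [0] * num_expert
    for g in gates:
        counts[max(range(num_expert), key=lambda e: g[e])] += 1
    return cv_squared(counts)


# Four tokens that all prefer expert 0: both terms are far from zero.
gates = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
task_loss = 1.0      # placeholder for the model's standard task loss
aux_weight = 0.01    # hypothetical coefficient
total_loss = task_loss + aux_weight * (importance_loss(gates) + load_loss(gates))
```

Both terms reach zero only when every expert receives the same share, which is the purely numerical balance that the paragraph above argues against for multilingual speech synthesis.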
[0056] For example, a scheme based on token discarding and load limiting limits the processing capacity of each expert by setting a hard threshold to prevent individual experts from becoming overloaded. However, with a hard threshold, when the routing weight calculated by the gating network for a certain expert exceeds its preset capacity, the system will forcibly discard some input tokens with lower weights or reroute them to other less overloaded experts. This method is also typically combined with the aforementioned auxiliary loss function for collaborative operation.
[0057] However, speech signals have very strong temporal coherence and information density. Losing any token can cause audible pops, discontinuities, or semantic distortion in the synthesized speech, reducing the accuracy of speech synthesis. Rerouting can still lose tokens when no suitable expert is found, and it introduces routing noise that disrupts the coherence and stability of speech generation. For TTS systems that require high fidelity and naturalness, this approach is poorly suited.
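The hard-threshold behavior described above can be made concrete with a small Python sketch. It assumes top-1 assignments and a single per-expert capacity, and it only drops overflow tokens (no rerouting); the function name `apply_capacity` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def apply_capacity(assignments, capacity):
    """Keep at most `capacity` tokens per expert, preferring higher weights.

    assignments: list of (token_id, expert_id, weight) top-1 routing decisions.
    Returns (kept, dropped): kept maps expert_id -> sorted token ids,
    dropped is the sorted list of token ids that exceeded the capacity.
    """
    by_expert = defaultdict(list)
    for token_id, expert_id, weight in assignments:
        by_expert[expert_id].append((weight, token_id))

    kept, dropped = {}, []
    for expert_id, items in by_expert.items():
        items.sort(reverse=True)  # highest routing weight first
        kept[expert_id] = sorted(t for _, t in items[:capacity])
        dropped.extend(t for _, t in items[capacity:])
    return kept, sorted(dropped)


# Five consecutive tokens; four of them are routed to expert 0.
assignments = [(0, 0, 0.9), (1, 0, 0.6), (2, 0, 0.8), (3, 1, 0.7), (4, 0, 0.5)]
kept, dropped = apply_capacity(assignments, capacity=2)
print(kept, dropped)  # prints: {0: [0, 2], 1: [3]} [1, 4]
```

Tokens 1 and 4 vanish from the middle and the end of the sequence, which is exactly the kind of gap that becomes audible in synthesized speech.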
[0058] In summary, the solutions mentioned above are all forced equalization methods, which will impair the model's representational ability and the level of expert specialization, leading to a decrease in the accuracy of synthesized speech.
[0059] Of course, there are other solutions for load balancing, such as soft optimization schemes that modify the routing mechanism and induce a more balanced load distribution by improving the routing algorithm itself. For example, Gaussian noise can be injected into the computed logits before the gating network outputs the routing weights, or hard top-k routing can be abandoned in favor of distributing tokens to all experts with fractional weights.
[0060] However, excessive noise may interfere with normal routing learning and cause the model to learn incorrect routing representations. Furthermore, assigning tokens to all experts with fractional weights violates the sparsity design principle of MoE: every token must be processed by every expert, which drastically increases the computational cost of the model. This contradicts the original intention of using MoE to improve efficiency and is impractical in inference scenarios that are sensitive to computational resources.
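The two soft schemes can be contrasted in a few lines of Python. This is a simplified sketch, assuming a fixed noise standard deviation (real systems may learn it per expert) and a single token; `noisy_logits`, `top_k_route`, and `soft_route` are hypothetical names.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_logits(logits, noise_std, rng):
    """Inject Gaussian noise into the logits before routing."""
    return [x + rng.gauss(0.0, noise_std) for x in logits]


def top_k_route(logits, k):
    """Sparse routing: only the k highest-weight experts are invoked."""
    weights = softmax(logits)
    top = sorted(range(len(weights)), key=lambda e: weights[e], reverse=True)[:k]
    return {e: weights[e] for e in top}


def soft_route(logits):
    """Dense routing: every expert receives a fractional weight."""
    return dict(enumerate(softmax(logits)))


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, 0.1]
sparse = top_k_route(noisy_logits(logits, noise_std=1.0, rng=rng), k=2)
dense = soft_route(logits)
print(len(sparse), len(dense))  # prints: 2 4
```

With four experts, the dense variant invokes all four for every token, while the sparse variant invokes two; with a large `noise_std`, the two experts chosen may no longer be the ones the clean logits favor.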
[0061] To address the aforementioned issues, this application provides a multilingual speech synthesis method and related apparatus.
[0062] Optionally, the multilingual speech synthesis method provided in this application can be applied to scenarios that require speech synthesis, such as novel reading scenarios and emotional companionship voice dialogue scenarios of virtual digital humans.
[0063] It should be noted that the above scenarios are merely examples and are not intended to limit this application.
[0064] Optionally, the multilingual speech synthesis method provided in this application can be applied to, for example, the system architecture shown in Figure 1, which includes a terminal 100 and a server 200. The server 200 may include one or more servers (Figure 1 uses one server as an illustration).
[0065] Either terminal 100 or server 200 can be used independently to execute the multilingual speech synthesis method provided in the embodiments of this application. Alternatively, terminal 100 and server 200 can also be used collaboratively to execute the multilingual speech synthesis method provided in the embodiments of this application.
[0066] The following describes the product form of the terminal 100 in Figure 1:
[0067] The terminal 100 in this application embodiment can be a mobile phone, tablet computer, wearable device, vehicle device, augmented reality (AR) / virtual reality (VR) device, laptop computer, ultra-mobile personal computer (UMPC), netbook, personal digital assistant (PDA), etc., and this application embodiment does not impose any restrictions on it.
[0068] To enable those skilled in the art to better understand this application, the multilingual speech synthesis method of the embodiments of this application will be described in detail below with reference to the accompanying drawings.
[0069] Referring to Figure 2, Figure 2 is a flowchart of a multilingual speech synthesis method provided in an embodiment of this application. As shown in Figure 2, the multilingual speech synthesis method may include:
[0070] Step S101: Obtain the target data for speech synthesis.
[0071] The target data may include at least one of the target speech and the target text.
[0072] The target speech is mainly used to provide a timbre reference for synthesized speech. Of course, the target speech can also provide other reference information, such as style reference.
[0073] The target text is used to provide the semantic content of the synthesized speech. Optionally, if the target text contains language-specific content, it can also be used to assist in identifying the language to be synthesized.
[0074] Optionally, in order to more accurately identify the target language to be synthesized, the target data may further include a language identifier of the target language to be synthesized.
[0075] Optionally, the language identifier may be the name of the target language to be synthesized, an identity identifier, etc. For example, the name of the Cantonese dialect is "Cantonese" and the identity identifier is "0".
[0076] Step S102: Invoke a speech synthesis model to generate a discrete speech token sequence based on the target data. The speech synthesis model integrates a Mixture of Experts (MoE) network. The gating network of the MoE includes a first branch network and a second branch network. The first branch network is used to generate a first expert routing result in the text feature dimension for the input hidden state, where the hidden state is the feature extracted from the target data by the network layers preceding the gating network. The second branch network is used to generate a second expert routing result in the style feature dimension for the target data. The first expert routing result and the second expert routing result jointly determine the target expert network to be activated.
[0077] Optionally, the speech synthesis model may be a speech synthesis model based on a large language model (LLM), or other speech synthesis models that include Transformer layers, or even speech synthesis models that do not include Transformer layers.
[0078] Taking a speech synthesis model based on a large language model as an example, in this embodiment the target data can first be discretized into a token sequence, which is then input into the model to obtain a discrete speech token sequence. Here, the discrete speech token sequence is the sequence of discrete semantic tokens of the synthesized speech.
[0079] As introduced in the background art, in order to enable the speech synthesis model to more accurately adapt to multi-language scenarios, the speech synthesis model can integrate a Mixture of Experts (MoE) network. For example, the MoE can be integrated into each Transformer layer of the speech synthesis model based on a large language model. Of course, the MoE can also be integrated into some Transformer layers, such as only integrating it into the last few layers of the speech synthesis model based on a large language model.
[0080] Optionally, the way to integrate the MoE into the Transformer layer can be: replacing the feed-forward network layer of the Transformer layer with the MoE.
[0081] Unlike a traditional MoE, the gating network of the MoE in this application includes two branch networks. The first branch network is a traditional gating network, which generates a first expert routing result in the text feature dimension for the input hidden state. Here, the hidden state is the feature extracted from the target data by the network layers preceding the gating network, that is, by the network layers preceding the first branch network. This hidden state contains stylistic text context information.
[0082] The first expert routing result in the text feature dimension is the importance assigned to each expert network when the expert networks are distinguished according to the routing-related text features of the target data. If the target data includes target speech, the text features of the target speech are text-like features rather than real text features. The text-like features are obtained by transforming the token distribution of the target speech into text features through a mapping function learned by the model. This mapping function maps similar audio token distributions to similar text features and different audio token distributions to different text features.
[0083] That is, in this embodiment, the first branch network learns during training to distinguish the importance of different experts from the text features (or text-like features) of the training hidden states, so that during inference it can generate the first expert routing result in the text feature dimension for the input hidden state.
[0084] The aforementioned text features refer to the multi-level linguistic features extracted from text to guide speech generation, including but not limited to the following features: phoneme features, prosodic features, and pragmatic features (such as parts of speech, syntactic structure, sentence type, etc.).
[0085] For example, the first branch network includes a first linear transformation layer. Optionally, the first linear transformation layer is a simple linear layer or a shallow network, which performs a first linear transformation on the input hidden state to obtain the first expert routing result, i.e.: $G_{\text{text}} = H W_{\text{text}} + b_{\text{text}}$, where $G_{\text{text}}$ denotes the first expert routing result, $H$ denotes the hidden state, $W_{\text{text}}$ denotes the text-level gating weight matrix, and $b_{\text{text}}$ denotes the text-level gating bias vector.
[0086] Taking an MoE that includes num_expert expert networks as an example, the dimension of the first expert routing result is [batch_size, L, num_expert], where batch_size represents the batch size and L represents the sequence length of the token sequence input to the model, i.e., the number of tokens.
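As a shape check for the first branch network, the following Python sketch applies a single linear layer (hidden state times a gating weight matrix plus a bias vector) to every token. The sizes, random initialization, and the name `linear_gate` are illustrative assumptions.

```python
import random


def linear_gate(hidden, weight, bias):
    """Apply the same linear map to every token.

    hidden: [batch_size][L][d_model], weight: [d_model][num_expert],
    bias: [num_expert]. Returns logits of shape [batch_size][L][num_expert].
    """
    return [
        [
            [
                sum(h * weight[d][e] for d, h in enumerate(token)) + bias[e]
                for e in range(len(bias))
            ]
            for token in sequence
        ]
        for sequence in hidden
    ]


random.seed(0)
batch_size, seq_len, d_model, num_expert = 2, 5, 8, 4  # illustrative sizes
hidden = [[[random.gauss(0, 1) for _ in range(d_model)]
           for _ in range(seq_len)] for _ in range(batch_size)]
weight = [[random.gauss(0, 0.1) for _ in range(num_expert)]
          for _ in range(d_model)]
bias = [0.0] * num_expert

first_routing = linear_gate(hidden, weight, bias)
print(len(first_routing), len(first_routing[0]), len(first_routing[0][0]))  # prints: 2 5 4
```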
[0087] Because the first branch network distinguishes the importance of each expert network only in the text feature dimension, it discriminates poorly between languages with similar text features, which may cause incorrect routing due to language misidentification and reduce the accuracy of speech synthesis. To distinguish the language to be synthesized more accurately, this embodiment adds a second branch network in parallel with the first branch network. Unlike the first branch network, the second branch network does not process the hidden state directly, but instead generates a second expert routing result in the style feature dimension for the target data. The reasons include, but are not limited to, the following: after the model's layer-by-layer feature extraction, the hidden state may retain only a small amount of style information, so distinguishing the importance of each expert network from it may lose discriminative power. The original target data, by contrast, contains richer style features, which identify the current language more accurately and therefore route more accurately to the matching expert network from the perspective of style features.
[0088] Optionally, the first expert routing result and the second expert routing result can both be the raw, unnormalized routing scores (logits) of each expert network, or the routing weights obtained by normalizing the logits with the softmax function.
[0089] In this embodiment, the first expert routing result and the second expert routing result jointly determine the target expert network to be activated. For example, if both are raw routing scores (logits), they can be weighted and fused to obtain fused routing weights, which include a weight for each expert network in the MoE. The k expert networks with the highest weights can then be selected as the target expert networks, where k is a preset positive integer, such as 2.
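The weighted fusion and top-k selection can be sketched for a single token as follows. The sketch assumes the style-branch logits have already been aligned to the token, uses a hypothetical fusion weight `alpha` (the source does not fix a value), and renormalizes the selected weights, which is a common but here assumed convention.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_select(text_logits, style_logits, alpha=0.5, k=2):
    """Fuse two per-expert logit vectors and pick the top-k experts.

    Returns {expert_index: weight}, with weights renormalized over the
    selected experts so that they sum to 1.
    """
    fused = [alpha * t + (1.0 - alpha) * s
             for t, s in zip(text_logits, style_logits)]
    weights = softmax(fused)
    top = sorted(range(len(weights)), key=lambda e: weights[e], reverse=True)[:k]
    norm = sum(weights[e] for e in top)
    return {e: weights[e] / norm for e in top}


# The text branch barely separates experts 0 and 1; the style branch
# clearly prefers expert 1.
text_logits = [2.0, 1.9, 0.1, 0.0]
style_logits = [0.0, 3.0, 0.2, 0.1]
selected = fuse_and_select(text_logits, style_logits, alpha=0.5, k=2)
print(sorted(selected))  # prints: [0, 1]
```

Here the text logits alone would rank expert 0 first, whereas after fusion expert 1 carries the larger weight, illustrating how the style dimension can resolve a near-tie between languages with similar text features.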
[0090] Step S103: Perform speech synthesis based on the discrete speech token sequence to obtain synthesized speech.
[0091] In this embodiment, the discrete speech token sequence can be reconstructed into synthesized speech using a decoder or vocoder. Of course, there are other ways to obtain synthesized speech, and this application does not impose any specific limitations.
[0092] The multilingual speech synthesis method provided in this application obtains target data for speech synthesis and invokes a speech synthesis model to generate a discrete speech token sequence based on the target data. Languages with similar text features often differ significantly in style expression. To make full use of these differences, this application integrates a Mixture of Experts (MoE) network into the speech synthesis model and adds a language-style-related gating network alongside the original gating network of the MoE, forming a dual-gating network that consists of a first branch network and a second branch network. The first branch network retains the original function, generating a first expert routing result in the text feature dimension for the input hidden state. The second branch network generates a second expert routing result in the style feature dimension for the target data. The first and second expert routing results jointly determine the target expert network to be activated. Finally, speech synthesis is performed based on the discrete speech token sequence to obtain synthesized speech. This application therefore exploits both the differences in text features and the differences in style expression between languages. Because both are inherent dimensions along which languages differ, routing on them yields a natural expert balance that matches the characteristics of each language. This avoids damaging the model's representational ability and the degree of expert specialization, and it improves the accuracy of expert routing and thereby of speech synthesis.
[0093] Some embodiments of this application further describe how, in step S102 above, the second branch network generates the second expert routing result in the style feature dimension for the target data.
[0094] As mentioned earlier, the first branch network is used to generate the first expert routing result in the text feature dimension. However, text features mainly contain semantic information and are insufficient in capturing acoustic features. To compensate for this deficiency, preferably, the target data includes the target speech, and the second branch network mainly generates the second expert routing result in the style feature dimension for the target speech.
[0095] In one possible implementation, Figure 3 is a schematic diagram of the second branch network. As shown in Figure 3, the second branch network includes an input layer, a reference encoder, a perceptual resampler, and an output layer.
[0096] The process by which the second branch network generates a second expert routing result in terms of style features for the target speech can include: preprocessing the target speech through the input layer to obtain its Mel spectrogram; encoding the acoustic features of the Mel spectrogram through a reference encoder to obtain target acoustic features; performing style resampling based on the target acoustic features through a perceptual resampler to obtain N latent style features, where N is a positive integer; and generating the second expert routing result based on the N latent style features through the output layer. The second expert routing result includes the routing results of each of the N latent style features corresponding to a specific expert network. Taking MoE as an example, which includes num_expert expert networks, the dimension of the second expert routing result is [batch_size, N, num_expert].
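The shape bookkeeping in the paragraph above can be illustrated with a minimal sketch. It assumes the output layer is a single linear projection from each latent style feature to num_expert logits; the source does not state the internal form of the output layer, and the names `linear`, `style_routing`, and `d_style` are hypothetical.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (plain lists, no framework)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def style_routing(latent_styles, weight, bias):
    """Map [batch_size, N, d_style] latent style features to [batch_size, N, num_expert]."""
    return [[linear(feat, weight, bias) for feat in sample] for sample in latent_styles]

random.seed(0)
batch_size, N, d_style, num_expert = 2, 3, 4, 5  # illustrative sizes only

latent = [[[random.gauss(0, 1) for _ in range(d_style)] for _ in range(N)]
          for _ in range(batch_size)]
W = [[random.gauss(0, 0.1) for _ in range(d_style)] for _ in range(num_expert)]
b = [0.0] * num_expert

second_routing = style_routing(latent, W, b)
shape = (len(second_routing), len(second_routing[0]), len(second_routing[0][0]))
print(shape)  # (2, 3, 5)
```

Each of the N latent style features receives its own row of expert logits, which is why the sequence-length axis of the second expert routing result has size N.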
[0097] Optionally, the reference encoder can be composed of k convolutional neural networks and a pooling layer cascaded together, each convolutional neural network including convolution, batch normalization and ReLU activation functions.
[0098] Therefore, the process of "encoding the acoustic features of the Mel spectrogram using a reference encoder to obtain the target acoustic features" can include: downsampling the input acoustic features through each of one or more sequentially connected convolutional neural networks to obtain output acoustic features, where the input acoustic features of the first convolutional neural network are the Mel spectrogram and the input acoustic features of each subsequent convolutional neural network are the output acoustic features of the preceding one; and performing global average pooling on the output acoustic features of the last convolutional neural network through a pooling layer to obtain the target acoustic features.
[0099] In other words, this embodiment can extract acoustic features of different granularities from the Mel spectrogram by using k convolutional neural networks to downsample layer by layer. The final output of the k convolutional neural networks is then used to obtain acoustic features of fixed dimensions through global average pooling, which serve as the target acoustic features extracted from the Mel spectrogram by the reference encoder.
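The layer-by-layer downsampling followed by global average pooling can be sketched as follows. This is a simplified illustration, not the reference encoder itself: it assumes a strided 1-D convolution over the time axis with one fixed kernel shared by all channels, and it omits batch normalization. The function names and the toy Mel spectrogram are hypothetical.

```python
def conv_block(frames, kernel, stride=2):
    """One simplified block: strided 1-D convolution over time, then ReLU.

    frames: list of T frames, each a list of C channel values.
    kernel: list of K scalar taps shared by all channels (a simplification).
    """
    K, C = len(kernel), len(frames[0])
    out = []
    for start in range(0, len(frames) - K + 1, stride):
        frame = []
        for c in range(C):
            acc = sum(kernel[k] * frames[start + k][c] for k in range(K))
            frame.append(max(0.0, acc))  # ReLU activation
        out.append(frame)
    return out

def global_average_pool(frames):
    """Average over the time axis, giving one fixed-size vector of C values."""
    C = len(frames[0])
    return [sum(f[c] for f in frames) / len(frames) for c in range(C)]

def reference_encoder(mel, kernels):
    """Chain k blocks (each fed by the previous block's output), then pool."""
    x = mel
    for kernel in kernels:
        x = conv_block(x, kernel)
    return global_average_pool(x)

# Toy "Mel spectrogram": 16 frames, 3 bins per frame.
mel = [[float(t + c) for c in range(3)] for t in range(16)]
kernels = [[0.5, 0.5], [0.5, 0.5]]  # k = 2 blocks

target = reference_encoder(mel, kernels)
print(target)  # [7.5, 8.5, 9.5]
```

The time axis shrinks from 16 to 8 to 4 frames, and the pooled output has a fixed length of 3 regardless of how many frames the input has.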
[0100] A speaker may change style within a single utterance; for example, the first sentence may be in a sad style and the second sentence in an angry style. To capture the multiple latent styles that the target speech may have, this embodiment collects N latent style features through a perceptual resampler.
[0101] As shown in Figure 3, the perceptual resampler optionally includes an attention layer, a first residual layer, a feedforward network layer, and a second residual layer. Based on this, the process of "resampling style based on target acoustic features using the perceptual resampler to obtain N latent style features" can include: performing cross-attention calculation on the target acoustic features and N learnable query vectors through the attention layer to obtain a query attention representation output by the attention layer; performing residual calculation on the target acoustic features and the query attention representation through the first residual layer to obtain a residual fusion representation; performing feedforward calculation on the residual fusion representation through the feedforward network layer to obtain a feedforward enhanced representation; and performing residual calculation on the residual fusion representation and the feedforward enhanced representation through the second residual layer to obtain the N latent style features.
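The four resampler steps can be sketched in plain Python as below. Several points are assumptions rather than details from the source: the target acoustic features are passed as M vectors of width d (M = 1 corresponds to a single pooled vector), the attention uses no learned projections, and the first residual adds the mean acoustic vector to each query output, because the source does not say how the two shapes are reconciled. All names are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, acoustic):
    """N queries attend over M acoustic vectors (projections omitted for brevity)."""
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, frame)) / math.sqrt(d) for frame in acoustic]
        weights = softmax(scores)
        attended.append([sum(w * frame[j] for w, frame in zip(weights, acoustic))
                         for j in range(d)])
    return attended

def feed_forward(x, w_in, w_out):
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def perceptual_resampler(acoustic, queries, w_in, w_out):
    d = len(acoustic[0])
    attended = cross_attention(queries, acoustic)                # attention layer, [N, d]
    # Assumption: the acoustic features enter the first residual as their mean vector.
    pooled = [sum(frame[j] for frame in acoustic) / len(acoustic) for j in range(d)]
    fused = [[p + a for p, a in zip(pooled, row)] for row in attended]   # first residual
    enhanced = [feed_forward(row, w_in, w_out) for row in fused]         # feedforward layer
    return [[f + e for f, e in zip(f_row, e_row)]                        # second residual
            for f_row, e_row in zip(fused, enhanced)]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
M, N, d, d_hidden = 6, 4, 8, 16      # illustrative sizes only
acoustic = rand_matrix(M, d)
queries = rand_matrix(N, d)          # stands in for the N learnable query vectors
w_in = rand_matrix(d_hidden, d)
w_out = rand_matrix(d, d_hidden)

latent_styles = perceptual_resampler(acoustic, queries, w_in, w_out)
print(len(latent_styles), len(latent_styles[0]))  # 4 8
```

The number of output vectors is fixed by the number of queries, N, not by the length of the input speech, which is what lets the second branch emit a routing result of fixed sequence length.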
[0102] It should be noted that the structures of the reference encoder and the perceptual resampler shown in Figure 3 are only examples, and other structures can be used. For the reference encoder, any structure that can extract audio acoustic features from the Mel spectrogram is acceptable. Similarly, for the perceptual resampler, any structure that can extract audio style information based on audio acoustic features is acceptable. For example, optionally, normalization layers can be connected after the first residual layer and the second residual layer respectively to perform layer normalization on the output of each residual layer.
[0103] This embodiment combines the convolutional feature extraction of the reference encoder with the cross-attention feature extraction of the perceptual resampler. In a manner that enhances prosody modeling, it can extract fine-grained acoustic style features, such as language style features and rhythmic differences, from the target speech. This effectively compensates for the fact that text features mainly contain semantic information and are insufficient in capturing acoustic features. Using multi-dimensional information improves the accuracy of expert routing and the natural balance of experts, thereby improving the accuracy of speech synthesis.
[0104] In another possible implementation, while language differentiation based on both textual and style features can lead to more accurate routing, the expert routing results in both dimensions are modeling results, and the modeling process is prone to errors, resulting in routing mistakes. To provide the gating network with a more accurate routing direction, this embodiment of the application may optionally provide a third branch network. Correspondingly, the target data needs to include the language identifier of the language to be synthesized. Then, the third branch network can generate a third expert routing result based on the language identifier in the language feature dimension. Since the language to be synthesized is unique, the dimension of the third expert routing result is [batch_size, 1, num_expert].
[0105] Optionally, the third branch network can adopt a structure similar to that of the first branch network. That is, the third branch network can include an embedding layer and a second linear transformation layer. Then, the process of the third branch network generating the third expert routing result in the language feature dimension based on the language identifier can include: converting the language identifier into a language vector representation through the embedding layer, and performing a second linear transformation on the language vector representation through the second linear transformation layer to obtain the third expert routing result.
[0106] Specifically, in this embodiment, a trainable language embedding layer can be created in the third branch network. The size of the embedding layer can be [V, D], where V represents the total number of supported languages and D represents the dimension of the embedding vector, i.e., num_expert.
[0107] Optionally, the language identifier can be a dialect identifier and/or a language identifier. For example, language identifiers can be 0 for Chinese, 1 for English, 2 for German, and so on. Taking Chinese dialects as an example, dialect identifiers can be 0 for Cantonese, 1 for Sichuanese, 2 for Shanghainese, 3 for Minnan, and so on.
[0108] The aforementioned discrete integer identifiers can be transformed into dense, optimizable language vector representations through the embedding layer. Then, after further transformation by the second linear transformation layer, the language vector representation is mapped to a routing space with the same expert dimension (num_expert) as the other expert routing results, and the third expert routing result is obtained as follows: $G_{3} = W_{lang} \, e_{lang} + b_{lang}$, where $G_{3}$ denotes the third expert routing result, $e_{lang}$ denotes the language vector representation, $W_{lang}$ denotes the language-level gating weight matrix, and $b_{lang}$ denotes the language-level gating bias vector.
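A minimal sketch of the third branch, an embedding lookup followed by a linear transformation, is given below. The table sizes, the random initialization, and the function name `third_branch` are illustrative assumptions; only the [V, D] embedding size with D equal to num_expert and the output shape [batch_size, 1, num_expert] come from the text.

```python
import random

random.seed(0)
V, num_expert = 3, 4          # V supported languages, e.g. 0 = Chinese, 1 = English, 2 = German
D = num_expert                # embedding width, as stated in the text

embedding = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]      # [V, D]
weight = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(num_expert)]
bias = [0.0] * num_expert

def third_branch(lang_ids, embedding, weight, bias):
    """lang_ids: batch_size integer identifiers -> [batch_size, 1, num_expert]."""
    result = []
    for lang in lang_ids:
        e = embedding[lang]  # embedding lookup: one row of the table
        logits = [sum(w * v for w, v in zip(row, e)) + b for row, b in zip(weight, bias)]
        result.append([logits])  # sequence length is 1: one language per sample
    return result

routing = third_branch([0, 0, 1], embedding, weight, bias)
print(len(routing), len(routing[0]), len(routing[0][0]))  # 3 1 4
```

Because the result depends only on the identifier, two samples of the same language always receive identical language-level routing logits, which is the property the text relies on for stable early routing.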
[0109] In the case where the gated network of MoE includes a first branch network, a second branch network, and a third branch network, this embodiment can determine the target expert network by the first expert routing result, the second expert routing result, and the third expert routing result.
[0110] This embodiment injects language identifiers as strong priors into the gating network, providing clearer directional guidance for routing decisions. Thus, in the early stages of training, without needing to explore language classification from scratch, samples of the same language are more likely to be routed to the same expert set, fundamentally avoiding the load imbalance problem caused by blind routing. Furthermore, routing samples of the same language to the same expert set allows each expert network to focus on finer-grained acoustic variations and routing strategies within the same language, accelerating the model's convergence process, shortening training time, and fostering a more stable and specialized expert network. This overall improves the stability and naturalness of the final synthesized speech.
[0111] Furthermore, the routing method that adds language identifiers is a natural balancing strategy that matches the data distribution of the language. It does not bring forced balancing like auxiliary loss, nor does it require all expert networks to be activated. It fully preserves the sparse activation characteristics of MoE and does not introduce any additional computational overhead during inference, achieving a perfect balance between optimization and efficiency.
[0112] In another possible implementation, considering that style features contain richer information than language features, the style feature dimension is more important than the language feature dimension during expert routing. To reflect this importance, this embodiment can make N greater than 1 when the gated network of MoE includes a third branch network, that is, the dimension of the second expert routing result is [batch_size, N, num_expert] (N>1), while the dimension of the third expert routing result is [batch_size, 1, num_expert].
[0113] Based on this, the process of "the first expert routing result, the second expert routing result, and the third expert routing result jointly determining the target expert network" may include: concatenating the third expert routing result with the second expert routing result in the sequence length dimension where N is located to obtain the concatenated routing result; performing a sequence length dimension mapping transformation on the concatenated routing result to align it with the first expert routing result in the sequence length dimension to obtain the transformed routing result; performing a weighted fusion of the transformed routing result and the first expert routing result to obtain the fused routing result; and determining the target expert network based on the fused routing result.
[0114] In other words, to emphasize that style features are more important than language features, the third and second expert routing results can be concatenated along the sequence length dimension (where N is located) to obtain a concatenated routing result with dimensions [batch_size, N+1, num_expert], in which the style features occupy N of the N+1 positions and the language feature occupies one. The concatenated routing result and the first expert routing result then share the total weight. Since the concatenated routing result has dimensions [batch_size, N+1, num_expert] while the first expert routing result has dimensions [batch_size, L, num_expert], they are not aligned along the sequence length dimension and cannot be fused directly. Therefore, a mapping transformation of the sequence length dimension is needed to obtain a transformed routing result with the same dimensions, [batch_size, L, num_expert]. Finally, the transformed routing result is weighted and fused with the first expert routing result to obtain the fused routing result.
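The concatenate, align, and fuse sequence can be traced on toy data as follows. The mapping transformation is assumed here to be a linear map of size [L, N+1] applied along the sequence axis, filled with uniform weights for illustration; the source does not specify its form. The fusion shown is plain addition, and all function names are hypothetical.

```python
def concat_along_sequence(third, second):
    """[B, 1, E] and [B, N, E] -> [B, N+1, E] (list concatenation per sample)."""
    return [t + s for t, s in zip(third, second)]

def map_sequence_length(routing, mapping):
    """Apply an [L, N+1] matrix along the sequence axis: [B, N+1, E] -> [B, L, E]."""
    out = []
    for sample in routing:
        num_expert = len(sample[0])
        out.append([[sum(m * sample[j][e] for j, m in enumerate(row))
                     for e in range(num_expert)] for row in mapping])
    return out

def add_routing(a, b):
    """Element-wise sum of two [B, L, E] routing results."""
    return [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(sa, sb)]
            for sa, sb in zip(a, b)]

B, N, L, E = 1, 2, 4, 3                                # illustrative sizes only
second = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]           # [B, N, E] style routing
third = [[[0.0, 0.0, 3.0]]]                             # [B, 1, E] language routing
first = [[[0.0] * E for _ in range(L)] for _ in range(B)]   # [B, L, E] text routing
mapping = [[1.0 / (N + 1)] * (N + 1) for _ in range(L)]     # assumed [L, N+1] map

concatenated = concat_along_sequence(third, second)     # [B, N+1, E]
transformed = map_sequence_length(concatenated, mapping)  # [B, L, E]
fused = add_routing(transformed, first)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 1 4 3
```

In a trained model the mapping matrix would presumably be learned; the uniform matrix only makes the shapes easy to follow.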
[0115] One possible weighted fusion method is to directly add the transformed routing result to the first expert routing result, and use the sum as the fused routing result.
[0116] Optionally, in order to adapt to various multilingual speech synthesis scenarios and thus more finely control the contribution of the two information sources to the final decision, this embodiment can introduce a learnable gating scalar $\alpha$ (optionally, with an initial value of 0.5), or a more complex attention network, to achieve adaptive weighted fusion of the transformed routing result and the first expert routing result, for example: $G_{fused} = \alpha \cdot G_{1} + (1 - \alpha) \cdot G_{trans}$, where $G_{fused}$ denotes the fused routing result, $G_{1}$ denotes the first expert routing result, $G_{trans}$ denotes the transformed routing result, and $\alpha$ denotes the learnable gating scalar, which is a parameter that can be learned through model training.
[0117] By using a learnable gating scalar, the model can automatically adjust its reliance on transient text features, language priors, and style features as training progresses. For example, in the early stages of training, the model may rely more on language priors to quickly stabilize routing, so the gating scalar can be set to a relatively small value to increase the model's reliance on language priors. In the later stages of training, when the experts have become highly specialized, the gating scalar can be adjusted according to the model's reliance on text features and style features, allowing finer-grained routing adjustments. This adaptive capability makes the entire speech synthesis system more intelligent and robust.
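The effect of the gating scalar on the final expert choice can be sketched as below. Two things are assumptions: the fusion takes the convex form alpha * first + (1 - alpha) * transformed, which is one reading consistent with the remark that a smaller scalar increases reliance on language priors, and the target expert is chosen by top-k selection over the fused logits, which is common in MoE layers but is not spelled out in the text. Here `alpha` is a plain float rather than a trained parameter.

```python
def fuse(first, transformed, alpha):
    """Element-wise alpha * first + (1 - alpha) * transformed over [B, L, E] lists."""
    return [[[alpha * a + (1.0 - alpha) * t for a, t in zip(row_f, row_t)]
             for row_f, row_t in zip(sample_f, sample_t)]
            for sample_f, sample_t in zip(first, transformed)]

def top_k_experts(logits, k):
    """Indices of the k largest logits, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]

first = [[[2.0, 0.0, 0.0]]]        # text-dimension routing prefers expert 0
transformed = [[[0.0, 0.0, 4.0]]]  # style/language routing prefers expert 2

for alpha in (0.1, 0.5, 0.9):
    fused = fuse(first, transformed, alpha)
    print(alpha, top_k_experts(fused[0][0], k=1))
# 0.1 [2]
# 0.5 [2]
# 0.9 [0]
```

With a small scalar the style and language signal decides the expert; as the scalar grows, the text-dimension routing takes over.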
[0118] In one optional embodiment, the training process of the speech synthesis model provided in this application includes two stages. In the first stage, a large-scale speech-text-language tag pairing dataset covering all languages is used to train the MoE-LLM speech synthesis model, which integrates a language- and style-aware gating network, end-to-end. The training objective remains the standard autoregressive speech tag prediction loss (such as cross-entropy loss). In this stage, the three branches of the gating network, all expert networks, and other parameters of the LLM participate in the learning process. The injection of language tags serves as a strong supervisory signal, guiding the gating network to quickly and accurately direct samples from different languages to different experts, effectively suppressing the load imbalance in the early stages of training and accelerating the expert specialization process. At the same time, style features, as highly discriminative information, can also guide the gating network to route accurately, thereby improving the accuracy of speech synthesis.
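As a reminder of what the cross-entropy objective computes for autoregressive prediction of discrete speech markers, the sketch below averages the per-step negative log-likelihood of the correct marker. It is a generic illustration of cross-entropy, not the training code of this application; the function names and toy logits are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def sequence_loss(step_logits, targets):
    """Mean cross-entropy over the positions of one marker sequence."""
    return sum(cross_entropy(l, t) for l, t in zip(step_logits, targets)) / len(targets)

# Toy vocabulary of 4 discrete speech markers, sequence of 2 steps.
step_logits = [[0.0, 0.0, 0.0, 0.0],   # uniform prediction
               [5.0, 0.0, 0.0, 0.0]]   # confident prediction of marker 0
targets = [2, 0]

loss = sequence_loss(step_logits, targets)
print(round(loss, 4))
```

A uniform prediction over 4 markers costs log 4 at that step, while a confident correct prediction costs close to zero, so the loss falls as the model learns to predict the next marker.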
[0119] The second stage is the fine-tuning stage, which involves using speaker data in the target language to fine-tune the model. Here, speaker data refers to a high-quality speech-text-language identifier pairing dataset with pure pronunciation and rich prosodic style.
[0120] The above describes a multilingual speech synthesis method provided by the embodiments of this application. The following will describe the apparatus for performing the above multilingual speech synthesis method.
[0121] Referring to Figure 4, Figure 4 is a schematic diagram of the structure of a multilingual speech synthesis device provided in an embodiment of this application. As shown in Figure 4, the multilingual speech synthesis device may include:
[0122] Data input unit 401 is used to acquire target data for speech synthesis;
[0123] The data processing unit 402 is used to call the speech synthesis model to generate a speech discrete marker sequence based on the target data. The speech synthesis model integrates a hybrid expert network MoE. The gate network of MoE includes a first branch network and a second branch network. The first branch network is used to generate a first expert routing result in the text feature dimension for the hidden state of the input. The hidden state is the feature extracted from the target data by the network layers before the gate network. The second branch network is used to generate a second expert routing result in the style feature dimension for the target data. The first expert routing result and the second expert routing result together determine the target expert network to be activated.
[0124] The data decoding unit 403 is used to perform speech synthesis based on the discrete speech marker sequence to obtain synthesized speech.
[0125] Each module in the aforementioned multilingual speech synthesis device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in the processor of a computer device in hardware form or independent of it, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.
[0126] This application also provides an electronic device, which may include at least one processor and a memory connected to the processor, wherein:
[0127] Memory is used to store computer programs;
[0128] The processor is used to execute computer programs to enable electronic devices to implement any of the multilingual speech synthesis methods provided in the embodiments of this application.
[0129] Referring to Figure 5, Figure 5 shows a structural schematic diagram suitable for implementing the electronic device in the embodiments of this application. The electronic device in the embodiments of this application may include, but is not limited to, mobile terminals such as mobile phones, laptops, PDAs (personal digital assistants), and PADs (tablet computers), and fixed terminals such as desktop computers. The electronic device shown in Figure 5 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.
[0130] As shown in Figure 5, the electronic device may include a processing unit (e.g., a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. When the electronic device is powered on, the RAM 603 also stores various programs and data required for the operation of the electronic device. The processing unit 601, ROM 602, and RAM 603 are interconnected via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
[0131] Typically, the following devices can be connected to the I/O interface 605: input devices 606 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 607 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 608 including, for example, memory cards, hard drives, etc.; and communication devices 609. The communication device 609 allows the electronic device to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 5 shows an electronic device with various devices, it should be understood that it is not required to implement or have all of the devices shown. More or fewer devices may alternatively be implemented or provided.
[0132] This application also provides a computer program product including computer-readable instructions, which, when executed on an electronic device, cause the electronic device to implement any of the multilingual speech synthesis methods provided in this application.
[0133] This application also provides a computer-readable storage medium that carries one or more computer programs. When the one or more computer programs are executed by an electronic device, the electronic device can implement any of the multilingual speech synthesis methods provided in this application.
[0134] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the device embodiment drawings provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.
[0135] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, training equipment, or network device, etc.) to execute the methods described in the various embodiments of this application.
[0136] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.
[0137] The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).
Claims
1. A multilingual speech synthesis method, characterized in that it includes: obtaining target data for speech synthesis; invoking a speech synthesis model to generate a discrete speech marker sequence based on the target data, wherein the speech synthesis model integrates a hybrid expert network (MoE), the gated network of the MoE includes a first branch network and a second branch network, the first branch network is used to generate a first expert routing result in the text feature dimension for the hidden state of the input, the hidden state is the feature extracted from the target data by the network layers before the gated network, the second branch network is used to generate a second expert routing result in the style feature dimension for the target data, and the first expert routing result and the second expert routing result together determine the target expert network to be activated; and performing speech synthesis based on the discrete speech marker sequence to obtain synthesized speech.
2. The multilingual speech synthesis method according to claim 1, characterized in that the target data includes target speech, and the second branch network generating a second expert routing result in the style feature dimension for the target speech includes: preprocessing the target speech through an input layer to obtain the Mel spectrogram of the target speech; encoding the acoustic features of the Mel spectrogram through a reference encoder to obtain target acoustic features; performing style resampling based on the target acoustic features through a perceptual resampler to obtain N latent style features, where N is a positive integer; and generating the second expert routing result based on the N latent style features through an output layer, wherein the second expert routing result includes the routing result of each of the N latent style features for each expert network.
3. The multilingual speech synthesis method according to claim 2, characterized in that, The step of encoding the acoustic features of the Mel spectrogram using a reference encoder to obtain the target acoustic features includes: The input acoustic features are downsampled by each of the convolutional neural networks in one or more sequentially connected convolutional neural networks to obtain the output acoustic features. The input acoustic features of any convolutional neural network are the output acoustic features of the preceding convolutional neural network, and the input acoustic features of the first convolutional neural network are the Mel spectrogram. The target acoustic features are obtained by performing global average pooling on the output acoustic features of the last convolutional neural network through a pooling layer.
4. The multilingual speech synthesis method according to claim 2, characterized in that performing style resampling based on the target acoustic features through a perceptual resampler to obtain N latent style features includes: performing cross-attention calculation on the target acoustic features and N learnable query vectors through an attention layer to obtain the query attention representation output by the attention layer; performing residual calculation on the target acoustic features and the query attention representation through a first residual layer to obtain a residual fusion representation; performing feedforward calculation on the residual fusion representation through a feedforward network layer to obtain a feedforward enhanced representation; and performing residual calculation on the residual fusion representation and the feedforward enhanced representation through a second residual layer to obtain the N latent style features.
5. The multilingual speech synthesis method according to claim 2, characterized in that, The gated network of the MoE also includes a third branch network, and the target data also includes a language identifier of the language to be synthesized. The third branch network is used to generate a third expert routing result in the language feature dimension based on the language identifier. The first expert routing result and the second expert routing result jointly determine the target expert network to be activated, including: The first expert routing result, the second expert routing result, and the third expert routing result together determine the target expert network.
6. The multilingual speech synthesis method according to claim 5, characterized in that, The third branch network generates a third expert routing result based on the language identifier in the language feature dimension, including: The language identifier is converted into a language vector representation through an embedding layer; The language vector representation is subjected to a second linear transformation through a second linear transformation layer to obtain the third expert routing result.
7. The multilingual speech synthesis method according to claim 6, characterized in that, When the gated network of the MoE includes the third branch network, N is greater than 1; The first expert routing result, the second expert routing result, and the third expert routing result jointly determine the target expert network, including: The third expert routing result and the second expert routing result are concatenated along the sequence length dimension where N is located to obtain the concatenated routing result. The concatenated routing result is transformed by mapping the sequence length dimension to align with the first expert routing result in the sequence length dimension, thus obtaining the transformed routing result; The transformed routing result and the first expert routing result are weighted and fused to obtain the fused routing result; The target expert network is determined based on the fused routing results.
8. The multilingual speech synthesis method according to any one of claims 5-7, characterized in that, The language identifier includes dialect identifiers and / or language identifiers.
9. A computer program product, characterized in that, It includes computer-readable instructions that, when executed on an electronic device, cause the electronic device to implement the multilingual speech synthesis method as described in any one of claims 1 to 8.
10. An electronic device, characterized in that, It includes at least one processor and a memory connected to the processor, wherein: The memory is used to store computer programs; The processor is used to execute the computer program to enable the electronic device to implement the multilingual speech synthesis method as described in any one of claims 1 to 8.
11. A computer storage medium, characterized in that, The storage medium carries one or more computer programs that, when executed by an electronic device, enable the electronic device to implement the multilingual speech synthesis method as described in any one of claims 1 to 8.