Text-to-speech method and apparatus, and electronic device and storage medium
By training an emotion coding model and utilizing minimum mutual information loss and contrastive learning loss, irrelevant information is extracted and removed to construct fine-grained emotion features. This solves the problem of insufficient emotion control in existing speech synthesis models and achieves efficient speech synthesis control and flexibility.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- IFLYTEK CO LTD
- Filing Date
- 2025-07-11
- Publication Date
- 2026-07-02
AI Technical Summary
Existing speech synthesis models lack fine-grained control over emotions, resulting in insufficient control over speech synthesis.
By training an emotion coding model and utilizing minimum mutual information loss and contrastive learning loss, irrelevant information in speech and text content is extracted and removed, fine-grained emotion features are constructed, and the target speech is generated by combining speech feature extraction and decoding networks.
It achieves fine-grained control over emotions, improves the control and flexibility of speech synthesis, and can generate target speech with specific emotions based on user input.
Abstract
Description
Speech synthesis methods, devices, electronic devices and storage media
[0001] Cross-reference of related applications
[0002] This application claims priority to Chinese Patent Application No. 202411906533.6, filed on December 23, 2024, entitled “Speech Synthesis Method, Apparatus, Electronic Device and Storage Medium”, which is incorporated herein by reference in its entirety.
Technical Field
[0003] This disclosure relates to the field of speech technology, and in particular to a speech synthesis method, apparatus, electronic device, and storage medium.
Background Technology
[0004] In recent years, text-to-speech (TTS) has been widely used in scenarios such as smart assistants, speakers, in-vehicle systems, novel reading, and short video dubbing. These open-domain demands have placed new requirements on the controllability and diversity of speech synthesis.
[0005] In related technologies, high-quality audio database recordings are typically used to train speech synthesis models. However, the trained speech synthesis models lack fine-grained control over emotions, thereby reducing the controllability of speech synthesis.
Summary of the Invention
[0006] This disclosure provides a speech synthesis method, apparatus, electronic device, and storage medium to address the reduced controllability of speech synthesis in related technologies.
[0007] This disclosure provides a speech synthesis method, including the following steps.
[0008] Obtain the text to be synthesized and the emotional attributes;
[0009] The text to be synthesized and the emotional attributes are input into the speech synthesis model to obtain the target speech output by the speech synthesis model;
[0010] The speech synthesis model is trained based on the first sample text corresponding to the first sample speech and the first sample emotion features corresponding to the first sample speech. The first sample emotion features are obtained by inputting the first sample speech into the emotion coding model. The emotion coding model is trained based on the minimum mutual information loss of the target encoding and the second sample emotion features. The target encoding includes the timbre encoding of the second sample speech and / or the text content encoding of the second sample text of the second sample speech. The second sample emotion features are obtained by inputting the second sample speech into the initial emotion coding model.
[0011] According to the speech synthesis method provided in this disclosure, obtaining emotional attributes includes:
[0012] Upon receiving attribute description text input by the user, the attribute description text is input into a large text model to obtain the emotional attribute output by the large text model;
[0013] Upon receiving the emotional template speech input by the user, the emotional template speech is input into the emotional coding model to obtain the emotional features output by the emotional coding model, and the emotional features are determined as the emotional attribute.
[0014] According to a speech synthesis method provided in this disclosure, the emotion coding model is trained in the following manner:
[0015] The second sample speech is input into the speech coding network of the initial emotion coding model to obtain the semantic features output by the speech coding network;
[0016] The preset learnable emotion space and the semantic features are input into the emotion coding network of the initial emotion coding model to obtain the second sample emotion features output by the emotion coding network;
[0017] Determine the first minimum mutual information loss between the second sample emotion features and the timbre encoding, and determine the second minimum mutual information loss between the second sample emotion features and the text content encoding;
[0018] Based on the first minimum mutual information loss and the second minimum mutual information loss, the network parameters of the emotion coding network are updated to obtain the emotion coding model.
[0019] According to a speech synthesis method provided in this disclosure, updating the network parameters of the emotion coding network based on the first minimum mutual information loss and the second minimum mutual information loss to obtain the emotion coding model includes:
[0020] The emotional description text corresponding to the second sample speech is input into the description text encoding network of the initial emotional encoding model to obtain the emotional description code output by the description text encoding network;
[0021] Based on the second sample emotion features and the emotional description code, determine the contrastive learning loss;
[0022] Based on the first minimum mutual information loss, the second minimum mutual information loss, and the contrastive learning loss, the network parameters of the emotion coding network and the network parameters of the description text encoding network are updated to obtain the emotion coding model.
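The text does not specify the form of the contrastive learning loss. As a minimal sketch, assuming a common InfoNCE-style formulation in which each sample's emotion feature should be closest to the description code of the same sample, it could look like the following. The function names, the cosine similarity, and the `temperature` value are illustrative assumptions, not part of the disclosure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(emotion_features, description_codes, temperature=0.1):
    """InfoNCE-style loss (assumed form): pair i is the positive for sample i,
    and every other description code in the batch serves as a negative."""
    n = len(emotion_features)
    loss = 0.0
    for i in range(n):
        logits = [
            cosine(emotion_features[i], description_codes[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n


# Toy batch: aligned pairs give a small loss, swapped pairs a large one.
features = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(features, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(features, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is small when each emotion feature matches its own description code and large when the codes are swapped, which is the behavior a contrastive criterion is meant to reward.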
[0023] According to a speech synthesis method provided in this disclosure, the speech synthesis model is trained in the following manner:
[0024] The first sample text and the first sample emotion features are input into the speech feature extraction network of the initial speech synthesis model to obtain the sample speech features output by the speech feature extraction network.
[0025] The sample speech features are input into the speech decoding network of the initial speech synthesis model to obtain the first predicted speech output by the speech decoding network;
[0026] Based on the first predicted speech and the first sample speech, the network parameters of the speech feature extraction network and the network parameters of the speech decoding network are updated to obtain the speech synthesis model.
[0027] According to a speech synthesis method provided in this disclosure, the step of inputting the text to be synthesized and the emotional attribute into a speech synthesis model to obtain the target speech output by the speech synthesis model includes:
[0028] The text to be synthesized, the emotional attribute, and other attributes are input into the speech synthesis model to obtain the target speech output by the speech synthesis model; the other attributes include at least one of the following: target speech environment, target sound quality level, target language, and target speech style; the sample speech features are obtained by inputting the first sample text, the first sample emotional feature, and other sample attributes corresponding to the first sample speech into the speech feature extraction network.
[0029] The other sample attributes include at least one of the following: the sample speech environment corresponding to the first sample speech, the sample sound quality level corresponding to the first sample speech, the sample language corresponding to the first sample speech, and the sample speech style corresponding to the first sample speech.
[0030] According to a speech synthesis method provided in this disclosure, the step of updating the network parameters of the speech feature extraction network and the network parameters of the speech decoding network based on the first predicted speech and the first sample speech to obtain the speech synthesis model includes:
[0031] Based on the first predicted speech and the first sample speech, the network parameters of the speech feature extraction network and the network parameters of the speech decoding network are updated to obtain a reference speech synthesis model.
[0032] Obtain the third sample text corresponding to the third sample speech and the third sample emotional features corresponding to the third sample speech, wherein the third sample speech is the speech recorded by the sample object in the sample location;
[0033] The third sample text, the third sample emotional features, and the identifier of the sample object are input into the reference speech synthesis model to obtain the second predicted speech output by the reference speech synthesis model.
[0034] Based on the second predicted speech and the third sample speech, the network parameters of the speech feature extraction network and the speech decoding network in the reference speech synthesis model are updated to obtain the speech synthesis model.
[0035] This disclosure also provides a speech synthesis apparatus, including:
[0036] An acquisition unit is used to acquire the text to be synthesized and the emotional attributes;
[0037] A synthesis unit is used to input the text to be synthesized and the emotional attributes into a speech synthesis model to obtain the target speech output by the speech synthesis model.
[0038] The speech synthesis model is trained based on the first sample text corresponding to the first sample speech and the first sample emotion features corresponding to the first sample speech. The first sample emotion features are obtained by inputting the first sample speech into the emotion coding model. The emotion coding model is trained based on the minimum mutual information loss of the target encoding and the second sample emotion features. The target encoding includes the timbre encoding of the second sample speech and / or the text content encoding of the second sample text of the second sample speech. The second sample emotion features are obtained by inputting the second sample speech into the initial emotion coding model.
[0039] This disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement any of the above-described speech synthesis methods.
[0040] This disclosure also provides a non-transitory computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements the speech synthesis method as described above.
[0041] This disclosure also provides a computer program product, including a computer program that, when executed by a processor, implements the speech synthesis method as described above.
[0042] The speech synthesis method, apparatus, electronic device, and storage medium disclosed herein input the acquired text to be synthesized and its emotional attributes into a trained speech synthesis model to obtain the target speech output by the speech synthesis model. The speech synthesis model is trained based on a first sample text corresponding to a first sample speech and first sample emotional features corresponding to the first sample speech. The first sample emotional features are obtained by inputting the first sample speech into an emotional coding model. The emotional coding model is trained based on the minimum mutual information loss of the target coding and the second sample emotional features. The target coding includes the timbre coding of the second sample speech and / or the text content coding of the second sample text of the second sample speech. The second sample emotional features are obtained by inputting the second sample speech into an initial emotional coding model. It is understood that this disclosure can train an emotional coding model based on the minimum mutual information loss of the target coding and the second sample emotional features, ensuring that the emotional features output by the emotional coding model do not include irrelevant information such as timbre and text content. This allows the speech synthesis model trained based on the emotional features output by the emotional coding model and the sample text to achieve fine-grained emotional control, thereby improving the controllability of speech synthesis.
Attached Figure Description
[0043] To more clearly illustrate the technical solutions in this disclosure or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0044] Figure 1 is a schematic flowchart of the speech synthesis method provided in an embodiment of this disclosure.
[0045] Figure 2 is an example diagram of speech synthesis provided in an embodiment of this disclosure.
[0046] Figure 3 is one of the schematic diagrams of the training process of the emotion coding model provided in the embodiments of this disclosure.
[0047] Figure 4 is a schematic diagram of the public speech training set provided in an embodiment of this disclosure.
[0048] Figure 5 is a second schematic diagram of the training process of the emotion coding model provided in the embodiments of this disclosure.
[0049] Figure 6 is a schematic diagram of the framework of the initial emotion coding model provided in the embodiments of this disclosure.
[0050] Figure 7 is one of the schematic diagrams of the training process of the speech synthesis model provided in the embodiments of this disclosure.
[0051] Figure 8 is a schematic diagram of the framework of the initial speech synthesis model provided in the embodiments of this disclosure.
[0052] Figure 9 is a second schematic diagram of the training process of the speech synthesis model provided in the embodiments of this disclosure.
[0053] Figure 10 is a schematic diagram of the structure of the speech synthesis device provided in the embodiments of this disclosure.
[0054] Figure 11 is a schematic diagram of the physical structure of the electronic device provided in an embodiment of this disclosure.
Detailed Implementation
[0055] To make the objectives, technical solutions, and advantages of this disclosure clearer, the technical solutions of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.
[0056] The speech synthesis method of this disclosure is described below with reference to Figures 1-9. The subject executing this speech synthesis method can be an electronic device such as a terminal, tablet computer, or computer, or a speech synthesis device installed in the electronic device. The speech synthesis device can be implemented through software, hardware, or a combination of both.
[0057] Figure 1 is a flowchart illustrating the speech synthesis method provided in this embodiment of the present disclosure. As shown in Figure 1, the speech synthesis method includes the following steps:
[0058] Step 101: Obtain the text to be synthesized and its emotional attributes.
[0059] For example, when a user has a need for speech synthesis, the text to be synthesized and the emotional attributes can be input into the electronic device, so that the electronic device can obtain the text to be synthesized and the emotional attributes. Alternatively, the electronic device can retrieve the text to be synthesized and the emotional attributes stored in memory. For example, the emotional attributes include sadness or happiness.
[0060] Step 102: Input the text to be synthesized and the emotional attributes into the speech synthesis model to obtain the target speech output by the speech synthesis model.
[0061] The speech synthesis model is trained based on the first sample text corresponding to the first sample speech and the first sample emotion features corresponding to the first sample speech. The first sample emotion features are obtained by inputting the first sample speech into the emotion coding model. The emotion coding model is trained based on the minimum mutual information loss of the target encoding and the second sample emotion features. The target encoding includes the timbre encoding of the second sample speech and / or the text content encoding of the second sample text of the second sample speech. The second sample emotion features are obtained by inputting the second sample speech into the initial emotion coding model.
[0062] For example, when the text to be synthesized and the emotional attributes are obtained, the text to be synthesized and the emotional attributes are input into the trained speech synthesis model. The speech synthesis model generates the target speech based on the text to be synthesized and the emotional attributes and outputs it, so that the output target speech includes not only the semantics of the text to be synthesized, but also the emotion corresponding to the emotional attributes.
[0063] It should be noted that the identifier of the target speaker can also be input into the speech synthesis model, so that the speech synthesis model finally outputs the target speech using the timbre of the target speaker. This disclosure does not limit this.
[0064] The speech synthesis method disclosed herein inputs the acquired text to be synthesized and its emotional attributes into a trained speech synthesis model to obtain the target speech output by the speech synthesis model. The speech synthesis model is trained based on the first sample text corresponding to the first sample speech and the first sample emotional features corresponding to the first sample speech. The first sample emotional features are obtained by inputting the first sample speech into an emotional encoding model. The emotional encoding model is trained based on the minimum mutual information loss of the target encoding and the second sample emotional features. The target encoding includes the timbre encoding of the second sample speech and / or the text content encoding of the second sample text. The second sample emotional features are obtained by inputting the second sample speech into an initial emotional encoding model. Therefore, this disclosure can train an emotional encoding model based on the minimum mutual information loss of the target encoding and the second sample emotional features, ensuring that the emotional features output by the emotional encoding model do not include irrelevant information such as timbre and text content. This allows the speech synthesis model trained based on the emotional features output by the emotional encoding model and the sample text to achieve fine-grained emotional control, thereby improving the controllability of speech synthesis.
[0065] In one embodiment, obtaining the emotional attribute in step 101 above can be achieved in the following way:
[0066] The first method involves receiving attribute description text input by the user, inputting the attribute description text into a large text model, and obtaining the emotional attribute output by the large text model.
[0067] For example, the user inputs attribute description text to control the emotional attributes of the speech synthesis. This attribute description text can be input into a large text model, which parses it and, through prompt design, outputs the emotional and other attributes contained within it. For instance, if the user inputs "Generate high-quality speech, spoken in English, news style, happy," the large text model parses it into "Sound quality level: high, Language: English, Speech style: news, Emotion: happy." For attributes not mentioned in the attribute description text, system default attributes can be used, such as "Speech environment: recording studio." Then, the parsed emotional attributes, other attributes, and default attributes are input together into the speech synthesis model to obtain the target speech output by the model.
[0068] The second method involves receiving the emotional template speech input by the user, inputting the emotional template speech into the emotional coding model, obtaining the emotional features output by the emotional coding model, and determining the emotional features as the emotional attribute.
[0069] For example, receiving an emotional template speech from a user indicates that the user wants to control the emotion of the synthesized target speech through that template. In this case, the speech environment, sound quality level, language, and speech style take the specific values specified by the user or default values. For the emotional attribute, the emotional template speech is input into the trained emotion coding model (also known as a fine-grained emotion encoding model), which extracts the emotion features from the emotional template speech; these emotion features are determined as the emotional attribute. Then, the emotional attribute and the specific or default values of the other speech attributes are input together into the speech synthesis model to obtain the target speech output by the speech synthesis model.
[0070] The third method involves the user directly inputting emotional attributes and other attributes. The electronic device then inputs the specific values of these attributes into the speech synthesis model to obtain the target speech output by the model. Attributes not specified by the user are assigned default values.
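The three methods above all end with the same step: user-supplied attributes are combined with system defaults before being passed to the speech synthesis model. A minimal sketch of that merging step follows. The attribute keys and all default values other than "recording studio" are hypothetical, and the parsed dictionary stands in for the output of the large text model.

```python
# Only "Speech environment: recording studio" is named as a default in the
# text; the remaining defaults here are placeholders for illustration.
DEFAULT_ATTRIBUTES = {
    "speech_environment": "recording studio",
    "sound_quality_level": "high",
    "language": "Chinese",
    "speech_style": "news",
    "emotion": "neutral",
}


def resolve_attributes(parsed=None, emotion_features=None, defaults=DEFAULT_ATTRIBUTES):
    """Merge user-controlled attributes with system defaults.

    parsed: attributes parsed from description text (first method) or entered
        directly by the user (third method).
    emotion_features: vector extracted from an emotional template speech
        (second method); when given, it is used as the emotional attribute.
    """
    attributes = dict(defaults)
    if parsed:
        unknown = set(parsed) - set(defaults)
        if unknown:
            raise KeyError(f"unsupported attributes: {sorted(unknown)}")
        attributes.update(parsed)
    if emotion_features is not None:
        attributes["emotion"] = list(emotion_features)
    return attributes


# First method: attributes parsed from
# "Generate high-quality speech, spoken in English, news style, happy".
from_text = resolve_attributes({
    "sound_quality_level": "high",
    "language": "English",
    "speech_style": "news",
    "emotion": "happy",
})

# Second method: the emotional attribute comes from a template speech.
from_template = resolve_attributes(emotion_features=[0.12, -0.40, 0.88])
```

Keeping the defaults in one table means any attribute the user leaves unspecified still reaches the model with a well-defined value, as the text describes.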
[0071] Figure 2 is an example diagram of speech synthesis provided in the embodiments of this disclosure. As shown in Figure 2, different speech attributes can be obtained by any of the three methods described above. The speech attributes, the text to be synthesized, and the identifier of the target speaker are input into the speech synthesis model so that the speech synthesis model finally outputs the target speech obtained by using the timbre of the target speaker.
[0072] In this embodiment, the emotional attributes that the user wants to control in speech synthesis can be obtained based on different forms of user input, thus improving the flexibility of speech synthesis.
[0073] In one embodiment, Figure 3 is a first schematic diagram of the training process of the emotion coding model provided in this disclosure. As shown in Figure 3, the emotion coding model is trained in the following manner:
[0074] Step 301: Input the second sample speech into the speech coding network of the initial emotion coding model to obtain the semantic features output by the speech coding network.
[0075] For example, each sample speech from a public speech training set and a high-quality speech training set can be used as a second sample speech. The second sample speech is then input into the speech coding network of the initial emotion coding model, and semantic features are extracted from the second sample speech through the speech coding network.
[0076] It should be noted that the public speech training set (pretrain-data) can be obtained as follows. The quality requirements for public speech training data are not high, but coverage of the various speech attributes needs to be considered. Public speech can be recorded or crawled from the internet, and needs to cover various speech environments, speech styles, and speech sources. This disclosure categorizes speech attributes into five main types: speech environment, sound quality level, language, speech style, and fine-grained emotion, and obtains these attributes through the following three different methods:
[0077] The first approach uses open-source detection tools to automatically determine the speech environment, sound quality level, and language of each speech. Speech environments include recording studios, conference rooms, lecture halls, and outdoor settings. Sound quality levels are categorized as high, medium, and low. Languages are classified according to actual data coverage and include Chinese, English, Russian, Japanese, Sichuanese, Northeastern Mandarin, or Cantonese.
[0078] The second approach automatically determines the speech style from the data acquisition channel. For example, for podcast data sources, tag information such as finance, history, news, and health can be obtained from the website, while audiobook data sources can be categorized by genre, such as fantasy, history, martial arts, suspense, and horror.
[0079] The third approach applies to fine-grained emotion, for which sentence-level annotations are difficult to acquire automatically. Instead, a combination of manual annotation and self-supervised pre-training can be used; for details, refer to the description below of training the emotion coding model, which can then output emotion features. Figure 4 is a schematic diagram of the public speech training set provided in this embodiment. As shown in Figure 4, the public speech training set includes multiple sample speeches and, for each sample speech, the corresponding speech environment, sound quality level, language, speech style, and fine-grained emotion.
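One entry of the public speech training set described above can be pictured as a record with five attribute fields. The sketch below is illustrative only: the class name, field names, and file path are hypothetical, and the fine-grained emotion is left optional because the text obtains it separately from the other four attributes.

```python
from dataclasses import dataclass
from typing import Optional

# Category values listed in the text; languages depend on actual data coverage.
SPEECH_ENVIRONMENTS = {"recording studio", "conference room", "lecture hall", "outdoor"}
SOUND_QUALITY_LEVELS = {"high", "medium", "low"}


@dataclass
class SampleSpeechRecord:
    """One entry of the public speech training set (field names are hypothetical)."""

    audio_path: str
    speech_environment: str   # from an automatic detection tool
    sound_quality_level: str  # from an automatic detection tool
    language: str             # from an automatic detection tool
    speech_style: str         # from the data acquisition channel
    emotion_description: Optional[str] = None  # manual annotation, if available

    def __post_init__(self):
        if self.speech_environment not in SPEECH_ENVIRONMENTS:
            raise ValueError(f"unknown speech environment: {self.speech_environment!r}")
        if self.sound_quality_level not in SOUND_QUALITY_LEVELS:
            raise ValueError(f"unknown sound quality level: {self.sound_quality_level!r}")


record = SampleSpeechRecord(
    audio_path="podcast_0001.wav",  # hypothetical file name
    speech_environment="recording studio",
    sound_quality_level="medium",
    language="English",
    speech_style="news",
)
```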
[0080] It should be noted that the high-quality speech training set (fine-tuning data) can be obtained in the following way: the sample speech in the high-quality speech training set is high-quality speech recorded by professional speakers in a recording studio using professional equipment. Compared with the diverse speech environments, languages, speech styles, and emotions covered by the pretrain-data, the high-quality speech training set may cover only part of the emotions and speech styles, and its sample speech is annotated with emotional description text. The high-quality speech training set is used to build a speech synthesis model for the target speaker.
[0081] Step 302: Input the preset learnable emotion space and the semantic features into the emotion coding network of the initial emotion coding model to obtain the second sample emotion features output by the emotion coding network.
[0082] For example, because the number of emotion types is limited, a fixed-dimensional learnable emotion space is designed in advance. The preset learnable emotion space consists of multiple emotion vectors, each representing a different emotion, and is input into the emotion coding network of the initial emotion coding model. Under normal circumstances, the emotion coding network can combine the emotions represented by the preset learnable emotion space to obtain more emotions. However, since these combined emotions may include emotions unrelated to the input second sample speech, information needs to be exchanged between the emotion coding network and the speech coding network, so that the emotion coding network obtains the semantic features output by the speech coding network. Then, based on the semantic features and the emotions represented by the preset learnable emotion space, the emotion coding network outputs the second sample emotion features that match the semantic features, also known as the emotion token.
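The text does not state how the emotion coding network combines the emotion vectors with the semantic features. Purely as an illustration, the sketch below assumes a scaled dot-product attention in which the semantic feature selects a weighted mixture of the emotion vectors. The function names and the toy vectors are hypothetical, and in a real model the emotion vectors would be trainable parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def emotion_token(emotion_space, semantic_feature):
    """Mix the emotion vectors using attention weights derived from the
    semantic feature (assumed mechanism, not confirmed by the source)."""
    dim = len(semantic_feature)
    scores = [
        sum(a * b for a, b in zip(vector, semantic_feature)) / math.sqrt(dim)
        for vector in emotion_space
    ]
    weights = softmax(scores)
    token = [
        sum(w * vector[k] for w, vector in zip(weights, emotion_space))
        for k in range(dim)
    ]
    return token, weights


# Three toy emotion vectors and one semantic feature.
emotion_space = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
token, weights = emotion_token(emotion_space, [2.0, 0.0])
```

Here the first emotion vector receives the largest weight because it is best aligned with the semantic feature, so emotions unrelated to the input speech contribute little to the resulting token.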
[0083] In addition, the second sample speech is input into the timbre encoding network of the initial emotion encoding model to obtain the timbre encoding output by the timbre encoding network, and the second sample text corresponding to the second sample speech is input into the text encoding network of the initial emotion encoding model to obtain the text content encoding output by the text encoding network.
[0084] Step 303: Determine the first minimum mutual information loss between the second sample emotional features and the timbre encoding, and determine the second minimum mutual information loss between the second sample emotional features and the text content encoding.
[0085] For example, once the second sample emotion features, the timbre encoding, and the text content encoding corresponding to the second sample speech are obtained, the first minimum mutual information loss $L_{\text{timbre}}$ between the second sample emotion features and the timbre encoding is calculated based on the following formula (1). The goal of the first minimum mutual information loss $L_{\text{timbre}}$ is to remove timbre-related information from the second sample emotion features. Specifically, vCLUB estimation can be used.
[0086] Here, $e_i$ represents the second sample emotion feature corresponding to the i-th frame of the second sample speech, $x_i$ represents the timbre encoding corresponding to the i-th frame of the second sample speech, $x_j$ represents the timbre encoding corresponding to the j-th frame of the second sample speech, $q(x_i \mid e_i)$ denotes the probability distribution of $x_i$ given $e_i$, $q(x_j \mid e_i)$ denotes the probability distribution of $x_j$ given $e_i$, and $n$ represents the total number of frames in the second sample speech, also known as the batch size. The physical meaning of the first minimum mutual information loss $L_{\text{timbre}}$ is that, given the second sample emotion feature $e_i$, the probabilities of the timbre encoding $x$ belonging to any of the frames of speech are as consistent as possible. That is, the second sample emotion feature $e_i$ is unrelated to timbre.
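The image for formula (1) is not reproduced in this text. Assuming the standard vCLUB (variational contrastive log-ratio upper bound) estimator and the symbols defined above, one form consistent with the description would be the following; the exact expression in the original filing may differ.

```latex
L_{\text{timbre}} = \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n}
  \left[ \log q(x_i \mid e_i) - \log q(x_j \mid e_i) \right]
\tag{1}
```

Under this form, the loss approaches zero when $\log q(x_i \mid e_i)$ equals the average of $\log q(x_j \mid e_i)$ over all frames $j$, which matches the stated goal that $e_i$ should not indicate which frame's timbre encoding it is paired with.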
[0087] Based on the following formula (2), the second minimum mutual information loss $L_{\text{content}}$ between the second sample emotion features and the text content encoding is calculated. The goal of the second minimum mutual information loss $L_{\text{content}}$ is to remove text content-related information from the second sample emotion features, which can be achieved using vCLUB estimation.
[0088] Here, $t_i$ represents the text content encoding corresponding to the i-th frame of the second sample speech, $t_j$ represents the text content encoding corresponding to the j-th frame of the second sample speech, $q(t_i \mid e_i)$ denotes the probability distribution of $t_i$ given $e_i$, and $q(t_j \mid e_i)$ denotes the probability distribution of $t_j$ given $e_i$. The physical meaning of the second minimum mutual information loss $L_{\text{content}}$ is that, given the second sample emotion feature $e_i$, the probabilities of the text content encoding $t$ belonging to any of the frames of speech are as consistent as possible. That is, the second sample emotion feature $e_i$ is unrelated to the text content.
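The image for formula (2) is likewise not reproduced in this text. Assuming the standard vCLUB estimator, with $n$ denoting the total number of frames and the other symbols as defined above, one form consistent with the description would be the following; the exact expression in the original filing may differ.

```latex
L_{\text{content}} = \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n}
  \left[ \log q(t_i \mid e_i) - \log q(t_j \mid e_i) \right]
\tag{2}
```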
[0089] Step 304: Based on the first minimum mutual information loss and the second minimum mutual information loss, update the network parameters of the emotion coding network to obtain the emotion coding model.
[0090] For example, once the first minimum mutual information loss and the second minimum mutual information loss are obtained, the network parameters of the emotion coding network can be updated using the sum of the two losses until the convergence condition is met, finally yielding the emotion coding model.
[0091] In this embodiment, the network parameters of the emotion coding network can be updated based on the first minimum mutual information loss of the second sample emotion features and timbre encoding, and the second minimum mutual information loss of the second sample emotion features and text content encoding. This enables the extraction of emotion-related information from sample speech, and by using the minimum mutual information decoupling training criterion, information unrelated to emotion, such as text content and timbre, is removed, thereby improving the purity of emotion information. This also improves the purity of the emotion features output by the trained emotion coding model and enhances the decoupling control capability of different speech attributes.
[0092] In one embodiment, Figure 5 is a second schematic diagram of the training process of the sentiment coding model provided in this embodiment. As shown in Figure 5, step 304 updates the network parameters of the sentiment coding network based on the first minimum mutual information loss and the second minimum mutual information loss to obtain the sentiment coding model. This can be achieved through the following steps:
[0093] Step 3041: Input the sentiment description text corresponding to the second sample speech into the description text encoding network of the initial sentiment encoding model to obtain the sentiment description code output by the description text encoding network.
[0094] For example, an annotator annotates the emotion of the second sample speech. Based on their own perception of the second sample speech, the annotator chooses the descriptive words or text that they consider appropriate as the annotation result. One or more descriptive words may be used. For example, the annotation result is "a feeling of slight joy, lively tone, and relatively fast speaking rate". This provides a fine-grained description of the emotion of the second sample speech, and the annotation result is called the emotion description text corresponding to the second sample speech. The emotion description text corresponding to the second sample speech is input into the description text encoding network of the initial emotion encoding model, which encodes the emotion description text to obtain the emotion description code output by the description text encoding network.
[0095] Step 3042: Determine the contrastive learning loss based on the sentiment features of the second sample and the sentiment description encoding.
[0096] For example, once the second sample emotion features and the emotion description code corresponding to the second sample speech are obtained, the contrastive learning loss is calculated based on the following formula (3). The purpose of the contrastive learning loss is to align the second sample emotion features and the corresponding emotion description codes in the same space. Specifically, the second sample emotion features are used as anchors, the matching emotion description code is used as the positive example, and the other emotion description codes are used as negative examples. The InfoNCE loss is used for training, and the contrastive learning loss is:
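A plausible form of formula (3), assuming the standard InfoNCE objective and the symbols e_i, d_i, d_j, f and E as they are defined in this description, is:

```latex
L_{contrast} = -\,\mathbb{E}\left[\log\frac{\exp\big(f(e_i, d_i)\big)}{\sum_{j}\exp\big(f(e_i, d_j)\big)}\right] \tag{3}
```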
[0097] Here, d_i represents the emotion description code corresponding to the i-th frame of the second sample speech, d_j represents the emotion description code corresponding to the j-th frame of the second sample speech, f is the distance metric function (which can be a dot product), and E is the expectation. The contrastive learning loss is an expectation over the training set and can be approximated using Monte Carlo estimation, typically implemented through batch-based gradient descent training. The physical meaning of the contrastive learning loss is that, among all emotion description codes, the second sample emotion feature is closest to the emotion description code that matches it.
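The contrastive learning loss can be illustrated with a minimal sketch of a batch InfoNCE computation, with the dot product as the metric function f. The function name `info_nce_loss` and the `temperature` parameter are assumptions introduced for illustration; this disclosure does not specify a temperature.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity, used here as the metric function f."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(
    emotion: Sequence[Vector],
    description: Sequence[Vector],
    temperature: float = 1.0,  # assumed hyperparameter, not from the source
) -> float:
    """Batch InfoNCE: emotion[i] is the anchor, description[i] the positive
    example, and every other description[j] a negative example."""
    n = len(emotion)
    total = 0.0
    for i in range(n):
        scores = [dot(emotion[i], description[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        total += -(scores[i] - log_denominator)
    return total / n


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is lower when each emotion feature is most similar to its own description code than when the description codes are swapped, which is the behaviour the physical meaning above describes.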
[0098] Step 3043: Based on the first minimum mutual information loss, the second minimum mutual information loss, and the contrastive learning loss, update the network parameters of the sentiment encoding network and the network parameters of the descriptive text encoding network to obtain the sentiment encoding model.
[0099] For example, after obtaining the first minimum mutual information loss, the second minimum mutual information loss, and the contrastive learning loss, the total loss Loss can be determined based on the following formula (4). The network parameters of the emotion encoding network and the descriptive text encoding network are updated based on the total loss Loss until the convergence condition is met, finally yielding the emotion encoding model.
Loss = L_contrast + L_timbre + L_content (4)
[0100] Figure 6 is a schematic diagram of the framework of the initial emotion coding model provided in this embodiment of the present disclosure. As shown in Figure 6, the initial emotion coding model includes a speech coding network (also known as a speech information coding network), an emotion coding network (also known as an emotion and mood coding network), a text coding network, a timbre coding network, and a descriptive text coding network. Specifically, speech is input into the speech coding network to obtain the semantic features output by the speech coding network. A preset learnable emotion space is input into the emotion coding network and interacts with the semantic features of the speech coding network to finally obtain the emotion token output by the emotion coding network. The emotional description text of the speech is input into the descriptive text coding network to obtain the emotional description code (also known as the description code) output by the descriptive text coding network. The text content corresponding to the speech is input into the text coding network to obtain the text content code output by the text coding network. Finally, the speech is input into the timbre coding network to obtain the timbre code output by the timbre coding network.
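One way the learnable emotion space could interact with the semantic features is through cross-attention, in which each learnable vector acts as a query over the frame-level semantic features. This is an assumption made for illustration; the disclosure states only that the two interact. The function name `emotion_tokens` is hypothetical, and a real network would add learned projections and multiple heads.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [v / norm for v in exps]


def emotion_tokens(emotion_space: Matrix, semantic_features: Matrix) -> Matrix:
    """Single-head cross-attention without learned projections (a simplification).

    emotion_space:     K learnable query vectors of dimension D.
    semantic_features: T frame-level vectors of dimension D from the speech encoder.
    Returns K emotion tokens, so the output length does not depend on T.
    """
    dim = len(emotion_space[0])
    scale = math.sqrt(dim)
    tokens: Matrix = []
    for query in emotion_space:
        # Attention weights of this query over all speech frames.
        weights = softmax(
            [sum(q * k for q, k in zip(query, frame)) / scale
             for frame in semantic_features]
        )
        # Weighted average of the frame features.
        tokens.append(
            [sum(w * frame[c] for w, frame in zip(weights, semantic_features))
             for c in range(dim)]
        )
    return tokens


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
queries = [[0.0, 0.0], [1.0, 1.0]]
print(len(emotion_tokens(queries, frames)))  # number of tokens equals number of queries
```

Because the number of output tokens is fixed by the size of the emotion space rather than by the number of speech frames, this design yields fixed-length emotion features for utterances of any duration.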
[0101] It should be noted that the network parameters of the emotion encoding network and the descriptive text encoding network in the initial emotion encoding model need to be trained, while the other modules can use pre-trained models. The emotion encoding network and the descriptive text encoding network can use conventional pre-trained text models, such as Bidirectional Encoder Representations from Transformers (BERT). The speech encoding network can use conventional pre-trained speech models, such as HuBERT and wav2vec; the text encoding network can use conventional pre-trained text models, such as BERT; and the timbre encoding network can use conventional voiceprint models, such as x-vector.
[0102] In this embodiment, the network parameters of the emotion coding network and the descriptive text coding network are updated by combining the contrastive learning loss and the minimum mutual information loss, based on the first minimum mutual information loss of the second sample emotion features and timbre encoding, the second minimum mutual information loss of the second sample emotion features and text content encoding, and the contrastive learning loss determined based on the second sample emotion features and emotion description encoding. This not only improves the purity of the emotion features output by the trained emotion coding model but also improves the accuracy of the emotion description encoding output by the emotion coding model. Furthermore, for unlabeled sample speech, the trained emotion coding model can output accurate emotion features, thereby reducing the workload of annotation.
[0103] In one embodiment, Figure 7 is a schematic diagram of the training process of the speech synthesis model provided in this disclosure. As shown in Figure 7, the speech synthesis model is trained in the following manner:
[0104] Step 701: Input the first sample text and the first sample emotion features into the speech feature extraction network of the initial speech synthesis model to obtain the sample speech features output by the speech feature extraction network.
[0105] The initial speech synthesis model can be VALL-E, Fish-speech, or chatTTS, etc., and this disclosure does not limit it.
[0106] For example, Figure 8 is a schematic diagram of the framework of the initial speech synthesis model provided in this embodiment. As shown in Figure 8, the initial speech synthesis model includes a speech feature extraction network, which converts text into semantic tokens, and a speech decoding network. For example, the speech feature extraction network can be a GPT (Generative Pre-trained Transformer) model. The speech feature extraction network determines the content and prosodic style of the synthesized speech. The speech attribute control module mainly operates on the speech feature extraction network. Specifically, the sample text and the emotion features corresponding to the sample text are input into the speech feature extraction network. Additionally, the synthesized speech of the previous frame can be input into the speech feature extraction network, allowing it to use the synthesized speech of the previous frame as a reference when generating the synthesized speech of the current frame, thereby improving the accuracy of the synthesized speech.
[0107] For example, each sample speech in the publicly available speech training set can be used as the first sample speech. The first sample text corresponding to the first sample speech and the first sample emotion feature corresponding to the first sample speech can be input into the speech feature extraction network of the initial speech synthesis model to obtain the sample speech features output by the speech feature extraction network.
[0108] Step 702: Input the sample speech features into the speech decoding network of the initial speech synthesis model to obtain the first predicted speech output by the speech decoding network.
[0109] For example, when the sample speech features output by the speech feature extraction network are obtained, the sample speech features are input into the speech decoding network of the initial speech synthesis model. The speech decoding network decodes the sample speech features to obtain the first predicted speech corresponding to the first sample text.
[0110] Step 703: Based on the first predicted speech and the first sample speech, update the network parameters of the speech feature extraction network and the network parameters of the speech decoding network to obtain the speech synthesis model.
[0111] For example, when the first predicted speech corresponding to each first sample text is obtained, a first loss function is constructed based on the first predicted speech corresponding to each first sample text and the first sample speech (speech label) corresponding to each first sample text. The network parameters of the speech feature extraction network and the network parameters of the speech decoding network are updated based on the first loss function until the convergence condition is reached, and finally the speech synthesis model is obtained.
[0112] In this embodiment, the initial speech synthesis model can be trained based on the sample text and emotional features corresponding to each sample speech in the publicly available speech training set, so that the final speech synthesis model can achieve fine-grained emotional control, thereby improving the controllability of speech synthesis.
[0113] In one embodiment, step 102 above inputs the text to be synthesized and the emotional attribute into the speech synthesis model to obtain the target speech output by the speech synthesis model, which can be specifically implemented in the following way:
[0114] The text to be synthesized, the emotional attribute, and other attributes are input into the speech synthesis model to obtain the target speech output by the speech synthesis model; the other attributes include at least one of the following: target speech environment, target sound quality level, target language, and target speech style, and the sample speech features are obtained by inputting the first sample text, the first sample emotional feature, and other sample attributes corresponding to the first sample speech into the speech feature extraction network.
[0115] The other sample attributes include at least one of the following: the sample speech environment corresponding to the first sample speech, the sample sound quality level corresponding to the first sample speech, the sample language corresponding to the first sample speech, and the sample speech style corresponding to the first sample speech.
[0116] For example, when the initial speech synthesis model is trained on the public speech training set, the sample speech environment, sample sound quality level, sample language, and sample speech style corresponding to the first sample speech, the first sample emotion features, the first sample text corresponding to the first sample speech, and the identifier of the speaker corresponding to the first sample speech can be concatenated to obtain concatenated information. The concatenated information is then input into the initial speech synthesis model for training.
[0117] Similarly, the text to be synthesized, emotional attributes, target speech environment, target sound quality level, target language, target speech style, and the identifier of the target speaker can be concatenated to obtain target concatenation information. This target concatenation information can then be input into the speech synthesis model to obtain the target speech of the target speaker output by the speech synthesis model.
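The concatenation of the text and the control attributes can be sketched as follows. The class `ControlAttributes`, the function `build_concatenated_input`, the segment labels, and the ordering of the segments are all hypothetical; the disclosure states only that these items are concatenated before being input into the model.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union

# A segment is a (label, value) pair; the emotion value is a feature vector,
# while the coarse-grained attributes are discrete category labels.
Segment = Tuple[str, Union[str, List[float]]]


@dataclass
class ControlAttributes:
    emotion: Sequence[float]  # emotion attribute (fine-grained feature vector)
    environment: str          # target speech environment
    quality_level: str        # target sound quality level
    language: str             # target language
    style: str                # target speech style
    speaker_id: str           # identifier of the target speaker


def build_concatenated_input(text: str, attrs: ControlAttributes) -> List[Segment]:
    """Concatenate control attributes and text into one ordered input sequence.

    The ordering below is an assumption made for illustration only.
    """
    return [
        ("environment", attrs.environment),
        ("quality_level", attrs.quality_level),
        ("language", attrs.language),
        ("style", attrs.style),
        ("speaker", attrs.speaker_id),
        ("emotion", list(attrs.emotion)),
        ("text", text),
    ]


request = ControlAttributes(
    emotion=[0.1, -0.3],
    environment="studio",
    quality_level="high",
    language="en",
    style="narration",
    speaker_id="speaker_001",
)
for label, value in build_concatenated_input("Hello there.", request):
    print(label, value)
```

Using the same builder for training samples and for inference requests keeps the input layout identical in both phases, which is what allows attributes specified at inference time to take effect.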
[0118] It should be noted that the first sample sentiment feature corresponding to the first sample speech can be obtained in different ways: for sample speech with sentiment description text annotation, the sentiment description encoding of the annotated sentiment description text is directly used as the first sample sentiment feature (sentiment token) corresponding to the first sample speech; for sample speech without sentiment description text annotation, the sample speech is input into the trained sentiment encoding model to obtain the sentiment token output by the sentiment encoding model, and this sentiment token is used as the first sample sentiment feature corresponding to the first sample speech.
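The two ways of obtaining the emotion feature for a sample can be sketched as a simple selection rule. The function `select_emotion_feature` and its two encoder arguments are hypothetical placeholders for the trained descriptive text encoding network and the trained emotion encoding model.

```python
from typing import Callable, List, Optional

EmotionFeature = List[float]


def select_emotion_feature(
    speech: bytes,
    description: Optional[str],
    encode_description: Callable[[str], EmotionFeature],
    encode_speech: Callable[[bytes], EmotionFeature],
) -> EmotionFeature:
    """Return the emotion token used as the sample emotion feature.

    Annotated samples use the encoding of their emotion description text;
    unannotated samples fall back to the emotion encoding model.
    """
    if description is not None and description.strip():
        return encode_description(description)
    return encode_speech(speech)


# Stand-in encoders, for demonstration only.
def fake_text_encoder(text: str) -> EmotionFeature:
    return [float(len(text))]


def fake_speech_encoder(audio: bytes) -> EmotionFeature:
    return [-1.0]


print(select_emotion_feature(b"...", "slightly joyful, lively tone",
                             fake_text_encoder, fake_speech_encoder))
print(select_emotion_feature(b"...", None, fake_text_encoder, fake_speech_encoder))
```

Because both encoders are trained to produce features in the same space, the speech synthesis model can consume either output without distinguishing between them.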
[0119] In this embodiment, the text to be synthesized, emotional attributes, target speech environment, target sound quality level, target language, target speech style, and the identifier of the target speaker can be input into the speech synthesis model to obtain the target speech of the target speaker output by the speech synthesis model. This realizes the control of different speech attributes by the speech synthesis model and further improves the controllability of speech synthesis. In addition, this disclosure uniformly divides speech attributes into emotion, speech environment, sound quality level, language, and speech style, resulting in a more comprehensive speech attribute classification system, which enables the speech synthesis model to control different speech attributes, that is, to control speech attributes in various dimensions.
[0120] In one embodiment, Figure 9 is a second schematic diagram of the training process of the speech synthesis model provided in this embodiment. As shown in Figure 9, step 703 updates the network parameters of the speech feature extraction network and the network parameters of the speech decoding network based on the first predicted speech and the first sample speech to obtain the speech synthesis model. Specifically, this can be achieved in the following ways:
[0121] Step 7031: Based on the first predicted speech and the first sample speech, update the network parameters of the speech feature extraction network and the network parameters of the speech decoding network to obtain a reference speech synthesis model.
[0122] For example, when the first predicted speech corresponding to each first sample text is obtained, a first loss function is constructed based on the first predicted speech corresponding to each first sample text and the first sample speech (speech label) corresponding to each first sample text. The network parameters of the speech feature extraction network and the network parameters of the speech decoding network are updated based on the first loss function until the convergence condition is reached, finally yielding the reference speech synthesis model.
[0123] Step 7032: Obtain the third sample text corresponding to the third sample speech and the third sample emotional features corresponding to the third sample speech, wherein the third sample speech is the speech recorded by the sample object in the sample location.
[0124] For example, each sample speech in the high-quality speech training set can be used as the third sample speech, and the corresponding third sample text can be obtained from the high-quality speech training set. The third sample sentiment features corresponding to the third sample speech can be obtained in different ways: for third sample speech with sentiment description text annotations, the sentiment description encoding of the annotation text is directly used as the third sample sentiment feature (sentiment token); for third sample speech without sentiment description text annotations, the third sample speech is input into a trained sentiment encoding model to obtain the sentiment token output by the sentiment encoding model, and this sentiment token is used as the third sample sentiment feature.
[0125] Step 7033: Input the third sample text, the third sample emotional features, and the identifier of the sample object into the reference speech synthesis model to obtain the second predicted speech output by the reference speech synthesis model.
[0126] For example, when the third sample text and the third sample emotion feature corresponding to the third sample speech are obtained, the third sample text, the third sample emotion feature and the identifier of the sample object are input into the reference speech synthesis model to obtain the second predicted speech corresponding to the third sample text output by the reference speech synthesis model.
[0127] Step 7034: Based on the second predicted speech and the third sample speech, update the network parameters of the speech feature extraction network and the speech decoding network in the reference speech synthesis model to obtain the speech synthesis model.
[0128] For example, when the second predicted speech corresponding to each third sample text is obtained, a second loss function is constructed based on the second predicted speech corresponding to each third sample text and the third sample speech (speech label) corresponding to each third sample text. The network parameters of the speech feature extraction network and the speech decoding network in the reference speech synthesis model are updated based on the second loss function until the convergence condition is reached, and finally the speech synthesis model is obtained.
[0129] It should be noted that, because the public speech training set covers a variety of speech attributes, the reference speech synthesis model trained on the public speech training set is able to control a variety of speech attributes. Therefore, even if the high-quality speech training set covers fewer types of speech attributes, the final trained speech synthesis model can still control the speech attributes that the high-quality speech training set does not cover, thereby achieving speech synthesis with attribute-transfer control.
[0130] It should be noted that the speech synthesis model can also be trained based on a first proportion of sample speech in a high-quality speech training set and a second proportion of sample speech in a publicly available speech training set, wherein the first proportion is greater than the second proportion, for example, the first proportion is 80% and the second proportion is 20%, and this disclosure does not limit this.
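The proportional mixing of the two training sets can be sketched as follows. The function `mix_training_set`, its parameters, and the fixed random seed are hypothetical; the 80% default reflects only the example proportion given above.

```python
import random
from typing import List, Sequence, TypeVar

Sample = TypeVar("Sample")


def mix_training_set(
    high_quality: Sequence[Sample],
    public: Sequence[Sample],
    total: int,
    first_proportion: float = 0.8,
    seed: int = 0,
) -> List[Sample]:
    """Draw a mixed training set in which high-quality samples predominate."""
    if not 0.5 < first_proportion <= 1.0:
        raise ValueError("the first proportion must exceed the second proportion")
    n_high = round(total * first_proportion)
    n_public = total - n_high
    if n_high > len(high_quality) or n_public > len(public):
        raise ValueError("not enough samples to satisfy the requested proportions")
    rng = random.Random(seed)
    # Sample without replacement from each set, then shuffle the combined set.
    mixed = rng.sample(list(high_quality), n_high) + rng.sample(list(public), n_public)
    rng.shuffle(mixed)
    return mixed


high_quality_set = [("high_quality", i) for i in range(20)]
public_set = [("public", i) for i in range(20)]
mixed_set = mix_training_set(high_quality_set, public_set, total=10)
print(sum(1 for source, _ in mixed_set if source == "high_quality"))
```

Keeping a minority share of public samples in the mix is one way to retain coverage of speech attributes that the high-quality set lacks.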
[0131] It should be noted that this disclosure further subdivides the methods for adding speech attributes. For speech attributes such as speech environment, language, and speech style, the user control granularity is coarse, and a coarse-grained discrete category representation is adopted. For emotion attributes, the user needs fine-grained control, and a contrastive learning loss pre-training method is used to pull the emotion features and emotion description encoding into the same space. Combined with the constraint of minimum mutual information loss, the information irrelevant to emotion in the emotion features is further reduced, and the purity of the emotion features is improved.
[0132] In summary, the speech synthesis method disclosed herein divides the non-text information in speech into speech attributes such as speech environment, sound quality level, language, speech style, emotion, and speaker timbre. Speech attributes are labeled through automatic annotation or self-supervised decoupled pre-training and are then input into the speech synthesis model, thereby achieving control over the attributes of the synthesized speech. Specifically, for speech environment, sound quality level, language, and speech style, users do not require high control precision, so labels can be designed and automatically identified from information sources or by tools. For fine-grained emotion, a large amount of publicly available speech and emotion description text is collected in advance, and fixed-length fine-grained emotion features are extracted. Through contrastive-learning pre-training and the minimum mutual information constraints on text content and timbre, the fine-grained emotion encoding space is aligned with the sentence-level text semantic encoding space, and attribute-decoupled fine-grained emotion features are extracted. After the different attribute labels of the speech are obtained, they are input together with the text into the speech synthesis model to train a base speech synthesis model with controllable speech attributes. Finally, fine-tuning is performed using data from the target speaker to obtain a speech synthesis model with controllable speech attributes for the target speaker. During the inference phase, the text to be synthesized and the specified attributes input by the user are fed into the trained speech synthesis model, enabling control over different attributes of the synthesized speech and improving the controllability and flexibility of the speech synthesis model.
[0133] The speech synthesis apparatus provided in this disclosure is described below. The speech synthesis apparatus described below can be referred to in correspondence with the speech synthesis method described above.
[0134] Figure 10 is a schematic diagram of the structure of a speech synthesis device provided in an embodiment of this disclosure. As shown in Figure 10, the speech synthesis device 1000 includes an acquisition unit 1001 and a synthesis unit 1002; wherein:
[0135] Acquisition unit 1001 is used to acquire the text to be synthesized and its sentiment attributes;
[0136] Synthesis unit 1002 is used to input the text to be synthesized and the emotional attribute into the speech synthesis model to obtain the target speech output by the speech synthesis model.
[0137] The speech synthesis model is trained based on the first sample text corresponding to the first sample speech and the first sample emotion features corresponding to the first sample speech. The first sample emotion features are obtained by inputting the first sample speech into the emotion coding model. The emotion coding model is trained based on the minimum mutual information loss of the target encoding and the second sample emotion features. The target encoding includes the timbre encoding of the second sample speech and / or the text content encoding of the second sample text of the second sample speech. The second sample emotion features are obtained by inputting the second sample speech into the initial emotion coding model.
[0138] The speech synthesis apparatus provided in this disclosure inputs the acquired text to be synthesized and its emotional attributes into a trained speech synthesis model to obtain the target speech output by the speech synthesis model. The speech synthesis model is trained based on the first sample text corresponding to the first sample speech and the first sample emotional features corresponding to the first sample speech. The first sample emotional features are obtained by inputting the first sample speech into an emotional coding model. The emotional coding model is trained based on the minimum mutual information loss of the target coding and the second sample emotional features. The target coding includes the timbre coding of the second sample speech and / or the text content coding of the second sample text of the second sample speech. The second sample emotional features are obtained by inputting the second sample speech into an initial emotional coding model. It is understood that this disclosure can train an emotional coding model based on the minimum mutual information loss of the target coding and the second sample emotional features, ensuring that the emotional features output by the emotional coding model do not include irrelevant information such as timbre and text content. This allows the speech synthesis model trained based on the emotional features output by the emotional coding model and the sample text to achieve fine-grained emotional control, thereby improving the controllability of speech synthesis.
[0139] Based on any of the above embodiments, the acquisition unit 1001 is specifically used for:
[0140] Upon receiving attribute description text input by the user, the attribute description text is input into the large text model to obtain the emotion attribute output by the large text model;
[0141] Upon receiving the emotional template speech input by the user, the emotional template speech is input into the emotional coding model to obtain the emotional features output by the emotional coding model, and the emotional features are determined as the emotional attribute.
[0142] Based on any of the above embodiments, the emotion encoding model is trained in the following manner:
[0143] The second sample speech is input into the speech coding network of the initial emotion coding model to obtain the semantic features output by the speech coding network;
[0144] The preset learnable emotion space and the semantic features are input into the emotion coding network of the initial emotion coding model to obtain the second sample emotion features output by the emotion coding network;
[0145] Determine the first minimum mutual information loss between the second sample sentiment features and the timbre encoding, and determine the second minimum mutual information loss between the second sample sentiment features and the text content encoding;
[0146] Based on the first minimum mutual information loss and the second minimum mutual information loss, the network parameters of the emotion coding network are updated to obtain the emotion coding model.
[0147] Based on any of the above embodiments, updating the network parameters of the emotion coding network based on the first minimum mutual information loss and the second minimum mutual information loss to obtain the emotion coding model includes:
[0148] The emotional description text corresponding to the second sample speech is input into the description text encoding network of the initial emotional encoding model to obtain the emotional description code output by the description text encoding network;
[0149] Based on the sentiment features of the second sample and the sentiment description encoding, determine the contrastive learning loss;
[0150] Based on the first minimum mutual information loss, the second minimum mutual information loss, and the contrastive learning loss, the network parameters of the sentiment encoding network and the network parameters of the descriptive text encoding network are updated to obtain the sentiment encoding model.
[0151] Based on any of the above embodiments, the speech synthesis model is trained in the following manner:
[0152] The first sample text and the first sample emotion features are input into the speech feature extraction network of the initial speech synthesis model to obtain the sample speech features output by the speech feature extraction network.
[0153] The sample speech features are input into the speech decoding network of the initial speech synthesis model to obtain the first predicted speech output by the speech decoding network;
[0154] Based on the first predicted speech and the first sample speech, the network parameters of the speech feature extraction network and the network parameters of the speech decoding network are updated to obtain the speech synthesis model.
[0155] Based on any of the above embodiments, the synthesis unit 1002 is specifically used for:
[0156] The text to be synthesized, the emotional attribute, and other attributes are input into the speech synthesis model to obtain the target speech output by the speech synthesis model; the other attributes include at least one of the following: target speech environment, target sound quality level, target language, and target speech style; the sample speech features are obtained by inputting the first sample text, the first sample emotional feature, and other sample attributes corresponding to the first sample speech into the speech feature extraction network.
[0157] The other sample attributes include at least one of the following: the sample speech environment corresponding to the first sample speech, the sample sound quality level corresponding to the first sample speech, the sample language corresponding to the first sample speech, and the sample speech style corresponding to the first sample speech.
[0158] Based on any of the above embodiments, updating the network parameters of the speech feature extraction network and the network parameters of the speech decoding network based on the first predicted speech and the first sample speech to obtain the speech synthesis model includes:
[0159] Based on the first predicted speech and the first sample speech, the network parameters of the speech feature extraction network and the network parameters of the speech decoding network are updated to obtain a reference speech synthesis model.
[0160] Obtain the third sample text corresponding to the third sample speech and the third sample emotional features corresponding to the third sample speech, wherein the third sample speech is the speech recorded by the sample object in the sample location;
[0161] The third sample text, the third sample emotional features, and the identifier of the sample object are input into the reference speech synthesis model to obtain the second predicted speech output by the reference speech synthesis model.
[0162] Based on the second predicted speech and the third sample speech, the network parameters of the speech feature extraction network and the speech decoding network in the reference speech synthesis model are updated to obtain the speech synthesis model.
[0163] Figure 11 is a schematic diagram of the physical structure of an electronic device provided in an embodiment of this disclosure. As shown in Figure 11, the electronic device may include: a processor 1110, a communications interface 1120, a memory 1130, and a communication bus 1140. The processor 1110, communications interface 1120, and memory 1130 communicate with each other via the communication bus 1140. The processor 1110 can call logical instructions in the memory 1130 to execute a speech synthesis method, which includes: acquiring the text to be synthesized and emotional attributes;
[0164] The text to be synthesized and the emotional attributes are input into the speech synthesis model to obtain the target speech output by the speech synthesis model;
[0165] The speech synthesis model is trained based on the first sample text corresponding to the first sample speech and the first sample emotion features corresponding to the first sample speech. The first sample emotion features are obtained by inputting the first sample speech into the emotion coding model. The emotion coding model is trained based on the minimum mutual information loss of the target encoding and the second sample emotion features. The target encoding includes the timbre encoding of the second sample speech and / or the text content encoding of the second sample text of the second sample speech. The second sample emotion features are obtained by inputting the second sample speech into the initial emotion coding model.
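As a rough illustration of the quantity that the minimum mutual information loss drives down, the sketch below computes a plug-in estimate of mutual information between discrete emotion codes and speaker (timbre) labels. This is an assumption-laden toy: the disclosure does not state here how mutual information is estimated, continuous network features would normally call for a learned estimator, and the function `mutual_information` and the labels `e0`, `e1`, `spk_A`, `spk_B` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in nats from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) expressed with raw counts
        ratio = c * n / (count_x[x] * count_y[y])
        mi += p_xy * math.log(ratio)
    return mi


# Emotion code fully determined by the speaker: timbre leaks into the feature.
entangled = [("e0", "spk_A"), ("e0", "spk_A"), ("e1", "spk_B"), ("e1", "spk_B")]

# Emotion code independent of the speaker: nothing about timbre can be inferred.
disentangled = [("e0", "spk_A"), ("e1", "spk_A"), ("e0", "spk_B"), ("e1", "spk_B")]

print(mutual_information(entangled))
print(mutual_information(disentangled))
```

In this toy, the entangled pairing carries log 2 nats of shared information while the independent pairing carries none, which is the direction in which training with a mutual information penalty would push the emotion features relative to the timbre encoding or the text content encoding.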
[0166] Furthermore, the logical instructions in the aforementioned memory 1130 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0167] In another aspect, this disclosure also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer executes the speech synthesis method provided by the above methods, which includes: acquiring the text to be synthesized and emotional attributes;
[0168] The text to be synthesized and the emotional attributes are input into the speech synthesis model to obtain the target speech output by the speech synthesis model;
[0169] The speech synthesis model is trained based on the first sample text corresponding to the first sample speech and the first sample emotion features corresponding to the first sample speech. The first sample emotion features are obtained by inputting the first sample speech into the emotion coding model. The emotion coding model is trained based on the minimum mutual information loss of the target encoding and the second sample emotion features. The target encoding includes the timbre encoding of the second sample speech and / or the text content encoding of the second sample text of the second sample speech. The second sample emotion features are obtained by inputting the second sample speech into the initial emotion coding model.
[0170] In another aspect, this disclosure also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the speech synthesis method provided by the above methods, the method comprising: acquiring the text to be synthesized and emotional attributes;
[0171] The text to be synthesized and the emotional attributes are input into the speech synthesis model to obtain the target speech output by the speech synthesis model;
[0172] The speech synthesis model is trained based on the first sample text corresponding to the first sample speech and the first sample emotion features corresponding to the first sample speech. The first sample emotion features are obtained by inputting the first sample speech into the emotion coding model. The emotion coding model is trained based on the minimum mutual information loss of the target encoding and the second sample emotion features. The target encoding includes the timbre encoding of the second sample speech and / or the text content encoding of the second sample text of the second sample speech. The second sample emotion features are obtained by inputting the second sample speech into the initial emotion coding model.
[0173] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0174] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0175] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this disclosure, and are not intended to limit them. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure.
Claims
1. A speech synthesis method, comprising: Obtain the text to be synthesized and emotional attributes; The text to be synthesized and the emotional attributes are input into the speech synthesis model to obtain the target speech output by the speech synthesis model; The speech synthesis model is trained based on the first sample text corresponding to the first sample speech and the first sample emotion features corresponding to the first sample speech. The first sample emotion features are obtained by inputting the first sample speech into the emotion coding model. The emotion coding model is trained based on the minimum mutual information loss of the target encoding and the second sample emotion features. The target encoding includes the timbre encoding of the second sample speech and / or the text content encoding of the second sample text of the second sample speech. The second sample emotion features are obtained by inputting the second sample speech into the initial emotion coding model.
2. The speech synthesis method according to claim 1, wherein obtaining the emotional attributes includes: Upon receiving attribute description text input by the user, the attribute description text is input into the large text model to obtain the emotional attribute output by the large text model; Upon receiving the emotional template speech input by the user, the emotional template speech is input into the emotion coding model to obtain the emotion features output by the emotion coding model, and the emotion features are determined as the emotional attribute.
3. The speech synthesis method according to claim 1, wherein the emotion coding model is trained in the following manner: The second sample speech is input into the speech coding network of the initial emotion coding model to obtain the semantic features output by the speech coding network; The preset learnable emotion space and the semantic features are input into the emotion coding network of the initial emotion coding model to obtain the second sample emotion features output by the emotion coding network; Determine the first minimum mutual information loss between the second sample emotion features and the timbre encoding, and determine the second minimum mutual information loss between the second sample emotion features and the text content encoding; Based on the first minimum mutual information loss and the second minimum mutual information loss, the network parameters of the emotion coding network are updated to obtain the emotion coding model.
4. The speech synthesis method according to claim 3, wherein updating the network parameters of the emotion coding network based on the first minimum mutual information loss and the second minimum mutual information loss to obtain the emotion coding model includes: The emotional description text corresponding to the second sample speech is input into the description text encoding network of the initial emotion coding model to obtain the emotional description encoding output by the description text encoding network; Based on the second sample emotion features and the emotional description encoding, determine the contrastive learning loss; Based on the first minimum mutual information loss, the second minimum mutual information loss, and the contrastive learning loss, the network parameters of the emotion coding network and the network parameters of the description text encoding network are updated to obtain the emotion coding model.
5. The speech synthesis method according to any one of claims 1-4, wherein the speech synthesis model is trained in the following manner: The first sample text and the first sample emotion features are input into the speech feature extraction network of the initial speech synthesis model to obtain the sample speech features output by the speech feature extraction network. The sample speech features are input into the speech decoding network of the initial speech synthesis model to obtain the first predicted speech output by the speech decoding network; Based on the first predicted speech and the first sample speech, the network parameters of the speech feature extraction network and the network parameters of the speech decoding network are updated to obtain the speech synthesis model.
6. The speech synthesis method according to claim 5, wherein inputting the text to be synthesized and the emotional attribute into the speech synthesis model to obtain the target speech output by the speech synthesis model includes: The text to be synthesized, the emotional attribute, and other attributes are input into the speech synthesis model to obtain the target speech output by the speech synthesis model; the other attributes include at least one of the following: target speech environment, target sound quality level, target language, and target speech style; the sample speech features are obtained by inputting the first sample text, the first sample emotional feature, and other sample attributes corresponding to the first sample speech into the speech feature extraction network. The other sample attributes include at least one of the following: the sample speech environment corresponding to the first sample speech, the sample sound quality level corresponding to the first sample speech, the sample language corresponding to the first sample speech, and the sample speech style corresponding to the first sample speech.
7. The speech synthesis method according to claim 5, wherein updating the network parameters of the speech feature extraction network and the network parameters of the speech decoding network based on the first predicted speech and the first sample speech to obtain the speech synthesis model comprises: Based on the first predicted speech and the first sample speech, the network parameters of the speech feature extraction network and the network parameters of the speech decoding network are updated to obtain a reference speech synthesis model. Obtain the third sample text corresponding to the third sample speech and the third sample emotional features corresponding to the third sample speech, wherein the third sample speech is the speech recorded by the sample object in the sample location; The third sample text, the third sample emotional features, and the identifier of the sample object are input into the reference speech synthesis model to obtain the second predicted speech output by the reference speech synthesis model. Based on the second predicted speech and the third sample speech, the network parameters of the speech feature extraction network and the speech decoding network in the reference speech synthesis model are updated to obtain the speech synthesis model.
8. A speech synthesis apparatus, comprising: An acquisition unit is used to acquire the text to be synthesized and emotional attributes; A synthesis unit is used to input the text to be synthesized and the emotional attributes into a speech synthesis model to obtain the target speech output by the speech synthesis model. The speech synthesis model is trained based on the first sample text corresponding to the first sample speech and the first sample emotion features corresponding to the first sample speech. The first sample emotion features are obtained by inputting the first sample speech into the emotion coding model. The emotion coding model is trained based on the minimum mutual information loss of the target encoding and the second sample emotion features. The target encoding includes the timbre encoding of the second sample speech and / or the text content encoding of the second sample text of the second sample speech. The second sample emotion features are obtained by inputting the second sample speech into the initial emotion coding model.
9. The speech synthesis apparatus according to claim 8, wherein the acquisition unit is specifically used for: Upon receiving attribute description text input by the user, the attribute description text is input into the large text model to obtain the emotional attribute output by the large text model; Upon receiving the emotional template speech input by the user, the emotional template speech is input into the emotion coding model to obtain the emotion features output by the emotion coding model, and the emotion features are determined as the emotional attribute.
10. The speech synthesis apparatus according to claim 8, wherein the emotion coding model is trained in the following manner: The second sample speech is input into the speech coding network of the initial emotion coding model to obtain the semantic features output by the speech coding network; The preset learnable emotion space and the semantic features are input into the emotion coding network of the initial emotion coding model to obtain the second sample emotion features output by the emotion coding network; Determine the first minimum mutual information loss between the second sample emotion features and the timbre encoding, and determine the second minimum mutual information loss between the second sample emotion features and the text content encoding; Based on the first minimum mutual information loss and the second minimum mutual information loss, the network parameters of the emotion coding network are updated to obtain the emotion coding model.
11. The speech synthesis apparatus according to claim 10, wherein updating the network parameters of the emotion coding network based on the first minimum mutual information loss and the second minimum mutual information loss to obtain the emotion coding model comprises: The emotional description text corresponding to the second sample speech is input into the description text encoding network of the initial emotion coding model to obtain the emotional description encoding output by the description text encoding network; Based on the second sample emotion features and the emotional description encoding, determine the contrastive learning loss; Based on the first minimum mutual information loss, the second minimum mutual information loss, and the contrastive learning loss, the network parameters of the emotion coding network and the network parameters of the description text encoding network are updated to obtain the emotion coding model.
12. The speech synthesis apparatus according to any one of claims 8-11, wherein the speech synthesis model is trained in the following manner: The first sample text and the first sample emotion features are input into the speech feature extraction network of the initial speech synthesis model to obtain the sample speech features output by the speech feature extraction network. The sample speech features are input into the speech decoding network of the initial speech synthesis model to obtain the first predicted speech output by the speech decoding network; Based on the first predicted speech and the first sample speech, the network parameters of the speech feature extraction network and the network parameters of the speech decoding network are updated to obtain the speech synthesis model.
13. The speech synthesis apparatus according to claim 12, wherein the synthesis unit is specifically used for: The text to be synthesized, the emotional attribute, and other attributes are input into the speech synthesis model to obtain the target speech output by the speech synthesis model; the other attributes include at least one of the following: target speech environment, target sound quality level, target language, and target speech style; the sample speech features are obtained by inputting the first sample text, the first sample emotional feature, and other sample attributes corresponding to the first sample speech into the speech feature extraction network. The other sample attributes include at least one of the following: the sample speech environment corresponding to the first sample speech, the sample sound quality level corresponding to the first sample speech, the sample language corresponding to the first sample speech, and the sample speech style corresponding to the first sample speech.
14. The speech synthesis apparatus according to claim 12, wherein updating the network parameters of the speech feature extraction network and the network parameters of the speech decoding network based on the first predicted speech and the first sample speech to obtain the speech synthesis model comprises: Based on the first predicted speech and the first sample speech, the network parameters of the speech feature extraction network and the network parameters of the speech decoding network are updated to obtain a reference speech synthesis model. Obtain the third sample text corresponding to the third sample speech and the third sample emotional features corresponding to the third sample speech, wherein the third sample speech is the speech recorded by the sample object in the sample location; The third sample text, the third sample emotional features, and the identifier of the sample object are input into the reference speech synthesis model to obtain the second predicted speech output by the reference speech synthesis model. Based on the second predicted speech and the third sample speech, the network parameters of the speech feature extraction network and the speech decoding network in the reference speech synthesis model are updated to obtain the speech synthesis model.
15. An electronic device comprising a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor, when executing the computer program, implements the speech synthesis method as claimed in any one of claims 1 to 7.
16. A non-transitory computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the speech synthesis method as described in any one of claims 1 to 7.