Model training method, audio processing method, electronic device and storage medium
By training a target emotion dimension prediction model, the discrete feature vectors of the emotion speech dataset are mapped to a multi-dimensional emotion space, which solves the problems of high training cost and imprecise emotion control in emotion speech synthesis systems, and realizes high-quality and diverse emotion speech generation.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: ALIBABA (CHINA) CO LTD
- Filing Date: 2025-08-15
- Publication Date: 2026-07-02
AI Technical Summary
Existing emotional speech synthesis systems rely on emotional speech data, which is costly to train and lacks precise emotional control, making it difficult to simulate the complex and ever-changing emotional states of humans.
By acquiring an emotional speech dataset and a predefined anchor set, a target emotional dimension prediction model is trained. The discrete emotional feature vectors of the emotional speech dataset are mapped to a multi-dimensional emotional space, guiding the emotional speech synthesis model to predict acoustic details and reducing dependence on emotional speech data.
It achieves low-cost training of emotion dimension prediction and speech synthesis models, improves the precision of emotion control and the naturalness of complex emotion simulation, and generates high-quality, diverse emotional speech.
Description
Model training method, audio processing method, electronic device and storage medium
Technical Field
[0001] This disclosure relates to audio processing technology and large model technology, specifically to a model training method, an audio processing method, an electronic device, and a storage medium.
Background Technology
[0002] With the rapid development of artificial intelligence and natural language processing technologies, the research and application of emotional text-to-speech (TTS) systems have attracted increasing attention. Emotional text-to-speech aims to convey emotional information through computer-generated speech to enhance the naturalness of human-computer interaction and user experience.
[0003] Currently, emotional speech synthesis systems face significant challenges in mimicking complex human emotions, primarily due to the scarcity of emotional speech data and the limitations of models in capturing emotional features. Current systems require large amounts of emotionally tagged speech data for training. The high cost and complexity of acquiring this data restrict model training and optimization, consequently affecting the emotional expressiveness and naturalness of the synthesized speech. Furthermore, current systems lack finesse in controlling the emotional dimension, making it difficult to accurately simulate the complex and ever-changing emotional states of humans.
[0004] There is currently no effective solution to the above problems.
Summary of the Invention
[0005] This disclosure provides a model training method, an audio processing method, an electronic device, and a storage medium to at least solve the technical problems in related technologies, such as the reliance on emotional speech data, high training costs, and poor precision in emotional control and complex emotional simulation.
[0006] According to one aspect of the present disclosure, a model training method is provided, comprising: acquiring an emotional speech dataset and a predefined anchor set, wherein the predefined anchor set is used to define multiple different types of emotion categories; training an initial emotion dimension prediction model based on the emotional speech dataset and the predefined anchor set to generate a target emotion dimension prediction model, wherein the target emotion dimension prediction model is used to map discrete emotion feature vectors corresponding to the emotional speech dataset to a multi-dimensional emotion space to guide the emotional speech synthesis model to predict the acoustic details corresponding to the prompt audio.
[0007] According to another aspect of the embodiments of this disclosure, a model training method is also provided, comprising: acquiring training text, training audio, and a training text transcription corresponding to the training audio; using a target emotion dimension prediction model to predict the emotion dimensions of the training audio to obtain an emotion dimension vector corresponding to the training audio, wherein the target emotion dimension prediction model is obtained by using any of the above-described model training methods; and training an initial emotional speech synthesis model based on the training text, training audio, training text transcription, and emotion dimension vector to generate a target emotional speech synthesis model, wherein the target emotional speech synthesis model is used to perform emotional speech synthesis on target text, prompt audio, and a prompt text transcription corresponding to the prompt audio to obtain target emotional speech.
[0008] According to another aspect of the present disclosure, an audio processing method is also provided, comprising: acquiring target text, prompt audio, and a prompt text transcription corresponding to the prompt audio; and performing emotional speech synthesis on the target text, prompt audio, and prompt text transcription using a target emotional speech synthesis model to obtain target emotional speech; wherein the target emotional speech synthesis model is obtained by using any of the above-described model training methods.
[0009] According to another aspect of the embodiments of this disclosure, an audio processing method is also provided, comprising: acquiring input text to be converted into virtual customer service voice, virtual customer service prompt audio, and prompt text transcription corresponding to the virtual customer service prompt audio; performing emotional speech synthesis on the input text, virtual customer service prompt audio, and prompt text transcription using a target emotional speech synthesis model to obtain virtual customer service emotional voice; wherein, the target emotional speech synthesis model is obtained by using any of the above-mentioned model training methods.
[0010] According to another aspect of the embodiments of this disclosure, an audio processing method is also provided, comprising: obtaining an audio processing request through a first application programming interface, wherein the request data carried in the audio processing request includes: target text, prompt audio, and a prompt text transcription corresponding to the prompt audio; and returning an audio processing response through a second application programming interface, wherein the response data carried in the audio processing response includes: target emotional speech, wherein the target emotional speech is obtained by performing emotional speech synthesis on the target text, prompt audio, and prompt text transcription using a target emotional speech synthesis model, and the target emotional speech synthesis model is obtained by using any of the above-mentioned model training methods.
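As an illustration of the request and response data described in [0010], the following is a minimal sketch rather than the interface of the disclosure. The class names, field names, `handle_request`, and `fake_synthesizer` are all hypothetical; the disclosure only states that the request carries the target text, the prompt audio, and the prompt text transcription, and that the response carries the target emotional speech.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AudioProcessingRequest:
    """Request data received through the first interface (names are assumed)."""
    target_text: str
    prompt_audio: bytes
    prompt_transcript: str


@dataclass(frozen=True)
class AudioProcessingResponse:
    """Response data returned through the second interface (names are assumed)."""
    target_emotional_speech: bytes


# A synthesizer takes (target text, prompt audio, prompt transcription)
# and returns audio bytes. The real model is not specified here.
Synthesizer = Callable[[str, bytes, str], bytes]


def handle_request(request: AudioProcessingRequest,
                   synthesizer: Synthesizer) -> AudioProcessingResponse:
    """Validate the request, run synthesis, and wrap the result."""
    if not request.target_text:
        raise ValueError("target_text must not be empty")
    speech = synthesizer(request.target_text,
                         request.prompt_audio,
                         request.prompt_transcript)
    return AudioProcessingResponse(target_emotional_speech=speech)


def fake_synthesizer(text: str, prompt_audio: bytes, prompt_transcript: str) -> bytes:
    """Placeholder standing in for the target emotional speech synthesis model."""
    return b"WAVE:" + text.encode("utf-8")
```

Passing the synthesizer in as a callable keeps the request handling independent of whichever trained model is actually deployed behind the interface.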
[0011] According to another aspect of the embodiments of this disclosure, an audio processing method is also provided, comprising: acquiring a currently input audio processing dialogue request, wherein the request data carried in the audio processing dialogue request includes: target text, prompt audio, and a prompt text transcription corresponding to the prompt audio; responding to the audio processing dialogue request and returning an audio processing dialogue response, wherein the information carried in the audio processing dialogue response includes: target emotional speech, wherein the target emotional speech is obtained by performing emotional speech synthesis on the target text, prompt audio, and prompt text transcription using a target emotional speech synthesis model, and the target emotional speech synthesis model is obtained using any of the above-described model training methods; and playing the target emotional speech through a speaker device.
[0012] According to another aspect of the present disclosure, an electronic device is also provided, including: a memory storing an executable program; and a processor connected to the memory via a bus for running the program, wherein the program executes the methods in various embodiments of the present disclosure during runtime.
[0013] According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is also provided, the computer-readable storage medium including a stored executable program, wherein, when the executable program is executed, it controls the device where the computer-readable storage medium is located to perform the methods of the various embodiments of the present disclosure.
[0014] According to another aspect of the embodiments of this disclosure, a computer program product is also provided, including a computer program that, when executed by a processor, implements the methods of various embodiments of this disclosure.
[0015] In the embodiments of this disclosure, an emotional speech dataset and a predefined anchor set are acquired, and an initial emotional dimension prediction model is trained based on the emotional speech dataset and the predefined anchor set to generate a target emotional dimension prediction model. The trained target emotional dimension prediction model is used to map the discrete emotional feature vectors corresponding to the emotional speech dataset to a multi-dimensional emotional space, thereby guiding the emotional speech synthesis model to predict the acoustic details corresponding to the prompt audio. This allows the emotional dimension prediction model and the emotional speech synthesis model to be trained at low cost, so that their training does not depend on large amounts of emotional speech data. The emotional dimension prediction model improves the precision of emotional control and the naturalness of expression in complex emotional simulation of the emotional speech synthesis model, enabling the generation of high-quality and diverse emotional speech. This solves the technical problems of emotional speech synthesis systems in related technologies, such as dependence on emotional speech data, high training costs, and poor precision of emotional control and complex emotional simulation.
[0016] It is worth noting that the above general description and the following detailed description are merely for illustrative and explanatory purposes and do not constitute a limitation thereof.
Attached Figure Description
[0017] The accompanying drawings, which are included to provide a further understanding of this disclosure and form part of this disclosure, illustrate exemplary embodiments of the present disclosure and are used to explain the disclosure, but do not constitute an undue limitation of the disclosure. In the drawings:
[0018] Figure 1 is a schematic diagram of an application scenario of a model training method according to an embodiment of the present disclosure;
[0019] Figure 2 is a flowchart of a model training method according to an embodiment of the present disclosure;
[0020] Figure 3 is a schematic diagram of training an emotion dimension prediction model according to an embodiment of the present disclosure;
[0021] Figure 4 is a flowchart of a model training method according to an embodiment of the present disclosure;
[0022] Figure 5 is a schematic diagram of the framework of a speech synthesis system according to an embodiment of the present disclosure;
[0023] Figure 6 is a flowchart of an audio processing method according to an embodiment of the present disclosure;
[0024] Figure 7 is a flowchart of an audio processing method according to an embodiment of the present disclosure;
[0025] Figure 8 is a flowchart of an audio processing method according to an embodiment of the present disclosure;
[0026] Figure 9 is a flowchart of an audio processing method according to an embodiment of the present disclosure;
[0027] Figure 10 is a schematic diagram of a model training device according to an embodiment of the present disclosure;
[0028] Figure 11 is a schematic diagram of another model training device according to an embodiment of the present disclosure;
[0029] Figure 12 is a schematic diagram of an audio processing apparatus according to an embodiment of the present disclosure;
[0030] Figure 13 is a schematic diagram of another model training device according to an embodiment of the present disclosure;
[0031] Figure 14 is a schematic diagram of another model training device according to an embodiment of the present disclosure;
[0032] Figure 15 is a schematic diagram of another model training device according to an embodiment of the present disclosure;
[0033] Figure 16 is a structural block diagram of a computing device according to an embodiment of the present disclosure;
[0034] Figure 17 is a structural block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Implementation
[0035] To enable those skilled in the art to better understand the present disclosure, the technical solutions of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of the present disclosure, and not all embodiments. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present disclosure.
[0036] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0037] The technical solution disclosed herein is primarily implemented using large-scale model technology. Here, "large-scale model" refers to a deep learning model with a massive number of parameters, typically containing hundreds of millions, tens of billions, hundreds of billions, trillions, or even tens of trillions of parameters. Large-scale models, also known as foundation models, are pre-trained using large-scale unlabeled corpora to produce pre-trained models with hundreds of millions of parameters. These models are adaptable to a wide range of downstream tasks and exhibit good generalization ability. Examples include Large Language Models (LLMs) and multi-modal pre-training models.
[0038] It should be noted that, in practical applications, large models can be fine-tuned using a small number of samples to adapt them to different tasks. For example, large models can be widely applied in Natural Language Processing (NLP), computer vision, and speech processing. Specifically, they can be applied to computer vision tasks such as Visual Question Answering (VQA), Image Captioning (IC), and Image Generation, as well as NLP tasks such as text-based sentiment classification, text summarization, and machine translation. Therefore, the main application scenarios for large models include, but are not limited to, digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design. In the embodiments of this disclosure, audio processing using a target emotion dimension prediction model trained with the model training method of this disclosure is taken as an example for illustration.
[0039] First, some nouns or terms that appear in the description of the embodiments of this disclosure shall be interpreted as follows:
[0040] Emotional Text-to-Speech (TTS): A speech synthesis technology that generates speech with emotional features, making the synthesized speech more similar to human expression.
[0041] Pleasure, arousal, and dominance are three dimensions used in psychology to describe emotions. Pleasure refers to the positive or negative nature of an emotion, arousal refers to the intensity of an emotion, and dominance refers to the perception of control or power. These dimensions help in fine-tuning emotional styles.
[0042] Self-Supervised Learning (SSL): A machine learning method that does not require labeled data. It generates supervisory signals through the structure of the data itself and is often used to extract deep features from the data.
[0043] Dimensionality Reduction: The process of mapping high-dimensional data to a low-dimensional space, with the aim of preserving important information and simplifying model complexity for easier analysis.
[0044] Autoregressive Language Model: A language model that generates text or phoneme sequences by progressively generating or predicting the next symbol, used to convert text input into phoneme tokens.
[0045] Categorical Labels: Labels used for classification, such as emotion categories like "pleasure" or "sadness," which can help the model learn the emotional features of different categories.
[0046] Pseudo-Emotional Dimensions: These simulate emotional dimensions through representations within the model, rather than learning from actual emotional data, in order to achieve emotional synthesis in the absence of emotional speech data.
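To make the "autoregressive" definition in [0044] concrete, the following toy sketch generates a phoneme-like token sequence one symbol at a time. It is an illustration only: the score table `NEXT_TOKEN_SCORES`, the token names, and `generate` are hypothetical, and the table conditions only on the previous token, whereas a real autoregressive language model would condition on the whole prefix and the input text.

```python
from typing import Dict, List

# Hypothetical toy "model": for each current token, a score for every
# possible next token. Values are invented for illustration.
NEXT_TOKEN_SCORES: Dict[str, Dict[str, float]] = {
    "<s>": {"HH": 0.9, "AH": 0.1},
    "HH": {"AH": 0.8, "</s>": 0.2},
    "AH": {"L": 0.7, "</s>": 0.3},
    "L": {"OW": 0.6, "</s>": 0.4},
    "OW": {"</s>": 1.0},
}


def generate(max_len: int = 10) -> List[str]:
    """Greedy autoregressive decoding: repeatedly pick the best next token."""
    tokens = ["<s>"]
    while len(tokens) - 1 < max_len:
        scores = NEXT_TOKEN_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


print(generate())  # ['HH', 'AH', 'L', 'OW']
```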
[0047] With the increasing demand for emotional voice in applications such as intelligent customer service, virtual assistants, education, and entertainment, the market urgently needs a text-to-speech (TTS) technology that can flexibly control emotional expression and has low training costs to improve user interaction experience and system intelligence.
[0048] Currently, emotional speech synthesis systems face significant challenges in mimicking complex human emotions, primarily due to the scarcity of emotional data and the limitations of models in capturing emotional features. The high cost and difficulty in acquiring emotional speech data restrict model training and optimization, consequently affecting the emotional expressiveness and naturalness of synthesized speech. Furthermore, current emotional speech synthesis systems lack finesse in controlling the emotional dimension, making it difficult to accurately simulate the complex and ever-changing emotional states of humans.
[0049] The emotional speech synthesis systems in related technologies have the following drawbacks.
[0050] Defect 1: It relies on emotional speech data, but the acquisition of emotional speech data is costly and complex, resulting in high training costs and limiting the training and optimization of the model.
[0051] Defect 2: It lacks finesse in controlling the emotional dimension, making it difficult to accurately simulate the complex and ever-changing emotional states of humans, and it lacks support for complex emotions.
[0052] To address the aforementioned deficiencies, no effective solution has been proposed prior to this disclosure.
[0053] According to embodiments of this disclosure, a model training method is provided. It should be noted that the steps shown in the flowcharts in the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order than that shown here.
[0054] Considering the large number of model parameters in large models and the limited computing resources of mobile terminals, the method provided in this disclosure can be applied to the application scenario shown in Figure 1, but is not limited thereto. In the application scenario shown in Figure 1, the large model is deployed on server 10. Server 10 can connect to one or more client devices 20 via a local area network (LAN), wide area network (WAN), Internet, or other types of data networks. These client devices 20 may include, but are not limited to, smartphones, tablets, laptops, PDAs, personal computers, smart home devices, and in-vehicle devices. Client devices 20 can interact with users through a graphical user interface to invoke the large model, thereby implementing the method provided in this disclosure.
[0055] In this embodiment of the disclosure, the system consisting of a client device and a server can perform the following steps: the client device performs steps such as acquiring an emotional speech dataset and a predefined anchor set, and sending the emotional speech dataset and the predefined anchor set to the server; the server performs steps such as training an initial emotional dimension prediction model based on the acquired emotional speech dataset and the predefined anchor set, generating a target emotional dimension prediction model, and returning the target emotional dimension prediction model to the client device. It should be noted that, provided that the operating resources of the client device can meet the deployment and operation conditions of a large model, this embodiment of the disclosure can be performed on the client device.
[0056] It should be noted that with the rapid development of high-performance computing units, the methods provided in this disclosure can also be applied to integrated model machines in other application scenarios. In one optional embodiment, the integrated model machine has multiple built-in models. Users can select one model and adjust it as needed to obtain their own model. The high-performance computing unit built into the integrated model machine can then directly call the adjusted model to execute the methods provided in this disclosure. In another optional embodiment, the integrated model machine has a built-in pre-trained model, so the high-performance computing unit built into the integrated model machine can directly call this model to execute the methods provided in this disclosure.
[0057] Furthermore, when users need to train their own models, they can upload their own datasets via the client. These datasets are then sent to the server, allowing the server to adjust the pre-trained model using the dataset to obtain the user's customized model, which can then be deployed to the production environment. To facilitate users' model adjustment needs, the server provides complete adjustment tools, development frameworks, and processes, supporting multiple adjustment strategies. This allows the adjusted model to better adapt to different application domains and achieve a high degree of customization.
[0058] Under the above operating environment, this disclosure provides a model training method as shown in Figure 2. Figure 2 is a flowchart of a model training method according to an embodiment of this disclosure. As shown in Figure 2, the method may include the following steps:
[0059] Step S21: Obtain the emotional speech dataset and the predefined anchor set, wherein the predefined anchor set is used to define various different types of emotion categories;
[0060] Step S22: Train the initial emotion dimension prediction model based on the emotional speech dataset and the predefined anchor set to generate the target emotion dimension prediction model. The target emotion dimension prediction model is used to map the discrete emotion feature vectors corresponding to the emotional speech dataset to the multi-dimensional emotion space to guide the emotional speech synthesis model to predict the acoustic details corresponding to the prompt audio.
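Steps S21 and S22 can be sketched as a small supervised training loop. This is a minimal sketch under stated assumptions, not the training procedure of the disclosure: the disclosure does not specify the model architecture or loss, so a linear predictor trained with squared error toward each sample's anchor is assumed here, and the anchor coordinates, the toy feature vectors, and all function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Sample = Tuple[Vector, str]  # (feature vector, emotion category label)

# Step S21 (assumed values): predefined anchors in (pleasure, arousal, dominance).
ANCHORS: Dict[str, Vector] = {
    "happiness": [0.8, 0.7, 0.7],
    "sadness": [-0.7, -0.6, 0.0],
}


def predict(weights: List[Vector], bias: Vector, features: Vector) -> Vector:
    """Linear map from a feature vector to a 3-dimensional emotion vector."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def mean_squared_error(weights: List[Vector], bias: Vector,
                       dataset: Sequence[Sample]) -> float:
    """Average squared distance between predictions and anchor targets."""
    total = 0.0
    for features, label in dataset:
        pred = predict(weights, bias, features)
        total += sum((p - t) ** 2 for p, t in zip(pred, ANCHORS[label]))
    return total / len(dataset)


def train(dataset: Sequence[Sample], epochs: int = 200, lr: float = 0.1,
          seed: int = 0) -> Tuple[List[Vector], Vector]:
    """Step S22 (assumed form): stochastic gradient descent toward the anchors."""
    rng = random.Random(seed)
    n_features = len(dataset[0][0])
    weights = [[0.0] * n_features for _ in range(3)]
    bias = [0.0] * 3
    samples = list(dataset)
    for _ in range(epochs):
        rng.shuffle(samples)
        for features, label in samples:
            pred = predict(weights, bias, features)
            target = ANCHORS[label]
            for d in range(3):
                error = pred[d] - target[d]  # gradient of 0.5 * error**2
                for j in range(n_features):
                    weights[d][j] -= lr * error * features[j]
                bias[d] -= lr * error
    return weights, bias


# Toy stand-in for the emotional speech dataset (invented feature values).
DATASET: List[Sample] = [
    ([1.0, 0.2], "happiness"), ([0.9, 0.1], "happiness"),
    ([0.1, 0.9], "sadness"), ([0.2, 1.0], "sadness"),
]
```

Calling `train(DATASET)` returns weights whose predictions lie close to the anchor of each sample's category; once trained, the same predictor can be applied to audio that carries no emotion label.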
[0061] In this embodiment of the disclosure, an emotional speech dataset can be understood as a collection of speech data samples. For example, the speech data samples in an emotional speech dataset can come from different corpora, such as actors' performances, recordings of emotional readings, and emotional dialogues, etc., without limitation. It is understood that, compared to ordinary speech datasets, emotional speech datasets possess a rich and diverse range of emotional expressions, aiming to cover multiple emotional states and intensities, as well as emotional expressions in different contexts. Ordinary speech datasets, on the other hand, have relatively limited diversity in emotional expression and do not specifically consider emotional factors.
[0062] Predefined emotion anchors can be understood as pre-determined emotion feature points used to define various types of emotion categories. The predefined anchor sets in this disclosure are set in a three-dimensional emotion space, specifically defining typical emotion categories in the space of Pleasure ('P'), Arousal ('A'), and Dominance ('D').
[0063] For example, emotion categories include basic emotions such as happiness, sadness, anger, fear, surprise, and disgust, as well as complex emotions such as shame, pride, gratitude, indifference, unease, sympathy, anxiety, tension, annoyance, excitement, frustration, peace, envy, and guilt, without limitation. Happiness might be defined as high pleasure, high arousal, and high dominance, while sadness might be defined as low pleasure, low arousal, and moderate dominance. A predefined set of anchor points provides the specific location of different emotions in the emotion space, thereby helping the model understand the characteristics of specific emotions and finely control emotional styles.
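A predefined anchor set of this kind can be written down as a small lookup table. The sketch below is illustrative only: the numeric coordinates are assumptions chosen to match the qualitative descriptions above (happiness as high pleasure, arousal, and dominance; sadness as low pleasure, low arousal, and moderate dominance) on an assumed scale of -1 to 1, and `nearest_anchor` is a hypothetical helper.

```python
import math
from typing import Dict, Tuple

PAD = Tuple[float, float, float]  # (pleasure, arousal, dominance)

# Assumed coordinates; the disclosure gives only qualitative levels.
ANCHORS: Dict[str, PAD] = {
    "happiness": (0.8, 0.7, 0.7),   # high P, high A, high D
    "sadness": (-0.7, -0.6, 0.0),   # low P, low A, moderate D
}


def nearest_anchor(point: PAD) -> str:
    """Return the emotion category whose anchor is closest to the point."""
    return min(ANCHORS, key=lambda name: math.dist(point, ANCHORS[name]))


print(nearest_anchor((0.6, 0.5, 0.5)))    # happiness
print(nearest_anchor((-0.5, -0.5, 0.1)))  # sadness
```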
[0064] Understandably, when processing speech signals, the original continuous audio waveform is converted into a series of analyzable features that can describe the pitch, volume, speech rate, rhythm, and other aspects of speech. The discrete emotion feature vector corresponding to an emotional speech dataset can be understood as extracting a series of emotion-related features by analyzing each speech sample in the dataset, and then quantifying these features as a numerical vector to describe the acoustic properties and emotion-related characteristics of the speech.
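The conversion from a waveform to a numerical feature vector can be illustrated with two simple acoustic measurements. This is a rough sketch and not the feature extractor of the disclosure: root-mean-square energy stands in for volume, and the zero-crossing rate is used as a crude proxy for pitch; both choices and all function names are assumptions.

```python
import math
from typing import List


def rms_energy(samples: List[float]) -> float:
    """Root-mean-square amplitude, a simple measure of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples: List[float], sample_rate: int) -> float:
    """Sign changes per second; roughly twice the frequency of a pure tone."""
    crossings = sum(1 for prev, cur in zip(samples, samples[1:])
                    if (prev >= 0) != (cur >= 0))
    return crossings * sample_rate / len(samples)


def feature_vector(samples: List[float], sample_rate: int) -> List[float]:
    """Quantify one speech sample as [loudness, zero-crossing rate]."""
    return [rms_energy(samples), zero_crossing_rate(samples, sample_rate)]


# Usage with a synthetic one-second 100 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
features = feature_vector(tone, SAMPLE_RATE)
```

For the synthetic tone, the first feature is close to 0.707 (the RMS of a unit sine wave) and the second is close to 200 crossings per second.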
[0065] A multidimensional emotional space can be understood as a theoretical model for quantifying and describing human emotional states. It treats emotions as positions or vectors across multiple continuous dimensions, rather than simply classifying them into discrete emotional categories (such as happiness, sadness, anger, etc.). In this embodiment, the multidimensional emotional space can be a three-dimensional emotional space, namely, a space of pleasure, arousal, and dominance, or it can include other dimensions, such as tension, certainty, gloom, and sociality, etc., which are not limited here.
[0066] Emotional speech synthesis models can be understood as models based on emotional text-to-speech (Emotional TTS) technology that can generate natural, nuanced, and emotionally expressive speech in different application contexts. They not only focus on the clarity and coherence of speech but also emphasize the emotional expression in speech, making the synthesized speech closer to the natural way humans communicate.
[0067] Mapping discrete emotional feature vectors corresponding to emotional speech datasets to a multi-dimensional emotional space can be understood as the process of mapping discrete emotional feature vectors to continuous values within a specific coordinate or range in the multi-dimensional emotional space. This is also the process of converting the classified emotional representation into numerical values, thereby continuously representing discrete emotional feature vectors in the emotional space. This enables the model to learn and generate a wider range of emotional styles, improving the naturalness and diversity of synthesized speech, especially in the case of training without emotional data.
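One simple way to turn a categorical emotion representation into continuous coordinates is to take a probability-weighted average of the anchors. The disclosure does not state that this is the mapping it uses; the weighted average, the anchor coordinates, and the name `to_emotion_space` are assumptions made for illustration.

```python
from typing import Dict, Tuple

PAD = Tuple[float, float, float]  # (pleasure, arousal, dominance)

# Assumed anchor coordinates on a scale of -1 to 1.
ANCHORS: Dict[str, PAD] = {
    "happiness": (0.8, 0.7, 0.7),
    "sadness": (-0.7, -0.6, 0.0),
}


def to_emotion_space(category_weights: Dict[str, float]) -> PAD:
    """Map per-category weights to a continuous point in the emotion space."""
    total = sum(category_weights.values())
    if total <= 0:
        raise ValueError("category weights must sum to a positive value")
    p, a, d = (
        sum(w * ANCHORS[name][dim] for name, w in category_weights.items()) / total
        for dim in range(3)
    )
    return (p, a, d)


pure = to_emotion_space({"happiness": 1.0})
mixed = to_emotion_space({"happiness": 0.5, "sadness": 0.5})
```

A pure category lands exactly on its anchor, while a mixture lands between anchors, which is how a continuous space can represent blended or intermediate emotions that a single discrete label cannot.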
[0068] The Emotional Dimension Predictor (EDP) maps discrete emotion feature vectors from an emotional speech dataset to a multi-dimensional emotion space, guiding the emotional speech synthesis model to predict the acoustic details corresponding to the prompt audio. Essentially, the EDP provides guidance on "emotion" to the emotional speech synthesis model, thus instructing it to predict the acoustic details of the prompt audio. For example, when the emotional speech synthesis model receives a prompt audio (or a specific speech sample), the EDP analyzes the emotional features of the audio and converts them into coordinates in the emotion space. Based on these coordinates—the representation of the target emotion—the model predicts the corresponding acoustic details, ensuring that the generated speech not only matches the text in linguistic content but also maintains consistency with the prompt audio in emotional expression.
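The data flow described in [0068] can be sketched as a two-stage call: the EDP first turns the prompt audio into emotion-space coordinates, and the synthesis model then receives those coordinates alongside its other inputs. The function names, argument order, and the toy stand-ins `toy_edp` and `toy_tts` are hypothetical; they only show how the pieces connect, not how either model works internally.

```python
from typing import Any, Callable, Dict, List, Tuple

EmotionVector = Tuple[float, float, float]  # (pleasure, arousal, dominance)
Predictor = Callable[[List[float]], EmotionVector]
Synthesizer = Callable[[str, str, List[float], EmotionVector], Any]


def synthesize_with_emotion(target_text: str,
                            prompt_audio: List[float],
                            prompt_transcript: str,
                            predict_emotion: Predictor,
                            synthesize: Synthesizer) -> Any:
    """Predict emotion coordinates from the prompt, then condition synthesis on them."""
    emotion = predict_emotion(prompt_audio)
    return synthesize(target_text, prompt_transcript, prompt_audio, emotion)


def toy_edp(prompt_audio: List[float]) -> EmotionVector:
    """Stand-in predictor: a louder prompt yields higher arousal (invented rule)."""
    loudness = sum(abs(s) for s in prompt_audio) / len(prompt_audio)
    return (0.0, min(1.0, loudness), 0.0)


def toy_tts(text: str, transcript: str, audio: List[float],
            emotion: EmotionVector) -> Dict[str, Any]:
    """Stand-in synthesizer: records what it was asked to produce."""
    return {"text": text, "emotion": emotion}


result = synthesize_with_emotion("Hello", [0.5, -0.5, 0.5, -0.5], "hi",
                                 toy_edp, toy_tts)
```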
[0069] As can be seen, in this embodiment, by acquiring an emotional speech dataset and a predefined anchor set, and then training an initial emotional dimension prediction model based on the emotional speech dataset and the predefined anchor set, a target emotional dimension prediction model is generated. The trained target emotional dimension prediction model is used to map the discrete emotional feature vectors corresponding to the emotional speech dataset to a multi-dimensional emotional space, thereby guiding the emotional speech synthesis model to predict the acoustic details corresponding to the prompt audio. Thus, by acquiring a predefined anchor set, various different types of emotional categories are defined and standardized, enabling the target emotional dimension prediction model to understand and distinguish different emotional states, thereby providing consistency and predictability in subsequent emotional prediction and speech synthesis. At the same time, training the emotional dimension prediction model with a predefined anchor set reduces the dependence on emotional speech data, allowing the emotional dimension prediction model to be trained on relatively few emotional speech samples and reducing training costs. Furthermore, the trained target emotional dimension prediction model can map the discrete emotional feature vectors extracted from the emotional speech dataset to a multi-dimensional emotional space, thereby not only recognizing emotional categories but also capturing subtle differences in emotions, such as emotional intensity and emotional color, resulting in richer and more nuanced emotional expression. In addition, the introduction of a multi-dimensional emotional space allows the target emotional dimension prediction model to represent emotions numerically, rather than relying solely on discrete emotional labels. Multiple emotional dimensions, such as pleasure, arousal, and dominance, make emotional expression more continuous, better simulating the complexity and diversity of human emotions, thus providing more flexible and refined means for emotional control.
[0070] In addition, the target emotion dimension prediction model trained in this embodiment can guide the emotion speech synthesis model. That is, the target emotion dimension prediction model is responsible for converting the emotion features of the input audio into continuous and quantifiable dimension values. The emotion speech synthesis model then guides the speech synthesis based on these values, ensuring that the synthesized speech can accurately convey the required emotion. Thus, the target emotion dimension prediction model trained in this embodiment can provide more delicate emotion control for the emotion speech synthesis model, improve the precision of emotion control and the naturalness of expression in complex emotion simulation, and reduce the dependence of the emotion speech synthesis model on emotion speech data. Even in the absence of a large amount of emotion speech data, it can still generate high-quality and diverse emotion speech.
[0071] The model training method provided in this disclosure can be applied, but is not limited to, to application scenarios involving emotion dimension prediction model training or emotion speech synthesis model training in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services. For example, it can be applied to model training related to intelligent customer service or virtual assistant projects in e-commerce services, model training related to education services, model training related to legal services, etc. There are no limitations here.
[0072] By employing the embodiments of this disclosure, an emotional speech dataset and a predefined anchor point set are acquired. Then, an initial emotional dimension prediction model is trained based on the emotional speech dataset and the predefined anchor point set to generate a target emotional dimension prediction model. The trained target emotional dimension prediction model is used to map the discrete emotional feature vectors corresponding to the emotional speech dataset to a multi-dimensional emotional space, thereby guiding the emotional speech synthesis model to predict the acoustic details corresponding to the prompt audio. This achieves the goal of training the emotional dimension prediction model and the emotional speech synthesis model at low cost, thus realizing that the model training of the emotional dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data. The emotional dimension prediction model improves the precision of emotional control and the naturalness of complex emotional simulation in the emotional speech synthesis model, enabling the generation of high-quality and diverse emotional speech. This solves the technical problems of related technologies where emotional speech synthesis systems rely on emotional speech data, have high training costs, and exhibit poor precision in emotional control and complex emotional simulation.
[0073] In an optional embodiment, the initial emotion dimension prediction model includes a feature transformation module, a classification module, and an anchoring dimensionality reduction module. In step S22, training the initial emotion dimension prediction model based on the emotional speech dataset and the predefined anchor point set to generate the target emotion dimension prediction model includes the following method steps:
[0074] Step S221: Use the feature transformation module to perform feature transformation on the emotional speech dataset to obtain emotional feature vectors;
[0075] Step S222: Use the classification module to classify the emotional feature vectors into emotional states to obtain emotional category target tags;
[0076] Step S223: Based on the predefined anchor point set, guide the anchoring dimensionality reduction module to reduce the dimensionality of the emotion feature vector to obtain the emotion dimension vector;
[0077] Step S224: Train the initial emotion dimension prediction model based on the emotion feature vector, the emotion category target tag, and the emotion dimension vector to generate the target emotion dimension prediction model.
[0078] In this embodiment of the disclosure, the initial emotion dimension prediction model includes: a feature transformation module, a classification module, and an anchoring dimensionality reduction module. The feature transformation module is used to extract key acoustic features from the emotional speech dataset.
[0079] The classification module receives the emotion feature vector output by the feature transformation module and classifies it to predict the emotion category label corresponding to the emotional speech sample. The classification module can be, for example, a multilayer perceptron (MLP) or a multi-class logistic regression; there are no restrictions here.
[0080] The anchoring dimensionality reduction module is used to extract key emotional dimension information, such as pleasure, arousal and dominance, from high-dimensional emotional feature vectors and convert them into low-dimensional representations.
[0081] When training an initial emotion dimension prediction model based on an emotional speech dataset and a predefined anchor point set to generate a target emotion dimension prediction model, a feature transformation module can be used to transform the features of the emotional speech dataset to obtain emotion feature vectors. This can be understood as processing a speech dataset containing various emotional expressions through the feature transformation module to extract emotion-related features. For example, the emotion feature vectors may include acoustic features such as pitch, intensity, spectrum, and formants, used to represent the physical characteristics of emotional expression in speech.
[0082] After obtaining the emotion feature vector, the classification module can be used to classify the emotion state of the feature vector, resulting in an emotion category target tag. This can be understood as using the classification module to assign the obtained emotion feature vector to one of several emotion states. For example, emotion states can include, but are not limited to, happiness, sadness, anger, and calmness. Thus, an emotion category target tag is generated for each emotional speech sample. These tags help the model learn the feature representations of different emotion states, enabling it to distinguish and generate speech with different emotions in subsequent predictions.
[0083] Meanwhile, the anchoring dimensionality reduction module can be guided to reduce the dimensionality of the emotion feature vectors based on a predefined set of anchor points, resulting in emotion dimension vectors. This can be understood as introducing a predefined set of anchor points to guide the dimensionality reduction process of the anchoring dimensionality reduction module, enabling the module to map the emotion feature vectors in the high-dimensional space to a lower-dimensional emotion space while maintaining their relative position with the anchor points in the emotion dimension. This makes it easier for the emotion speech synthesis model to process and continuously control the emotion.
[0084] For example, suppose a set of emotional speech data is provided, where each sample may contain multiple features such as pitch, intensity, speech rate, formants, and zero-crossing rate. The vector formed by these features can be tens or hundreds of dimensions, represented as a high-dimensional emotional feature vector. To map the high-dimensional emotional feature vector to a low-dimensional emotional space (e.g., a three-dimensional emotional space) while preserving the relative distances and relationships between emotional categories, this disclosure introduces a predefined set of anchor points, defining an anchor point—a three-dimensional coordinate—for each emotion. For example, "happiness" might correspond to (0.8, 0.8, 0.6), representing high pleasure, high arousal, and moderate dominance; "sadness" might correspond to (0.2, 0.2, 0.4), representing low pleasure, low arousal, and moderate dominance. Feature extraction is then performed, extracting acoustic features from the emotional speech data to form a high-dimensional feature vector, such as a 128-dimensional vector. Then, nonlinear dimensionality reduction techniques such as Uniform Manifold Approximation and Projection (UMAP) or t-distributed Stochastic Neighbor Embedding (t-SNE) are used to project the high-dimensional feature vectors into a lower-dimensional space to preserve the similarity between samples.
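For illustration only, the anchor point set in the example above can be sketched as a small lookup table. The two coordinates are the example values given in the preceding paragraph; the names `ANCHORS`, `nearest_anchor`, and `anchor_distance` are hypothetical and are not part of this disclosure, and Euclidean distance is an assumed choice of metric.

```python
import math

# Hypothetical anchor table: emotion -> (pleasure, arousal, dominance).
# The two coordinates are the example values quoted in the text.
ANCHORS = {
    "happiness": (0.8, 0.8, 0.6),
    "sadness": (0.2, 0.2, 0.4),
}

def nearest_anchor(pad, anchors=ANCHORS):
    """Return the emotion whose anchor is closest (Euclidean) to a PAD point."""
    return min(anchors, key=lambda name: math.dist(pad, anchors[name]))

def anchor_distance(a, b, anchors=ANCHORS):
    """Euclidean distance between the anchors of two emotion categories."""
    return math.dist(anchors[a], anchors[b])

# A point with high pleasure and arousal lies closer to the "happiness" anchor.
closest = nearest_anchor((0.7, 0.9, 0.5))
```

A dimensionality reduction step that preserves relative position with respect to the anchors would keep samples tagged "happiness" nearer to (0.8, 0.8, 0.6) than to (0.2, 0.2, 0.4).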
[0085] Finally, the initial emotional dimension prediction model is trained based on the emotional feature vector, emotional category target tag, and emotional dimension vector to generate the target emotional dimension prediction model. This can be understood as using the emotional feature vector, emotional category target tag, and emotional dimension vector as training inputs, and adjusting the weights and parameters of the initial emotional dimension prediction model through optimization algorithms such as backpropagation. This allows the initial emotional dimension prediction model to learn the mapping relationship between emotional features and emotional dimensions, ultimately generating the target emotional dimension prediction model. This means generating a model that can accurately predict emotional dimensions (such as pleasure, arousal, and dominance), enabling the target emotional dimension prediction model to not only identify emotional categories but also capture continuous changes in emotions, achieving more nuanced emotional control.
[0086] In an optional embodiment, the feature transformation module includes: a pre-trained acoustic model and a linear module. In step S221, using the feature transformation module to perform feature transformation on the emotional speech dataset to obtain emotional feature vectors includes the following method steps:
[0087] Step S2211: Use a pre-trained acoustic model to extract features from the emotional speech dataset to obtain acoustic feature vectors;
[0088] Step S2212: The acoustic feature vector is transformed linearly using a linear module to obtain the emotion feature vector, wherein the feature dimension of the emotion feature vector is a preset fixed value.
[0089] In this embodiment, the feature transformation module includes a pre-trained acoustic model and a linear module. The pre-trained acoustic model can be a self-supervised acoustic model based on the Transformer architecture, designed for processing acoustic signals, particularly speech waveforms. It is pre-trained on large-scale unlabeled audio data through self-supervised learning (SSL), thereby learning deep-level features of the audio signal without relying on a large amount of labeled audio data.
[0090] A linear module can be understood as a fully connected neural network layer used to linearly transform input features. In this embodiment of the disclosure, the linear module is used after the pre-trained acoustic model to map the complex features extracted by the model to a feature space suitable for a specific task.
[0091] When using a feature transformation module to transform the features of an emotional speech dataset to obtain emotional feature vectors, a pre-trained acoustic model can be used to extract features from the emotional speech dataset to obtain acoustic feature vectors. This can be understood as using a pre-trained acoustic model to extract features from audio samples in the emotional speech dataset, thereby capturing deep-level features of the audio signal, including but not limited to acoustic details such as pitch, intensity, spectrum, and formants, as well as the emotional information related to these details. This helps in understanding and imitating the emotional expression of human speech.
[0092] For example, each audio sample is fed into a pre-trained acoustic model, which outputs a high-dimensional acoustic feature vector containing rich acoustic and potential emotional information. Even audio data without emotional annotation can extract effective emotion-related features from it.
[0093] After obtaining the acoustic feature vectors, the linear module is used to perform a linear feature transformation on them, resulting in an emotion feature vector. This transformation can be implemented through one or more linear layers, with the aim of mapping the high-dimensional acoustic feature vectors to a lower-dimensional emotion feature space whose dimension is a preset fixed value. The resulting emotion feature vectors emphasize emotion-related features; for example, they may emphasize acoustic attributes closely related to emotional expression, such as pitch variations, rhythm, and intensity. This is not limited here.
[0094] The feature dimension of the emotion feature vector is a preset fixed value, thereby reducing feature redundancy and the complexity of subsequent processing while maintaining the integrity of emotion information. For example, the preset fixed value can be 128 dimensions, but this is not limited here.
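As a minimal sketch of the linear module, assuming a single fully connected layer, the transformation y = W x + b can be written as follows. The output size of 128 is the example value from the text; the input size of 768, the random initialization, and the placeholder input are assumptions made only so the example runs.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Create randomly initialized weights and a zero bias for one linear layer."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def linear(x, weights, bias):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

ACOUSTIC_DIM = 768  # assumed size of the acoustic feature vector (not in the text)
EMOTION_DIM = 128   # example preset fixed value from the text

weights, bias = make_linear(ACOUSTIC_DIM, EMOTION_DIM)
acoustic_vector = [0.01 * (i % 7) for i in range(ACOUSTIC_DIM)]  # placeholder features
emotion_vector = linear(acoustic_vector, weights, bias)  # always EMOTION_DIM values
```

Whatever the size of the acoustic feature vector, the output length is fixed by the number of rows in the weight matrix, which is what makes the feature dimension a preset value.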
[0095] In an optional embodiment, in step S224, training the initial emotion dimension prediction model based on the emotion feature vector, the emotion category target tag, and the emotion dimension vector to generate the target emotion dimension prediction model includes the following method steps:
[0096] Step S2241: Classify the emotion feature vector based on the emotion category target tag to obtain the classification feature vector, wherein the classification feature vector is the feature vector obtained by classifying according to different emotion categories;
[0097] Step S2242: Determine the target loss between the classification feature vector and the emotion feature vector;
[0098] Step S2243: Update the parameters of the linear module, the classification module, and the anchoring dimensionality reduction module according to the target loss to generate the target emotion dimension prediction model.
[0099] In this embodiment, when training the initial emotion dimension prediction model based on the emotion feature vectors, emotion category target tags, and emotion dimension vectors to generate the target emotion dimension prediction model, the emotion feature vectors can first be classified based on the emotion category target tags to obtain classification feature vectors. This can be understood as using the emotion category target tags as supervision signals when classifying the emotion feature vectors. This enables the model to understand the internal structure of different types of emotion features and the relationship between those features and specific emotion tags. Through classification, the model learns which features are associated with specific emotion categories. The resulting classification feature vectors are the feature vectors grouped according to different emotion categories, and they contain feature information closely related to those categories.
[0100] After obtaining the classification feature vector, the target loss between the classification feature vector and the emotion feature vector is determined. For example, the target loss can be the cross-entropy loss, which measures the difference between the probability distribution predicted by the model and the probability distribution of the actual emotion category. Calculating the target loss quantifies the model's performance in emotion classification and thus provides a basis for subsequent parameter adjustments.
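The cross-entropy loss mentioned above can be illustrated with a short sketch. It assumes the classifier outputs one raw score (logit) per emotion category and that the actual category is given as an index; the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the actual emotion category."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# With four categories and no preference, the loss equals log(4).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
# A confident, correct prediction gives a smaller loss than a confident, wrong one.
correct_loss = cross_entropy([5.0, 0.0, 0.0, 0.0], 0)
wrong_loss = cross_entropy([5.0, 0.0, 0.0, 0.0], 1)
```

The loss is small when the predicted distribution puts most of its mass on the actual category and grows as that probability shrinks.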
[0101] Finally, the parameters of the linear module, the classification module, and the anchoring dimensionality reduction module are updated based on the target loss to generate the target emotion dimension prediction model. This can be understood as the model adjusting its parameters based on the calculated target loss to optimize emotion classification performance. Parameter updates are typically accomplished through backpropagation, which adjusts the model's weights based on the gradient of the loss function to minimize the loss. For example, the updated objects may include the linear module, the classification module, and the anchoring dimensionality reduction module.
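A single gradient-based update can be sketched as follows. For brevity the sketch updates only a one-layer softmax classifier standing in for the classification module; the linear module and the anchoring dimensionality reduction module would be updated from the same loss in an analogous way. The learning rate, the input values, and the function names are assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, weights, bias):
    """Logits of a one-layer classifier: W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def loss(x, target, weights, bias):
    """Cross-entropy loss for one sample."""
    return -math.log(softmax(forward(x, weights, bias))[target])

def sgd_step(x, target, weights, bias, lr=0.1):
    """One gradient-descent update; returns the new (weights, bias)."""
    probs = softmax(forward(x, weights, bias))
    # Gradient of cross-entropy with respect to the logits: probs - one_hot(target).
    grad = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    new_weights = [[w - lr * g * v for w, v in zip(row, x)]
                   for row, g in zip(weights, grad)]
    new_bias = [b - lr * g for b, g in zip(bias, grad)]
    return new_weights, new_bias

x = [1.0, -0.5, 0.25]                    # placeholder emotion feature vector
weights = [[0.0] * 3 for _ in range(4)]  # four emotion categories
bias = [0.0] * 4
before = loss(x, 1, weights, bias)
weights, bias = sgd_step(x, 1, weights, bias)
after = loss(x, 1, weights, bias)        # smaller than `before`
```

Repeating this step over many samples is the iterative parameter update referred to in this embodiment.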
[0102] Thus, through continuous iterative parameter updates, the model gradually learns how to extract emotional information from acoustic features and accurately map it to predefined emotional dimensions (such as pleasure, arousal, and dominance). The resulting target emotional dimension prediction model possesses the ability to predict and control speech emotion without relying on emotional speech data, thereby achieving more natural and richer emotional expression in emotional speech synthesis. This not only improves the model's accuracy but also reduces the need for large amounts of emotional label data, improving training efficiency and the model's generalization ability.
[0103] In an optional embodiment, in step S2241, the emotion feature vector is classified based on the emotion category target tag to obtain a classification feature vector, including the following method steps:
[0104] Step S22411: Create a nearest neighbor graph using the emotion feature vectors, where the nearest neighbor graph is used to reflect the degree of similarity between emotion feature vectors;
[0105] Step S22412: In the nearest neighbor graph, classify the similar vectors among the emotion feature vectors according to the emotion category target tag to obtain the classification feature vector.
[0106] In this embodiment of the disclosure, when classifying emotion feature vectors based on emotion category target tags to obtain classification feature vectors, a nearest neighbor graph, i.e., a k-Nearest Neighbors (kNN) graph, can be created using the emotion feature vectors. The nearest neighbor graph describes the distance and similarity between vectors. In the nearest neighbor graph, each emotion feature vector is connected to other vectors in the graph, and the strength or distance of the connection reflects the degree of similarity between the vectors. Creating a nearest neighbor graph helps the model understand the relationships between different emotion feature vectors, i.e., which vectors are more likely to represent the same or similar emotion states.
[0107] After creating the nearest neighbor graph, similar vectors in the emotion feature vectors are classified according to the emotion category tag to obtain classification feature vectors. This can be understood as using emotion category tags (such as happiness, sadness, etc.) to classify the emotion feature vectors in the nearest neighbor graph. For example, the model classifies each vector based on the similarity between vectors and their consistency with the emotion category tag. Further, for each vector in the graph, the model finds the closest emotion category tag and classifies that vector into the corresponding category. Thus, the model learns which feature vectors are closely related to specific emotion categories, thereby obtaining classification feature vectors. These vectors not only contain the original emotion feature information but have also been organized and classified by emotion category, more clearly reflecting the feature patterns of different emotion categories, providing more accurate and structured information for subsequent emotion dimension prediction and speech synthesis.
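The two steps above can be sketched as follows, under stated assumptions: Euclidean distance is used as the similarity measure, and each vector is assigned the most common tag among its neighbours. Both choices, the toy two-dimensional vectors, and the function names are illustrative only and are not prescribed by this disclosure.

```python
import math
from collections import Counter

def build_knn_graph(vectors, k):
    """Return {i: [(distance, j), ...]} holding the k nearest neighbours of each vector."""
    graph = {}
    for i, v in enumerate(vectors):
        dists = sorted((math.dist(v, u), j)
                       for j, u in enumerate(vectors) if j != i)
        graph[i] = dists[:k]
    return graph

def classify_by_neighbours(graph, tags):
    """Assign each node the most common tag among its neighbours (assumed rule)."""
    result = {}
    for i, neighbours in graph.items():
        votes = Counter(tags[j] for _, j in neighbours)
        result[i] = votes.most_common(1)[0][0]
    return result

# Toy 2-D vectors standing in for high-dimensional emotion feature vectors.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
tags = ["sadness"] * 3 + ["happiness"] * 3

graph = build_knn_graph(vectors, k=2)
grouped = classify_by_neighbours(graph, tags)
```

In this toy data the two clusters are well separated, so every vector is grouped with the tag of its own cluster.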
[0108] In one alternative embodiment, the initial value of the emotion feature vector is determined based on a predefined set of anchor points and preset Gaussian noise.
[0109] In this embodiment of the disclosure, it is understood that each emotion category in the predefined anchor point set corresponds to a specific position in the emotion space, that is, the emotion state can be defined according to the values corresponding to pleasure (P), arousal (A) and dominance (D).
[0110] When extracting emotion feature vectors from an emotional speech dataset, the emotion feature vectors are initialized. The addition of a predefined anchor set and Gaussian noise guides the vectors towards the emotion dimension during initialization, while also increasing randomness to prevent overfitting. Specifically, during the initialization of emotion feature vectors, Gaussian noise is added to each predefined emotion anchor vector to enhance the model's robustness and generalization ability. Gaussian noise is a random perturbation whose probability distribution follows a normal distribution. By introducing Gaussian noise into the emotion anchor vectors, an initial emotion feature vector is created. These vectors are distributed around the center of the emotion anchor, but with slight differences. This difference helps the model learn the diversity and subtle variations of emotion expression during training, rather than just a single, idealized emotion representation.
[0111] Determining the initial values of the emotion feature vectors based on a predefined set of anchor points and pre-defined Gaussian noise can be understood as follows: when creating the initial values of the emotion feature vectors, the model generates an initial set of emotion feature vectors based on each predefined emotion anchor vector and the Gaussian noise added to these vectors. While these vectors are initially close to a certain emotion category, the addition of noise makes them not entirely identical, thus allowing the model to learn continuous changes in emotion. By using this initialization strategy in the early stages of training, the model can explore and learn in the emotion space from the beginning, rather than being limited to a single emotion point. This helps the model predict emotion features more accurately in subsequent training and generate more natural and diverse emotional speech.
[0112] By combining predefined emotion anchors and preset Gaussian noise, the model can not only understand the differences between emotion categories, but also handle the uncertainty of emotion expression, thereby achieving more delicate and flexible emotion control in emotional speech synthesis.
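A minimal sketch of this initialization strategy follows. It works directly in the three-dimensional anchor space, uses the example anchor coordinates given earlier in this description, and assumes a noise standard deviation of 0.01; the function names and the fixed random seed are hypothetical.

```python
import random

# Example anchor coordinates (pleasure, arousal, dominance) from this description.
ANCHORS = {"happiness": (0.8, 0.8, 0.6), "sadness": (0.2, 0.2, 0.4)}

def init_from_anchor(anchor, noise_std, rng):
    """Perturb one anchor with zero-mean Gaussian noise on every coordinate."""
    return tuple(a + rng.gauss(0.0, noise_std) for a in anchor)

def init_vectors(tags, anchors, noise_std=0.01, seed=0):
    """Create one initial vector per sample, centred on the anchor of its tag."""
    rng = random.Random(seed)
    return [init_from_anchor(anchors[tag], noise_std, rng) for tag in tags]

# Two "happiness" samples start near the same anchor but are not identical.
initial = init_vectors(["happiness", "happiness", "sadness"], ANCHORS)
```

Because each sample receives its own noise draw, samples sharing a tag start near the same anchor without collapsing onto a single point.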
[0113] In one alternative embodiment, the emotional dimension vector is a representation vector of a three-dimensional emotional space, wherein the three-dimensional emotional space includes: pleasure, arousal, and dominance.
[0114] In this embodiment of the disclosure, the emotion dimension vector is a representation vector of the three-dimensional emotion space (3-D Emotion Representation), wherein the three-dimensional emotion space includes: pleasure, arousal and dominance.
[0115] Figure 3 is a schematic diagram of the training of the emotion dimension prediction model according to an embodiment of the present disclosure. As shown in Figure 3, the present disclosure designs an emotion dimension prediction model (ED predictor) to map discrete emotions to a pleasure-arousal-dominance space. Further, the present disclosure uses an emotional speech dataset to train the emotion dimension prediction model, wherein each emotion category is assigned an anchor point to guide the training. The initial vector of the emotion feature vector is perturbed by adding Gaussian noise (e.g., θ = 0.01). Features are extracted from the emotional speech dataset using a pre-trained acoustic model to obtain acoustic feature vectors. Then, a linear layer performs linear feature transformation on the acoustic feature vectors to output emotion feature vectors. Finally, a classification layer classifies the emotion feature vectors to predict emotion labels, thereby obtaining the emotion category target tag.
[0116] This disclosure extracts a 128-dimensional emotion feature vector from the linear layer and fine-tunes the emotion dimension vector to optimize its performance on specific emotion classification tasks. Furthermore, this disclosure creates a k-nearest-neighbor (kNN) graph and integrates the category tags into the graph so that similar emotion feature vectors are clustered together, which better distinguishes different emotion categories and improves the accuracy and robustness of emotion classification.
[0117] Subsequently, this disclosure uses UMAP to optimize the 128-dimensional emotion feature vector extracted from the linear layer, preserving the data's topological structure while projecting it into a lower-dimensional space. At the same time, this disclosure further refines the emotion feature vector and the emotion dimension vector by minimizing the cross-entropy loss between the high-dimensional emotion feature vector and the emotion dimension vector, making them better suited for emotion classification tasks. For example, the cross-entropy loss can be minimized using optimization algorithms (such as gradient descent) to achieve a higher degree of matching between the emotion feature vector and the emotion dimension vector. This allows the emotion dimension prediction model to estimate the pleasure-arousal-dominance values for new speech samples, guiding acoustic predictions in the emotional speech synthesis model.
[0118] According to an embodiment of this disclosure, a model training method is also provided as shown in Figure 4. Figure 4 is a flowchart of a model training method according to an embodiment of this disclosure. As shown in Figure 4, the method includes:
[0119] Step S41: Obtain the training text, training audio, and the corresponding training text transcription of the training audio;
[0120] Step S42: Use the target emotion dimension prediction model to predict the emotion dimension of the training audio to obtain the emotion dimension vector corresponding to the training audio. The target emotion dimension prediction model is obtained by using any of the above-mentioned model training methods.
[0121] Step S43: Train the initial emotional speech synthesis model based on the training text, training audio, training text transcription, and emotion dimension vector to generate the target emotional speech synthesis model. The target emotional speech synthesis model is used to perform emotional speech synthesis on the target text, the prompt audio, and the prompt text transcription corresponding to the prompt audio to obtain the target emotional speech.
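Steps S41 and S42 can be sketched as a data preparation routine that attaches a predicted emotion dimension vector to each training sample before step S43. The `TrainingExample` structure, the `prepare_examples` function, and the stub predictor are hypothetical; in practice the predictor would be the trained target emotion dimension prediction model.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass
class TrainingExample:
    text: str                        # training text to be synthesized
    audio: Sequence[float]           # training audio (placeholder waveform samples)
    transcription: str               # transcription of the training audio
    emotion_dims: Tuple[float, ...]  # predicted (pleasure, arousal, dominance)

def prepare_examples(samples, predict_emotion_dims):
    """Step S42 sketch: run the predictor on each audio and store the result."""
    examples = []
    for text, audio, transcription in samples:
        dims = tuple(predict_emotion_dims(audio))
        examples.append(TrainingExample(text, audio, transcription, dims))
    return examples

def stub_predictor(audio):
    """Placeholder standing in for the trained predictor; returns a fixed point."""
    return (0.5, 0.5, 0.5)

samples = [("Hello there.", [0.0, 0.1, -0.1], "Nice to meet you.")]
examples = prepare_examples(samples, stub_predictor)
```

Each resulting example carries the four inputs named in step S43: training text, training audio, training text transcription, and emotion dimension vector.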
[0122] In this embodiment, training text can be understood as the text content for which speech is to be synthesized, training audio can be understood as a real speech sample containing emotional expression, and the training text transcription corresponding to the training audio is the written form of the training audio, used by the model to understand the specific content in the audio. By acquiring training text, training audio, and the training text transcription corresponding to the training audio, the necessary information for the model to learn text-to-speech conversion and emotional expression can be provided.
[0123] After obtaining the training text, the training audio, and the training text transcription corresponding to the training audio, the target emotion dimension prediction model is used to predict the emotion dimension of the training audio, resulting in an emotion dimension vector corresponding to the training audio. The target emotion dimension prediction model can be obtained using any of the aforementioned model training methods, which will not be elaborated here. It can be understood that the prediction model analyzes the acoustic features in the training audio and maps these features to an emotion dimension space (such as pleasure, arousal, and dominance), generating a corresponding emotion dimension vector. This emotion dimension vector contains the emotional details expressed in the training audio and serves as one of the important inputs for the next step of training the emotional speech synthesis model, guiding the model to learn how to generate sound based on the emotion dimension.
[0124] After obtaining the emotion dimension vector corresponding to the training audio, the initial emotional speech synthesis model is trained based on the training text, training audio, training text transcription, and emotion dimension vector to generate the target emotional speech synthesis model. The target emotional speech synthesis model is used to perform emotional speech synthesis on the target text, the prompt audio, and the prompt text transcription corresponding to the prompt audio to obtain the target emotional speech. This can be understood as the initial emotional speech synthesis model learning how to generate synthesized speech whose emotion is consistent with the training audio, based on the input training text, training audio, training text transcription, and emotion dimension vector. During training, the initial emotional speech synthesis model undergoes numerous iterations to adjust its parameters, minimizing the differences in acoustic features and emotional expression between the synthesized speech and the training audio. The final target emotional speech synthesis model is able to synthesize speech conforming to a specific emotion dimension given the target text, prompt audio, and prompt text transcription.
[0125] In this embodiment, training text, training audio, and the training text transcription corresponding to the training audio are obtained. Then, a target emotion dimension prediction model is used to predict the emotion dimension of the training audio, resulting in an emotion dimension vector corresponding to the training audio. The target emotion dimension prediction model is obtained using any of the aforementioned model training methods. Finally, an initial emotional speech synthesis model is trained based on the training text, training audio, training text transcription, and emotion dimension vector to generate a target emotional speech synthesis model. The target emotional speech synthesis model is used to perform emotional speech synthesis on the target text, the prompt audio, and the prompt text transcription corresponding to the prompt audio to obtain the target emotional speech. Thus, the emotional speech synthesis model not only learns how to convert text to speech but also learns how to finely control emotional expression during the synthesis process, enabling the synthesized speech to reflect the emotional features of the input text and audio. Therefore, even without training on emotional speech data, it can still generate rich and natural speech, effectively addressing the problems of insufficient data and model limitations in emotional speech synthesis, and improving the emotional realism and naturalness of the synthesized speech.
[0126] The model training method provided in this disclosure can be applied, but is not limited to, to application scenarios involving the training of emotional speech synthesis models in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services. For example, it can be applied to the training of relevant models for intelligent customer service or virtual assistant projects in e-commerce services, the training of relevant models in education services, the training of relevant models in legal services, etc. There are no limitations here.
[0127] By employing the embodiments of this disclosure, training text, training audio, and the training text transcription corresponding to the training audio are obtained. Then, a target emotion dimension prediction model is used to predict the emotion dimension of the training audio, resulting in an emotion dimension vector corresponding to the training audio. The target emotion dimension prediction model is obtained using any of the aforementioned model training methods. Finally, an initial emotional speech synthesis model is trained based on the training text, training audio, training text transcription, and emotion dimension vector to generate a target emotional speech synthesis model. The target emotional speech synthesis model is used to perform emotional speech synthesis on the target text, the prompt audio, and the prompt text transcription corresponding to the prompt audio to obtain the target emotional speech. This achieves the goal of obtaining the emotion dimension prediction model and the emotional speech synthesis model at low cost, so that the training of the emotion dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data. The emotion dimension prediction model improves the precision of emotion control and the naturalness of expression in complex emotion simulation of the emotional speech synthesis model, enabling the generation of high-quality and diverse emotional speech. This solves the technical problems of emotional speech synthesis systems in related technologies, such as dependence on emotional speech data, high training costs, and poor precision of emotion control and complex emotion simulation.
[0128] In an optional embodiment, the initial emotional speech synthesis model includes: an initial autoregressive language model and an initial non-autoregressive language model. In step S43, the initial emotional speech synthesis model is trained based on training text, training audio, the training text transcription corresponding to the training audio, and the emotional dimension vector corresponding to the training audio. The generation of the target emotional speech synthesis model includes the following method steps:
[0129] Step S431: Obtain the first phoneme input corresponding to the training text, the second phoneme input corresponding to the training text transcription, the phoneme feature vector corresponding to the training audio, and the acoustic feature vector corresponding to the training audio.
[0130] Step S432: The initial autoregressive language model is trained using the first phoneme input, the second phoneme input, and the phoneme feature vector to generate the target autoregressive language model.
[0131] Step S433: The initial non-autoregressive language model is trained using the first phoneme input, the second phoneme input, the predicted phoneme feature vector output by the initial autoregressive language model, the acoustic feature vector, and the emotion dimension vector to generate the target non-autoregressive language model.
[0132] Step S434: Construct a target emotion speech synthesis model based on the target autoregressive language model and the target non-autoregressive language model.
[0133] In this embodiment of the disclosure, considering that speech contains both linguistic information and paralinguistic information, in order to effectively process these two types of information, the training of the TTS system is divided into two stages. The first stage is used to convert phoneme information into detailed phonetic information, and the second stage is used to convert detailed phonetic information into acoustic features.
[0134] The initial emotional speech synthesis model includes an initial autoregressive language model (Autoregressive LM) and an initial non-autoregressive language model (Non-Autoregressive LM). The initial autoregressive language model is used to perform the first stage, and the initial non-autoregressive language model is used to perform the second stage.
[0135] The first phoneme input corresponding to the training text can be understood as the result of converting the training text into phonemes through grapheme-to-phoneme (G2P) conversion, that is, the G2P result of the input text that is fed into the initial autoregressive language model.
[0136] The second phoneme input corresponding to the training text transcription can be understood as the result of applying G2P conversion to the training text transcription, that is, the G2P result of the prompt text that is fed into the initial autoregressive language model.
[0137] The phoneme feature vector corresponding to the training audio can be understood as the phonetic tokens extracted from the training audio. The phonetic tokens contain phonetic details of the audio, such as pitch, intensity, and duration.
[0138] The acoustic feature vector corresponding to the training audio can be understood as the acoustic tokens extracted from the training audio. The acoustic tokens reflect the physical properties of speech, such as the spectrum and formants.
[0139] When training an initial emotional speech synthesis model based on training text, training audio, the corresponding training text transcription, and the corresponding emotion dimension vector, and generating a target emotional speech synthesis model, the following can be obtained: the first phoneme input corresponding to the training text, the second phoneme input corresponding to the training text transcription, the phoneme feature vector corresponding to the training audio, and the acoustic feature vector corresponding to the training audio. This can be understood as follows: by processing the training text and training text transcription, they are converted into phoneme-level representations, i.e., the first phoneme input and the second phoneme input. Then, through preprocessing, phoneme feature vectors and acoustic feature vectors are extracted from the training audio, thereby obtaining important inputs for subsequent model training.
[0140] Then, the initial autoregressive language model is trained using the first phoneme input, the second phoneme input, and the phoneme feature vector to generate the target autoregressive language model. An autoregressive language model predicts the features of the next phoneme or phonetic token from the information of the preceding ones, which helps the model generate coherent and natural speech. During training, the initial autoregressive language model learns to predict the correct phoneme features from the input phoneme information and the corresponding phoneme feature vectors, and gradually becomes the target autoregressive language model. The trained target autoregressive language model is able to generate phoneme features from phoneme sequences.
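As an illustrative sketch only, and not the training procedure of this disclosure, the next-token objective described above can be pictured as building (context, target) pairs from one training sample. The function name `next_token_pairs` and all integer identifiers are hypothetical; for each position, the context is the phoneme inputs followed by the phonetic tokens seen so far, and the target is the next phonetic token.

```python
from typing import List, Tuple


def next_token_pairs(
    phoneme_ids: List[int], phonetic_tokens: List[int]
) -> List[Tuple[List[int], int]]:
    """Build (context, next-token) pairs for autoregressive training.

    For position i, the context is the phoneme inputs plus the first i
    phonetic tokens, and the label is phonetic token i.
    """
    pairs = []
    for i, target in enumerate(phonetic_tokens):
        context = phoneme_ids + phonetic_tokens[:i]
        pairs.append((context, target))
    return pairs


# Hypothetical ids: the concatenated phoneme inputs, then three phonetic tokens.
phoneme_ids = [11, 12, 13]
phonetic_tokens = [7, 3, 9]
pairs = next_token_pairs(phoneme_ids, phonetic_tokens)
for context, target in pairs:
    print(context, "->", target)
```

Each pair asks the model to predict one token given everything before it, which is why generation at inference time proceeds one token at a time.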
[0141] Subsequently, the initial non-autoregressive language model is trained using the first phoneme input, the second phoneme input, the predicted phonetic tokens output by the initial autoregressive language model, the acoustic feature vector, and the sentiment dimension vector, to generate the target non-autoregressive language model. This can be understood as training the initial non-autoregressive language model using the first phoneme input, the second phoneme input, the predicted phonetic tokens, the acoustic feature vector, and the sentiment dimension vector. The non-autoregressive language model can generate acoustic features for all phonemes in parallel without relying on information from the previous phoneme. During training, the initial non-autoregressive language model needs to learn how to simultaneously consider phoneme input, predicted phonetic tokens, acoustic features, and the sentiment dimension vector to generate acoustic features consistent with the input text and audio sentiment, thus gradually generating the target non-autoregressive language model. The trained target non-autoregressive language model can quickly and in parallel generate high-quality acoustic features while considering the sentiment dimension.
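The parallel behaviour described above can be sketched, under strong simplifying assumptions, as follows. The function `predict_frame` is a hypothetical stand-in for the learned model, and each position here depends only on its own phonetic token and the shared emotion dimension vector; a real non-autoregressive model would typically also look at the whole input sequence, but it still would not feed its own outputs back in.

```python
from typing import List, Sequence


def predict_frame(phonetic_token: int, emotion: Sequence[float]) -> float:
    """Hypothetical stand-in for the learned per-position predictor."""
    return phonetic_token + sum(emotion)


def non_autoregressive_decode(
    phonetic_tokens: Sequence[int], emotion: Sequence[float]
) -> List[float]:
    """Predict one acoustic value per position without feeding outputs back in."""
    return [predict_frame(token, emotion) for token in phonetic_tokens]


emotion_vector = (0.5, 0.25, 0.25)  # hypothetical (pleasure, arousal, dominance)
tokens = [3, 1, 2]
acoustic = non_autoregressive_decode(tokens, emotion_vector)
print(acoustic)

# No output is fed back as input, so reordering the inputs simply reorders the outputs.
reordered = non_autoregressive_decode(tokens[::-1], emotion_vector)
print(reordered == acoustic[::-1])
```

Because no position waits for a previous output, all positions could be computed at once, which is the source of the speed advantage mentioned in the text.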
[0142] Finally, the target emotional speech synthesis model is constructed based on the target autoregressive language model and the target non-autoregressive language model, that is, the two models are integrated to build the target emotional speech synthesis model. The target autoregressive language model is responsible for generating the predicted phoneme feature vectors from the phoneme sequences, while the target non-autoregressive language model is responsible for generating acoustic features based on those predictions and the emotion dimension vector. After the two models are integrated, the target emotional speech synthesis model can receive text input and emotion control signals and generate speech waveforms that conform to both the text content and a specific emotional style. Through this dual-decoder architecture, the target emotional speech synthesis model is optimized for both emotion control and speech synthesis, enabling more efficient and natural generation of emotional speech.
[0143] In an optional embodiment, step S431, obtaining the phoneme feature vector corresponding to the training audio, includes the following method steps:
[0144] Step S4311: Use a pre-trained acoustic model to extract features from the training audio to obtain acoustic feature vectors;
[0145] Step S4312: The acoustic feature vector is transformed linearly using a linear module to obtain the phoneme feature vector.
[0146] In this embodiment of the disclosure, when obtaining the phoneme feature vector corresponding to the training audio, a pre-trained acoustic model can be used to extract features from the training audio to obtain an acoustic feature vector. This can be understood as inputting the training audio into the pre-trained acoustic model, which then extracts features from the training audio. For example, by utilizing its internal complex architecture such as convolutional layers and attention mechanisms, the audio signal is converted into a series of numerical features, which together constitute the acoustic feature vector. The acoustic feature vector contains the physical properties of the audio and can be used for subsequent speech synthesis and emotion prediction tasks.
[0147] After obtaining the acoustic feature vectors, a linear module is used to perform a linear feature transformation on them, resulting in phoneme feature vectors. This can be understood as using a linear module to map the acoustic feature vectors to a more suitable feature space for speech synthesis—the phoneme feature space—through linear transformations (such as matrix multiplication and addition), thus obtaining the phoneme feature vectors. Since phoneme feature vectors more directly correspond to the needs of the speech synthesis model, they can help the model more accurately predict the specific pronunciation details of each phoneme. Through the linear module, the acoustic feature vectors are simplified while retaining information crucial for speech synthesis, such as the pitch, intensity, and duration of phonemes. This transformation allows the phoneme feature vectors to be better integrated with phoneme inputs and other model inputs (such as the emotion dimension vector), thereby guiding the speech synthesis model to generate high-quality, emotional speech.
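The linear feature transformation mentioned above (matrix multiplication followed by addition) can be written out as a minimal sketch. The dimensions, weights, and bias below are hypothetical values chosen for illustration; in practice they would be learned parameters of the linear module.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear_transform(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute y = W x + b for a single feature vector."""
    if len(weight) != len(bias):
        raise ValueError("weight must have one row per bias entry")
    output = []
    for row, b in zip(weight, bias):
        if len(row) != len(x):
            raise ValueError("each weight row must match the input length")
        output.append(sum(w * xi for w, xi in zip(row, x)) + b)
    return output


# Hypothetical 3-dimensional acoustic feature mapped to a 2-dimensional phoneme feature.
acoustic_feature = [1.0, 2.0, 3.0]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]
bias = [0.5, -1.0]
phoneme_feature = linear_transform(acoustic_feature, weight, bias)
print(phoneme_feature)
```

The output has fewer dimensions than the input here, which mirrors the statement that the acoustic feature vectors are simplified while the information relevant to synthesis is retained.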
[0148] In this disclosure, a pre-trained acoustic model and a linear module work together on the training audio to extract acoustic features from the raw audio signal. These features are then transformed into phoneme feature vectors that are more suitable for speech synthesis. This process not only reduces the reliance on emotional speech data but also enables the model to learn and mimic complex emotional expressions more effectively, ultimately synthesizing speech that is highly natural, emotionally rich, and consistent with the input text content.
[0149] As can be seen, this disclosure treats text-to-speech (TTS) as a conditional language modeling task, in which speech waveforms are quantized into discrete symbols and then modeled using a language model. However, training a language model typically requires a large amount of data to achieve stable performance, which is impractical for emotional speech. Speech emotion, as a form of speech expressiveness, can be represented by continuous emotion dimensions that can be learned from expressive speech. The goal of this disclosure is to develop a TTS model that learns from expressive speech data during training and synthesizes speech capable of conveying specific emotions. Figure 5 is a schematic diagram of the framework of a speech synthesis system according to an embodiment of this disclosure. As shown in Figure 5, the TTS system includes an autoregressive language model (i.e., an autoregressive decoder) and a non-autoregressive language model (i.e., a non-autoregressive decoder). The autoregressive decoder predicts phonetic tokens from the phoneme inputs (i.e., the first phoneme input and the second phoneme input) and the prompt audio, while the non-autoregressive decoder uses the emotion dimension embedding to predict acoustic tokens (audio codec codes) that carry fine acoustic details. Finally, the audio decoder synthesizes the speech waveform from these acoustic tokens.
[0150] Furthermore, in the first stage, the autoregressive language model learns to map phoneme sequences to phonetic tokens. This disclosure uses acoustic features quantized by K-means clustering as phonetically rich targets to train the autoregressive language model. In the second stage, the TTS system focuses on acoustic modeling guided by the emotion dimensions: the non-autoregressive language model is trained to recover fine acoustic details from the phonetic tokens predicted by the autoregressive language model. The emotion dimension vector from the pre-trained emotion dimension prediction model, the predicted phonetic tokens, and the acoustic tokens produced by the audio codec jointly condition the training of the non-autoregressive language model. During TTS system training, the autoregressive language model focuses solely on language modeling, while the non-autoregressive language model learns fine-grained acoustic details from the prompt audio and emphasizes consistency between the generated speech and the prompt audio in terms of pleasure, arousal, and dominance.
[0151] During the inference phase of the TTS system, given the input text, the prompt audio, and the transcription of the prompt audio (which play the roles that the training text, the training audio, and the training text transcription play during training), the autoregressive language model predicts phonetic tokens that represent how the speech varies across phonemes. Subsequently, the non-autoregressive language model predicts acoustic tokens based on the emotion dimension vector, the acoustic prompt, and the phoneme sequences. The emotion dimension vector can either be predicted from the prompt audio using the pre-trained emotion dimension prediction model (i.e., ED prediction) or be specified manually by the user (i.e., ED control). Finally, the audio decoder synthesizes the speech waveform from the predicted acoustic tokens.
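To illustrate how K-means clustering can turn continuous features into discrete targets, the following is a minimal sketch using plain Python. The two-dimensional frames, the choice of two clusters, and the initial centroids are hypothetical; the disclosure does not specify the feature dimension, the number of clusters, or the initialization.

```python
from typing import List, Sequence

Frame = Sequence[float]


def squared_distance(a: Frame, b: Frame) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_tokens(frames: Sequence[Frame], centroids: Sequence[Frame]) -> List[int]:
    """Map each frame to the index of its nearest centroid (its discrete token)."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(frame, centroids[k]))
        for frame in frames
    ]


def kmeans(
    frames: Sequence[Frame], initial_centroids: Sequence[Frame], iterations: int = 5
) -> List[List[float]]:
    """Run a fixed number of Lloyd iterations from the given starting centroids."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        tokens = assign_tokens(frames, centroids)
        for k in range(len(centroids)):
            members = [f for f, t in zip(frames, tokens) if t == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Hypothetical frame-level features forming two well-separated groups.
frames = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
centroids = kmeans(frames, initial_centroids=[frames[0], frames[2]])
tokens = assign_tokens(frames, centroids)
print(tokens)
```

The resulting list of cluster indices is the kind of discrete sequence a language model can be trained to predict.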
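The choice between ED prediction and ED control can be sketched as a small selection function. The name `resolve_emotion_vector`, the three-value tuple layout, and the stub predictor are assumptions made for illustration; the disclosure does not define a programming interface or a value range for the dimensions.

```python
from typing import Callable, Optional, Sequence, Tuple

EmotionVector = Tuple[float, float, float]  # assumed order: pleasure, arousal, dominance


def resolve_emotion_vector(
    prompt_audio: Sequence[float],
    predictor: Callable[[Sequence[float]], EmotionVector],
    user_vector: Optional[EmotionVector] = None,
) -> EmotionVector:
    """Return the user-specified vector (ED control) or a predicted one (ED prediction)."""
    vector = user_vector if user_vector is not None else predictor(prompt_audio)
    if len(vector) != 3:
        raise ValueError("expected three emotion dimensions")
    return vector


def stub_predictor(audio: Sequence[float]) -> EmotionVector:
    """Hypothetical stand-in for the pre-trained emotion dimension prediction model."""
    return (0.1, 0.2, 0.3)


prompt_audio = [0.0, 0.01, -0.02]  # hypothetical waveform samples
predicted = resolve_emotion_vector(prompt_audio, stub_predictor)
controlled = resolve_emotion_vector(prompt_audio, stub_predictor, user_vector=(0.9, 0.8, 0.1))
print(predicted, controlled)
```

Keeping the two paths behind one function means the rest of the synthesis pipeline receives an emotion dimension vector in the same form regardless of where it came from.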
[0152] In addition to the model training methods described above, this disclosure can also adopt label-based emotion control, that is, training and generation that directly use predefined emotion labels (such as happiness, sadness, and anger), in which the model learns the relationship between the labels and the speech data in order to synthesize speech with the corresponding emotion. Alternatively, it can adopt an emotion synthesis model based on a generative adversarial network (GAN), in which the generator and the discriminator are trained adversarially so that more emotionally expressive, natural speech is generated. It can also adopt emotion feature extraction based on self-supervised learning (SSL), in which an SSL model extracts speech features that are then used for emotion prediction and generation; because the SSL model is trained on unlabeled data, the model can generalize to different emotional expressions. No limitation is imposed here.
[0153] As can be seen, this disclosure designs an emotion dimension prediction model that maps discrete emotions to a pleasure-arousal-dominance space, helping the model to achieve fine-grained and continuous control over emotion. This multi-dimensional emotion mapping allows emotional expression to move beyond a single label and be controlled flexibly through numerical dimensions. Therefore, during speech generation, various emotional expressions can be achieved by adjusting the emotion dimensions, resulting in more nuanced and natural speech. In addition, this disclosure introduces a pre-trained acoustic model for feature extraction to obtain rich emotion features. Even without emotion labels, a self-supervised learning model can capture emotion features in speech, and these features are further refined by combining them with emotion category labels. This eliminates the need for large amounts of emotion-labeled data to capture emotional details in speech, reducing data dependence while preserving the naturalness and diversity of emotional expression. Furthermore, this disclosure designs a dual-decoder structure consisting of an autoregressive decoder and a non-autoregressive decoder. The autoregressive decoder predicts phonetic tokens from the phoneme inputs and the prompt audio, while the non-autoregressive decoder generates fine acoustic features based on the emotion dimension embedding. This architecture generates natural and fluent speech while controlling emotional expression, further enhancing the emotional expressiveness and naturalness of the speech.
[0154] Inspired by the success of language models in various generative tasks, current research largely prioritizes data collection, aiming to advance the technology by expanding the training data. In contrast, this disclosure explores how to leverage early findings in emotion theory to advance emotional TTS research. This disclosure proposes a TTS framework capable of controlling three emotion dimensions (pleasure, arousal, and dominance) at runtime. This framework not only synthesizes multiple emotional styles without requiring training on emotional speech data, but also achieves naturalness and prosodic expression consistent with the speech prompt in zero-shot scenarios.
[0155] It is easy to understand that the beneficial effects of the model training method provided in this disclosure include the following points.
[0156] Beneficial effect (1): The emotional attribute predictor is trained using category labels as targets and a self-supervised learning feature reduction method, so this disclosure does not require a large amount of emotional speech data during training. This reduces the cost of data collection and the difficulty of labeling, and removes data obstacles to the widespread application of emotional speech synthesis technology.
[0157] Beneficial effect (2): By introducing the emotional control dimension (such as pleasure, arousal and dominance), this disclosure can generate speech with complex emotional style, providing diversity and subtlety of emotional expression, making the synthesized speech closer to human natural expression, and enhancing the expressiveness and naturalness of speech synthesis.
[0158] Beneficial effect (3): The autoregressive language model converts text into phonetic tokens, and the non-autoregressive language model predicts acoustic details in parallel. This accelerates the model training process, allows high-quality emotional speech synthesis with low computing resources, and reduces the time and cost of model training.
[0159] Beneficial effect (4): Since this disclosure does not rely on specific emotional speech data, it has strong versatility and can be quickly applied to the emotional synthesis needs of different languages, as well as various application scenarios, such as intelligent customer service, virtual assistant, education, entertainment, etc., thus expanding the scope of application of emotional speech synthesis technology.
[0160] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation portals are provided for users to choose to authorize or refuse.
[0161] Furthermore, it should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this disclosure is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this disclosure. Secondly, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this disclosure.
[0162] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, they can also be implemented by hardware. Based on this understanding, the technical solutions of this disclosure, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of this disclosure.
[0163] According to an embodiment of this disclosure, an audio processing method as shown in FIG. 6 is also provided. FIG. 6 is a flowchart of an audio processing method according to an embodiment of this disclosure. As shown in FIG. 6, the method includes:
[0164] Step S61: Obtain the target text, the prompt audio, and the transcribed prompt text corresponding to the prompt audio;
[0165] Step S62: Use the target emotional speech synthesis model to synthesize emotional speech from the target text, the prompt audio, and the transcribed prompt text to obtain the target emotional speech; wherein, the target emotional speech synthesis model is obtained by using any of the above-mentioned model training methods.
[0166] In this embodiment of the disclosure, the target text can be understood as the text content to be synthesized into speech. The prompt audio can be understood as reference audio, which typically has a specific emotional style and is used to guide the synthesized speech to be emotionally similar to it. The prompt text transcription corresponding to the prompt audio can be understood as a text version of the prompt audio, used to help the model understand the semantic content of the prompt audio.
[0167] The target text, the prompt audio, and the corresponding prompt text transcription are acquired, and the target emotional speech synthesis model then synthesizes emotional speech from them to obtain the target emotional speech. The target emotional speech synthesis model is obtained using any of the model training methods described above, which will not be elaborated here. The target emotional speech can be understood as a piece of speech that conveys the content of the target text and carries the same or similar emotion as the prompt audio. It closely resembles genuine emotional expression, which enhances the naturalness and expressiveness of the synthesized speech.
[0168] The audio processing method provided in this disclosure can be applied to, but is not limited to, speech synthesis scenarios in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services, for example, speech synthesis for e-commerce services, for education services, or for legal services. No limitation is imposed here.
[0169] By employing the embodiments of this disclosure, the target text, the prompt audio, and the prompt text transcription corresponding to the prompt audio are obtained, and a target emotional speech synthesis model, obtained using any of the aforementioned model training methods, synthesizes emotional speech from them to obtain the target emotional speech. This achieves the goal of generating high-quality, diverse emotional speech, while the training of the emotion dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data. The emotion dimension prediction model improves the precision of emotion control and the naturalness of complex emotion simulation in the emotional speech synthesis model. This solves the technical problems in the related art, in which emotional speech synthesis systems rely on emotional speech data, have high training costs, and exhibit poor precision in emotion control and complex emotion simulation.
[0170] In an optional embodiment, the target emotional speech synthesis model includes a target autoregressive language model, a target non-autoregressive language model, and a target audio decoder. In step S62, using the target emotional speech synthesis model to synthesize emotional speech from the target text, the prompt audio, and the transcribed prompt text to obtain the target emotional speech includes the following method steps:
[0171] Step S621: Obtain the third phoneme input corresponding to the target text, the fourth phoneme input corresponding to the transcription of the prompt text, the emotional dimension vector corresponding to the prompt audio, the phoneme feature vector corresponding to the prompt audio, and the acoustic feature vector corresponding to the prompt audio.
[0172] Step S622: Input the third phoneme input, the fourth phoneme input, and the phoneme feature vector into the target autoregressive language model, and output the predicted phoneme feature vector;
[0173] Step S623: Input the third phoneme input, the fourth phoneme input, the predicted phoneme feature vector, the acoustic feature vector, and the emotion dimension vector into the target non-autoregressive language model, and output the predicted acoustic feature vector.
[0174] Step S624: The predicted acoustic feature vector is decoded using a target audio decoder to obtain the target emotional speech.
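The data flow of steps S621 to S624 can be summarized in a minimal sketch. Every component below is passed in as a plain callable, and all names (`EmotionalTTS`, `g2p`, `ar_model`, and so on) as well as the stub implementations are hypothetical; the sketch only shows which inputs reach which stage, not how any model is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class EmotionalTTS:
    g2p: Callable[[str], List[str]]
    emotion_predictor: Callable[[Any], Any]
    phoneme_feature_extractor: Callable[[Any], Any]
    acoustic_feature_extractor: Callable[[Any], Any]
    ar_model: Callable[..., Any]
    nar_model: Callable[..., Any]
    audio_decoder: Callable[[Any], Any]

    def synthesize(self, target_text: str, prompt_audio: Any, prompt_transcript: str) -> Any:
        # S621: gather phoneme inputs, emotion dimension vector, and prompt features.
        third = self.g2p(target_text)
        fourth = self.g2p(prompt_transcript)
        emotion = self.emotion_predictor(prompt_audio)
        phoneme_feat = self.phoneme_feature_extractor(prompt_audio)
        acoustic_feat = self.acoustic_feature_extractor(prompt_audio)
        # S622: the autoregressive model predicts phoneme features.
        predicted_phoneme = self.ar_model(third, fourth, phoneme_feat)
        # S623: the non-autoregressive model predicts acoustic features.
        predicted_acoustic = self.nar_model(
            third, fourth, predicted_phoneme, acoustic_feat, emotion
        )
        # S624: the audio decoder turns acoustic features into a waveform.
        return self.audio_decoder(predicted_acoustic)


# Trivial stubs so the sketch runs end to end; they carry no modelling meaning.
tts = EmotionalTTS(
    g2p=lambda text: list(text),
    emotion_predictor=lambda audio: (0.5, 0.5, 0.5),
    phoneme_feature_extractor=lambda audio: [1, 2],
    acoustic_feature_extractor=lambda audio: [10, 20],
    ar_model=lambda p3, p4, feat: feat + [len(p3)],
    nar_model=lambda p3, p4, pred, acoustic, emotion: [x * 2 for x in pred],
    audio_decoder=lambda tokens: [float(t) for t in tokens],
)
waveform = tts.synthesize("hi", prompt_audio=[0.0], prompt_transcript="yo")
print(waveform)
```

Passing the stages in as callables keeps the sketch independent of any particular model library and makes the ordering of the four steps explicit.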
[0175] In this embodiment of the disclosure, the target emotional speech synthesis model includes a target autoregressive language model, a target non-autoregressive language model, and a target audio decoder (i.e., the codec decoder that converts acoustic tokens into a waveform). These components are as described in the foregoing embodiments and will not be described again here.
[0176] When the target emotional speech synthesis model is used to synthesize emotional speech from the target text, the prompt audio, and the prompt text transcription, the following are first obtained: the third phoneme input corresponding to the target text, the fourth phoneme input corresponding to the prompt text transcription, the emotion dimension vector corresponding to the prompt audio, the phoneme feature vector corresponding to the prompt audio, and the acoustic feature vector corresponding to the prompt audio. That is, the third phoneme input is obtained by performing G2P conversion on the target text, the fourth phoneme input is obtained by performing G2P conversion on the prompt text transcription, the emotion dimension vector is output by the emotion dimension prediction model for the prompt audio, and the phoneme feature vector and the acoustic feature vector are obtained by feature extraction from the prompt audio using the pre-trained acoustic model.
[0177] Then, the third phoneme input, the fourth phoneme input, and the phoneme feature vector are input into the target autoregressive language model, which outputs the predicted phoneme feature vector. This can be understood as the target autoregressive language model combining the third phoneme input, the fourth phoneme input, and the phoneme feature vector to predict and output the phoneme feature vector of the target text, i.e., predicting how phoneme features are expressed under the influence of emotion.
[0178] The third phoneme input, the fourth phoneme input, the predicted phoneme feature vector, the acoustic feature vector, and the emotion dimension vector are then input into the target non-autoregressive language model, which outputs the predicted acoustic feature vector (i.e., the predicted acoustic tokens). This can be understood as the target non-autoregressive language model using all of the above information to predict and refine acoustic features so that they reflect emotional changes; the predicted acoustic feature vector contains the acoustic details needed to generate emotional speech.
[0179] Finally, a target audio decoder is used to decode the predicted acoustic feature vectors to obtain the target emotional speech. This can be understood as the target audio decoder receiving the predicted acoustic feature vectors output from the target non-autoregressive language model and converting them into an audio waveform signal, i.e., the target emotional speech. This decoding process transforms acoustic details into an audible speech signal, ensuring that the speech is highly consistent with the prompt audio in terms of emotional expression, while accurately reflecting the target text in terms of linguistic content.
[0180] It should be noted that, for preferred implementations of this embodiment, reference may be made to the relevant descriptions in the foregoing embodiments, which will not be repeated here.
[0181] According to an embodiment of this disclosure, an audio processing method is also provided, as shown in FIG. 7. FIG. 7 is a flowchart of an audio processing method according to an embodiment of this disclosure. As shown in FIG. 7, the method includes:
[0182] Step S71: Obtain the input text to be converted to virtual customer service voice, the virtual customer service prompt audio, and the corresponding prompt text transcribed from the virtual customer service prompt audio;
[0183] Step S72: The target emotional speech synthesis model is used to synthesize emotional speech from the input text, the virtual customer service prompt audio, and the transcribed prompt text to obtain the virtual customer service emotional speech; wherein, the target emotional speech synthesis model is obtained by using any of the above-mentioned model training methods.
[0184] In this embodiment of the disclosure, in the virtual customer service application scenario, the input text of the virtual customer service voice can be understood as the text that is to be converted into virtual customer service voice, such as the standard greeting in the customer service scenario, "Hello, welcome to our customer service center, how can I help you?", which is not limited here.
[0185] The virtual customer service prompt audio can be understood as an audio sample that represents the specific emotional style of the virtual customer service agent. For example, to generate a friendly and enthusiastic virtual customer service voice, an audio clip with the corresponding emotion can be used as the prompt.
[0186] The prompt text transcription corresponding to the virtual customer service prompt audio can be understood as the text version of the virtual customer service prompt audio, for example, "Hello, I'm happy to assist you!". The transcription is semantically consistent with the prompt audio, which is important for ensuring that the synthesized speech is semantically consistent with the input text. By comparing the input text with the transcription, the model can generate speech that matches the intended content and carries the appropriate emotion.
[0187] In this embodiment, the input text to be converted into virtual customer service voice, the virtual customer service prompt audio, and the corresponding transcribed prompt text are obtained. Then, a target emotional speech synthesis model is used to synthesize emotional speech from the input text, the virtual customer service prompt audio, and the transcribed prompt text, resulting in virtual customer service emotional speech. The target emotional speech synthesis model is obtained using any of the model training methods described above, which will not be elaborated here. Therefore, the model can accurately convert the input text into virtual customer service voice with a specific emotional style. This voice is not only accurate in content but also simulates human emotional expression, thereby enhancing the interactivity and user experience of the virtual customer service. This is of great significance for improving the naturalness, emotional responsiveness, and user satisfaction of intelligent customer service systems.
[0188] The audio processing method provided in this disclosure can also be applied to, but is not limited to, speech synthesis applications in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services, for example, speech synthesis for e-commerce services, speech synthesis for education services, and speech synthesis for legal services, which is not limited here.
[0189] By employing the embodiments of this disclosure, the input text to be converted into virtual customer service voice, the virtual customer service prompt audio, and the corresponding prompt text transcription are obtained. Then, a target emotional speech synthesis model is used to synthesize emotional speech from the input text, the virtual customer service prompt audio, and the prompt text transcription, thereby obtaining virtual customer service emotional speech. The target emotional speech synthesis model is obtained using any of the model training methods described above. This achieves the goal of generating high-quality and diverse emotional speech, thus realizing that the model training of the emotional dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data. The emotional dimension prediction model improves the precision of emotional control and the naturalness of expression in complex emotional simulation of the emotional speech synthesis model, enabling the generation of high-quality and diverse emotional speech. This solves the technical problems of emotional speech synthesis systems in related technologies, such as dependence on emotional speech data, high training costs, and poor precision of emotional control and complex emotional simulation.
[0190] It should be noted that the preferred implementation of this embodiment can be found in the relevant descriptions in the embodiments, and will not be repeated here.
[0191] According to embodiments of this disclosure, an audio processing method is also provided, as shown in FIG8. FIG8 is a flowchart of an audio processing method according to an embodiment of this disclosure. As shown in FIG8, the method includes:
[0192] Step S81: Obtain an audio processing request through the first application programming interface, wherein the request data carried in the audio processing request includes: target text, prompt audio, and the transcribed prompt text corresponding to the prompt audio;
[0193] Step S82: Return an audio processing response through the second application programming interface. The response data carried in the audio processing response includes: target emotional speech, which is obtained by synthesizing emotional speech from the target text, the prompt audio, and the transcribed prompt text using a target emotional speech synthesis model. The target emotional speech synthesis model is obtained by using any of the above-mentioned model training methods.
[0194] The first and second application programming interfaces (APIs) mentioned above can be the same or different APIs. In one optional embodiment, the interface parameters in the first and second APIs may include, but are not limited to: a global interface identifier, an interface signing key, an interface timestamp, an interface request identifier, and a system call credential identifier. The first API can use GET or POST as the interface request method to obtain the audio processing request. The second API can use JSON format to return the audio processing response.
[0195] In this embodiment of the disclosure, the audio processing request can be understood as sending a request to the emotional speech synthesis service. The request includes the target text, the prompt audio, and the transcribed prompt text, instructing the service to convert the target text into speech with the emotional style of the prompt audio.
[0196] An audio processing response can be understood as the response content to an audio processing request. The response contains the target emotional speech obtained after emotional speech synthesis.
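As a rough illustration of the request and response described above, the following Python sketch serializes a request carrying the target text, the prompt audio, and the transcribed prompt text, and wraps the synthesized result in a JSON response. The field names (`request_id`, `timestamp`, `target_text`, `prompt_audio_b64`, `prompt_transcript`, `target_emotional_speech_b64`), the base64 encoding of audio, and the `synthesize` stub are assumptions made for illustration; the disclosure only states that the response may be returned in JSON format.

```python
import base64
import json
import time
import uuid

REQUIRED_FIELDS = ("target_text", "prompt_audio_b64", "prompt_transcript")


def build_request(target_text: str, prompt_audio: bytes, prompt_transcript: str) -> str:
    """Serialize an audio processing request (hypothetical field names)."""
    payload = {
        "request_id": str(uuid.uuid4()),   # stands in for the interface request identifier
        "timestamp": int(time.time()),     # stands in for the interface timestamp
        "target_text": target_text,
        "prompt_audio_b64": base64.b64encode(prompt_audio).decode("ascii"),
        "prompt_transcript": prompt_transcript,
    }
    return json.dumps(payload)


def synthesize(target_text: str, prompt_audio: bytes, prompt_transcript: str) -> bytes:
    """Placeholder for the target emotional speech synthesis model (assumption)."""
    # A real model would return waveform bytes; this stub returns marker bytes.
    return b"FAKEWAV:" + target_text.encode("utf-8")


def handle_request(request_body: str) -> str:
    """Parse the request, run synthesis, and return a JSON response."""
    req = json.loads(request_body)
    missing = [key for key in REQUIRED_FIELDS if key not in req]
    if missing:
        return json.dumps({"request_id": req.get("request_id"), "error": f"missing fields: {missing}"})
    speech = synthesize(
        req["target_text"],
        base64.b64decode(req["prompt_audio_b64"]),
        req["prompt_transcript"],
    )
    return json.dumps({
        "request_id": req.get("request_id"),
        "target_emotional_speech_b64": base64.b64encode(speech).decode("ascii"),
    })
```

In this sketch the same handler plays the role of both interfaces; the first and second APIs could equally be separate endpoints.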
[0197] In this embodiment, an audio processing request is obtained through a first application programming interface (API). The request data carried in the audio processing request includes: target text, prompt audio, and a corresponding transcribed prompt text. Then, an audio processing response is returned through a second API. The response data carried in the audio processing response includes: target emotional speech. The target emotional speech is obtained by synthesizing emotional speech from the target text, prompt audio, and transcribed prompt text using a target emotional speech synthesis model. The target emotional speech synthesis model is obtained using any of the aforementioned model training methods, which will not be elaborated here. Therefore, it is possible to generate target emotional speech that not only matches the target text in content but also maintains consistency with the prompt audio in emotional expression and sound quality.
[0198] The audio processing method provided in this disclosure can also be applied to, but is not limited to, speech synthesis applications in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services, for example, speech synthesis for e-commerce services, speech synthesis for education services, and speech synthesis for legal services, which is not limited here.
[0199] In this embodiment of the present disclosure, an audio processing request is obtained through a first application programming interface (API). The request data carried in the audio processing request includes: target text, prompt audio, and the corresponding prompt text transcription. Then, an audio processing response is returned through a second API. The response data carried in the audio processing response includes: target emotional speech. The target emotional speech is obtained by synthesizing the target text, prompt audio, and prompt text transcription using a target emotional speech synthesis model. The target emotional speech synthesis model is obtained using any of the above-mentioned model training methods. This achieves the goal of generating high-quality and diverse emotional speech, thereby realizing that the model training of the emotional dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data. The emotional dimension prediction model improves the precision of emotional control and the naturalness of expression in complex emotional simulation of the emotional speech synthesis model, enabling the generation of high-quality and diverse emotional speech. This solves the technical problems of emotional speech synthesis systems in related technologies, such as dependence on emotional speech data, high training costs, and poor precision of emotional control and complex emotional simulation.
[0200] It should be noted that the preferred implementation of this embodiment can be found in the relevant descriptions in the embodiments, and will not be repeated here.
[0201] According to an embodiment of this disclosure, an audio processing method is also provided, as shown in FIG9. FIG9 is a flowchart of an audio processing method according to an embodiment of this disclosure. As shown in FIG9, the method includes:
[0202] Step S91: Obtain the currently input audio processing dialogue request, wherein the request data carried in the audio processing dialogue request includes: target text, prompt audio, and the transcribed prompt text corresponding to the prompt audio;
[0203] Step S92, in response to the audio processing dialogue request, return an audio processing dialogue response, wherein the information carried in the audio processing dialogue response includes: target emotional speech, which is obtained by synthesizing emotional speech from the target text, the prompt audio and the transcribed prompt text using a target emotional speech synthesis model, and the target emotional speech synthesis model is obtained by using any of the above-mentioned model training methods;
[0204] Step S93: Play the target emotional voice through a speaker.
[0205] In this embodiment of the disclosure, the audio processing dialogue request can be understood as an audio processing request received from the user when the dialogue system (such as intelligent customer service or a voice assistant) interacts with the user. The request includes the text information (target text) that the user wishes to convert into speech with a specific emotion, an audio clip (prompt audio) used as an emotional style reference, and a textual transcription of that audio (the prompt text transcription corresponding to the prompt audio). This request data provides the necessary input for emotional speech synthesis, enabling the system to understand and execute the user's needs and generate a speech response that conforms to a specific emotion.
[0206] Audio processing dialogue responses can be understood as follows: upon receiving an audio processing dialogue request, the system responds by using a target emotional speech synthesis model to process the request data and generate an emotional speech response. This process involves taking the target text, the prompt audio, and its text transcription as input, and using a trained emotional speech synthesis model to perform an emotional speech synthesis task. The resulting target emotional speech matches the emotional style in the user's request, enabling a more natural and appropriate response to the user.
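Steps S91 to S93 can be sketched as a single dialogue turn. In the following Python sketch, the `DialogueRequest` and `DialogueResponse` types, the `synthesize` callable (standing in for the target emotional speech synthesis model), and the `play` callable (standing in for the speaker) are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialogueRequest:
    target_text: str        # text to be converted into emotional speech
    prompt_audio: bytes     # audio clip used as the emotional style reference
    prompt_transcript: str  # transcription of the prompt audio


@dataclass
class DialogueResponse:
    target_emotional_speech: bytes


def handle_dialogue_turn(
    request: DialogueRequest,
    synthesize: Callable[[str, bytes, str], bytes],
    play: Callable[[bytes], None],
) -> DialogueResponse:
    """Run one dialogue turn: S91 (request received), S92 (synthesize), S93 (play)."""
    # S92: synthesize emotional speech from the three request fields.
    speech = synthesize(request.target_text, request.prompt_audio, request.prompt_transcript)
    response = DialogueResponse(target_emotional_speech=speech)
    # S93: hand the synthesized speech to the speaker device.
    play(response.target_emotional_speech)
    return response
```

Passing the model and the speaker as callables keeps the sketch independent of any particular audio backend.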
[0207] In this embodiment, an audio processing dialogue request is obtained, wherein the request data carried in the audio processing dialogue request includes: target text, prompt audio, and the corresponding prompt text transcription. Then, in response to the audio processing dialogue request, an audio processing dialogue response is returned, wherein the information carried in the audio processing dialogue response includes: target emotional speech. The target emotional speech is obtained by synthesizing emotional speech from the target text, prompt audio, and prompt text transcription using a target emotional speech synthesis model. The target emotional speech synthesis model is obtained using any of the above-mentioned model training methods, which will not be elaborated here. Finally, the target emotional speech is played through a speaker device to provide feedback to the user, thereby significantly improving the naturalness of virtual dialogue and user experience, especially in situations requiring emotional communication, such as customer service and emotional support systems.
[0208] The audio processing method provided in this disclosure can also be applied to, but is not limited to, speech synthesis applications in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services, for example, speech synthesis for e-commerce services, speech synthesis for education services, and speech synthesis for legal services, which is not limited here.
[0209] By employing the embodiments of this disclosure, the currently input audio processing dialogue request is obtained. The request data carried in the audio processing dialogue request includes: target text, prompt audio, and the corresponding prompt text transcription. Then, in response to the audio processing dialogue request, an audio processing dialogue response is returned. The information carried in the audio processing dialogue response includes: target emotional speech. The target emotional speech is obtained by synthesizing emotional speech from the target text, prompt audio, and prompt text transcription using a target emotional speech synthesis model. The target emotional speech synthesis model is obtained using any of the above-mentioned model training methods, which will not be elaborated here. Finally, the target emotional speech is played through a speaker to provide feedback to the user. This achieves the goal of generating high-quality and diverse emotional speech, thereby realizing that the model training of the emotional dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data. The emotional dimension prediction model improves the precision of emotional control and the naturalness of expression in complex emotional simulation of the emotional speech synthesis model, enabling the generation of high-quality and diverse emotional speech. This solves the technical problems of emotional speech synthesis systems in related technologies, such as dependence on emotional speech data, high training costs, and poor precision of emotional control and complex emotional simulation.
[0210] It should be noted that the preferred implementation of this embodiment can be found in the relevant descriptions in the embodiments, and will not be repeated here.
[0211] According to an embodiment of this disclosure, an apparatus embodiment for implementing the above-described model training method is also provided. FIG10 is a schematic structural diagram of a model training apparatus according to an embodiment of this disclosure. As shown in FIG10, the apparatus includes:
[0212] The first acquisition module 1001 is configured to acquire an emotional speech dataset and a predefined anchor set, wherein the predefined anchor set is used to define multiple different types of emotion categories;
[0213] The first training module 1002 is configured to train the initial emotion dimension prediction model based on the emotion speech dataset and the predefined anchor point set to generate the target emotion dimension prediction model. The target emotion dimension prediction model is used to map the discrete emotion feature vectors corresponding to the emotion speech dataset to the multi-dimensional emotion space to guide the emotion speech synthesis model to predict the acoustic details corresponding to the prompt audio.
[0214] Optionally, the initial emotion dimension prediction model includes a feature transformation module, a classification module, and an anchoring dimensionality reduction module. The first training module 1002 is further configured to: use the feature transformation module to perform feature transformation on the emotional speech dataset to obtain emotion feature vectors; use the classification module to classify the emotional states of the emotion feature vectors to obtain emotion category target tags; guide the anchoring dimensionality reduction module, based on the predefined anchor set, to reduce the dimensionality of the emotion feature vectors to obtain emotion dimension vectors; and train the initial emotion dimension prediction model based on the emotion feature vectors, the emotion category target tags, and the emotion dimension vectors to generate the target emotion dimension prediction model.
[0215] Optionally, the feature transformation module includes a pre-trained acoustic model and a linear module. The first training module 1002 is further configured to: use the pre-trained acoustic model to extract features from the emotional speech dataset to obtain an acoustic feature vector; and use the linear module to perform linear feature transformation on the acoustic feature vector to obtain an emotional feature vector, wherein the feature dimension of the emotional feature vector is a preset fixed value.
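The two-stage feature transformation described above (feature extraction followed by a linear projection to a fixed dimension) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: `extract_acoustic_features` is a toy stand-in for the pre-trained acoustic model, the weights are random rather than trained, and the dimensions 8 and 4 are arbitrary illustrative values.

```python
import random

EMOTION_FEATURE_DIM = 4  # preset fixed dimension (illustrative value, not from the disclosure)


def extract_acoustic_features(waveform, dim=8):
    """Toy stand-in for a pre-trained acoustic model: mean absolute amplitude per chunk."""
    chunk = max(1, len(waveform) // dim)
    feats = []
    for i in range(dim):
        part = waveform[i * chunk:(i + 1) * chunk]
        feats.append(sum(abs(s) for s in part) / len(part) if part else 0.0)
    return feats


def make_linear(in_dim, out_dim, seed=0):
    """Create untrained weights and zero bias for a linear module."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, weights, bias):
    """Apply y = W x + b, mapping the acoustic feature vector to the fixed dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


waveform = [0.1 * i for i in range(16)]
acoustic = extract_acoustic_features(waveform)           # acoustic feature vector
weights, bias = make_linear(len(acoustic), EMOTION_FEATURE_DIM)
emotion_feature = linear(acoustic, weights, bias)        # emotion feature vector
```

Only the linear module would be updated during training in this reading, since the acoustic model is described as pre-trained.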
[0216] Optionally, the first training module 1002 is further configured to: classify the emotion feature vector based on the emotion category target tag to obtain the classification feature vector, wherein the classification feature vector is the feature vector obtained by classifying according to different emotion categories; determine the target loss between the classification feature vector and the emotion feature vector; update the parameters of the linear module, the classification module and the anchoring dimensionality reduction module according to the target loss to generate a target emotion dimension prediction model.
[0217] Optionally, the first training module 1002 is further configured to: create a nearest neighbor graph using the emotion feature vectors, wherein the nearest neighbor graph is used to reflect the similarity between the emotion feature vectors; and, in the nearest neighbor graph, classify similar vectors among the emotion feature vectors according to the emotion category target tags to obtain the classification feature vectors.
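One possible reading of the nearest neighbor graph step is sketched below: each emotion feature vector is linked to its k closest vectors by Euclidean distance, and the neighbors are then filtered by emotion category tag. The choice of Euclidean distance, the value of k, and the function names `knn_graph` and `same_label_neighbors` are assumptions; the disclosure does not specify how similarity is measured.

```python
import math


def knn_graph(vectors, k):
    """Map each vector index to the indices of its k nearest neighbors (Euclidean)."""
    graph = {}
    for i, v in enumerate(vectors):
        dists = sorted((math.dist(v, u), j) for j, u in enumerate(vectors) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph


def same_label_neighbors(graph, labels):
    """Keep only the neighbors that share the node's emotion category tag."""
    return {i: [j for j in nbrs if labels[j] == labels[i]] for i, nbrs in graph.items()}


vectors = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels = ["happy", "happy", "sad", "sad"]
graph = knn_graph(vectors, k=2)
grouped = same_label_neighbors(graph, labels)
```

The brute-force search is quadratic in the number of vectors, which is adequate for a sketch but would typically be replaced by an approximate nearest neighbor index on a large dataset.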
[0218] Optionally, the initial value of the emotion feature vector is determined based on a predefined set of anchor points and preset Gaussian noise.
[0219] Optionally, the emotional dimension vector is a representation vector of a three-dimensional emotional space, where the three-dimensional emotional space includes: pleasure, arousal, and dominance.
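The anchor-based initialization and the three-dimensional emotional space can be illustrated together. In the sketch below, each sample starts at the anchor of its emotion category plus a small Gaussian perturbation. The anchor coordinates, the noise scale `sigma`, and the names `ANCHORS` and `init_points` are hypothetical; the disclosure gives no numerical values, and this is only one reading of how the anchors and noise are combined.

```python
import random

# Hypothetical anchors in (pleasure, arousal, dominance); values are illustrative only.
ANCHORS = {
    "happy": (0.8, 0.5, 0.4),
    "sad": (-0.6, -0.4, -0.3),
    "angry": (-0.5, 0.7, 0.3),
}


def init_points(labels, anchors=ANCHORS, sigma=0.05, seed=0):
    """Initialize one 3-D point per sample: category anchor plus Gaussian noise."""
    rng = random.Random(seed)
    return [
        tuple(coord + rng.gauss(0.0, sigma) for coord in anchors[label])
        for label in labels
    ]


points = init_points(["happy", "sad", "happy"])
```

Starting near the anchors keeps samples of the same category close together while the noise prevents them from collapsing onto a single point.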
[0220] By employing the embodiments of this disclosure, an emotional speech dataset and a predefined anchor point set are acquired. Then, an initial emotional dimension prediction model is trained based on the emotional speech dataset and the predefined anchor point set to generate a target emotional dimension prediction model. The trained target emotional dimension prediction model is used to map the discrete emotional feature vectors corresponding to the emotional speech dataset to a multi-dimensional emotional space, thereby guiding the emotional speech synthesis model to predict the acoustic details corresponding to the prompt audio. This achieves the goal of training the emotional dimension prediction model and the emotional speech synthesis model at low cost, thus realizing that the model training of the emotional dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data. The emotional dimension prediction model improves the precision of emotional control and the naturalness of complex emotional simulation in the emotional speech synthesis model, enabling the generation of high-quality and diverse emotional speech. This solves the technical problems of related technologies where emotional speech synthesis systems rely on emotional speech data, have high training costs, and exhibit poor precision in emotional control and complex emotional simulation.
[0221] It should be noted that the first acquisition module 1001 and the first training module 1002 mentioned above correspond to steps S21 and S22 in the embodiments. The two modules and the corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware components or software components stored in memory and processed by one or more processors. The above modules can also run in the server 10 provided in the embodiments.
[0222] According to embodiments of this disclosure, another apparatus embodiment for implementing the above-described model training method is also provided. FIG11 is a schematic structural diagram of another model training apparatus according to an embodiment of this disclosure. As shown in FIG11, the apparatus includes:
[0223] The second acquisition module 1101 is configured to acquire training text, training audio, and the training text transcription corresponding to the training audio.
[0224] The prediction module 1102 is configured to use a target emotional dimension prediction model to predict the emotional dimension of the training audio and obtain the emotional dimension vector corresponding to the training audio. The target emotional dimension prediction model is obtained by using any of the above model training methods.
[0225] The second training module 1103 is configured to train the initial emotional speech synthesis model based on the training text, training audio, training text transcription, and emotional dimension vector to generate the target emotional speech synthesis model. The target emotional speech synthesis model is used to synthesize the target text, the prompt audio, and the prompt text transcription corresponding to the prompt audio to obtain the target emotional speech.
[0226] Optionally, the initial emotional speech synthesis model includes an initial autoregressive language model and an initial non-autoregressive language model. The second training module 1103 is further configured to: acquire the first phoneme input corresponding to the training text, the second phoneme input corresponding to the training text transcription, the phoneme feature vector corresponding to the training audio, and the acoustic feature vector corresponding to the training audio; train the initial autoregressive language model using the first phoneme input, the second phoneme input, and the phoneme feature vector to generate a target autoregressive language model; train the initial non-autoregressive language model using the first phoneme input, the second phoneme input, the predicted phoneme feature vector output by the initial autoregressive language model, the acoustic feature vector, and the emotional dimension vector to generate a target non-autoregressive language model; and construct a target emotional speech synthesis model based on the target autoregressive language model and the target non-autoregressive language model.
[0227] Optionally, the second training module 1103 is further configured to: extract features from the training audio using a pre-trained acoustic model to obtain an acoustic feature vector; and perform linear feature transformation on the acoustic feature vector using a linear module to obtain a phoneme feature vector.
[0228] By employing the embodiments of this disclosure, training text, training audio, and the corresponding training text transcription are obtained. Then, a target emotional dimension prediction model is used to predict the emotional dimension of the training audio, resulting in an emotional dimension vector corresponding to the training audio. The target emotional dimension prediction model is obtained using any of the aforementioned model training methods. Finally, an initial emotional speech synthesis model is trained based on the training text, training audio, training text transcription, and emotional dimension vector to generate a target emotional speech synthesis model. The target emotional speech synthesis model is used to synthesize emotional speech from the target text, prompt audio, and the corresponding prompt text transcription to obtain the target emotional speech. This achieves the goal of obtaining the emotional dimension prediction model and the emotional speech synthesis model at low cost, thus realizing that the model training of the emotional dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data. The emotional dimension prediction model improves the precision of emotional control and the naturalness of expression in complex emotional simulation of the emotional speech synthesis model, enabling the generation of high-quality and diverse emotional speech. This solves the technical problems of emotional speech synthesis systems in related technologies, such as dependence on emotional speech data, high training costs, and poor precision of emotional control and complex emotional simulation.
[0229] It should be noted that the second acquisition module 1101, prediction module 1102, and second training module 1103 mentioned above correspond to steps S41 to S43 in the embodiments. The three modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware or software components stored in memory and processed by one or more processors. The above modules can also run in the server 10 provided in the embodiments.
[0230] According to embodiments of this disclosure, an apparatus embodiment for implementing the above-described audio processing method is also provided. FIG12 is a schematic structural diagram of an audio processing apparatus according to an embodiment of this disclosure. As shown in FIG12, the apparatus includes:
[0231] The third acquisition module 1201 is configured to acquire the target text, the prompt audio, and the corresponding prompt text transcription of the prompt audio.
[0232] The first synthesis module 1202 is configured to use a target emotional speech synthesis model to synthesize emotional speech from the target text, the prompt audio, and the transcribed prompt text to obtain the target emotional speech; wherein the target emotional speech synthesis model is obtained by using any of the above-mentioned model training methods.
[0233] Optionally, the target emotional speech synthesis model includes: a target autoregressive language model, a target non-autoregressive language model, and a target audio decoder. The first synthesis module 1202 is further configured to: acquire the third phoneme input corresponding to the target text, the fourth phoneme input corresponding to the transcribed prompt text, the emotional dimension vector corresponding to the prompt audio, the phoneme feature vector corresponding to the prompt audio, and the acoustic feature vector corresponding to the prompt audio; input the third phoneme input, the fourth phoneme input, and the phoneme feature vector into the target autoregressive language model and output the predicted phoneme feature vector; input the third phoneme input, the fourth phoneme input, the predicted phoneme feature vector, the acoustic feature vector, and the emotional dimension vector into the target non-autoregressive language model and output the predicted acoustic feature vector; and use the target audio decoder to perform audio decoding on the predicted acoustic feature vector to obtain the target emotional speech.
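The inference data flow described above can be sketched as three chained calls. In this Python sketch the three models are passed in as plain callables, and the function name `synthesize_emotional_speech` and all parameter names are hypothetical; only the order of the stages and the inputs each stage receives are taken from the text.

```python
from typing import Callable, Sequence


def synthesize_emotional_speech(
    target_phonemes: Sequence,        # third phoneme input (from the target text)
    prompt_phonemes: Sequence,        # fourth phoneme input (from the transcribed prompt text)
    prompt_phoneme_feats: Sequence,   # phoneme feature vector of the prompt audio
    prompt_acoustic_feats: Sequence,  # acoustic feature vector of the prompt audio
    emotion_dims: Sequence,           # emotional dimension vector of the prompt audio
    ar_model: Callable,
    nar_model: Callable,
    audio_decoder: Callable,
):
    """Chain the autoregressive model, the non-autoregressive model, and the decoder."""
    # Stage 1: the autoregressive model predicts the phoneme feature vector.
    predicted_phoneme_feats = ar_model(target_phonemes, prompt_phonemes, prompt_phoneme_feats)
    # Stage 2: the non-autoregressive model predicts acoustic features,
    # conditioned on the emotional dimension vector.
    predicted_acoustic_feats = nar_model(
        target_phonemes,
        prompt_phonemes,
        predicted_phoneme_feats,
        prompt_acoustic_feats,
        emotion_dims,
    )
    # Stage 3: the audio decoder turns the predicted acoustic features into speech.
    return audio_decoder(predicted_acoustic_feats)
```

Note that the emotional dimension vector enters only at the second stage, which matches the description of it guiding the prediction of acoustic details.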
[0234] By employing the embodiments of this disclosure, target text, prompt audio, and corresponding prompt text transcription are obtained, and a target emotional speech synthesis model is used to synthesize emotional speech from the target text, prompt audio, and prompt text transcription to obtain target emotional speech. The target emotional speech synthesis model is obtained using any of the aforementioned model training methods. This achieves the goal of generating high-quality, diverse emotional speech, thereby realizing that the training of the emotional dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data. The emotional dimension prediction model improves the precision of emotional control and the naturalness of complex emotional simulation in the emotional speech synthesis model, enabling the generation of high-quality, diverse emotional speech. This solves the technical problems in related technologies where emotional speech synthesis systems rely on emotional speech data, have high training costs, and exhibit poor precision in emotional control and complex emotional simulation.
[0235] It should be noted that the third acquisition module 1201 and the first synthesis module 1202 mentioned above correspond to steps S61 and S62 in the embodiments. The two modules and the corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware components or software components stored in memory and processed by one or more processors. The above modules can also run in the server 10 provided in the embodiments.
[0236] According to embodiments of this disclosure, an apparatus embodiment for implementing the above-described audio processing method is also provided. FIG13 is a schematic structural diagram of an audio processing apparatus according to an embodiment of this disclosure. As shown in FIG13, the apparatus includes:
[0237] The fourth acquisition module 1301 is configured to acquire the input text to be converted into virtual customer service voice, the virtual customer service prompt audio, and the corresponding prompt text transcription of the virtual customer service prompt audio.
[0238] The second synthesis module 1302 is configured to use a target emotional speech synthesis model to synthesize emotional speech from the input text, the virtual customer service prompt audio, and the transcribed prompt text, thereby obtaining the virtual customer service emotional speech; wherein, the target emotional speech synthesis model is obtained by using any of the above-mentioned model training methods.
[0239] By employing the embodiments of this disclosure, the input text to be converted into virtual customer service voice, the virtual customer service prompt audio, and the corresponding prompt text transcription are obtained. Then, a target emotional speech synthesis model is used to synthesize emotional speech from the input text, the virtual customer service prompt audio, and the prompt text transcription, thereby obtaining virtual customer service emotional speech. The target emotional speech synthesis model is obtained using any of the model training methods described above. This achieves the goal of generating high-quality and diverse emotional speech, thus realizing that the model training of the emotional dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data. The emotional dimension prediction model improves the precision of emotional control and the naturalness of expression in complex emotional simulation of the emotional speech synthesis model, enabling the generation of high-quality and diverse emotional speech. This solves the technical problems of emotional speech synthesis systems in related technologies, such as dependence on emotional speech data, high training costs, and poor precision of emotional control and complex emotional simulation.
[0240] It should be noted that the fourth acquisition module 1301 and the second synthesis module 1302 mentioned above correspond to steps S71 and S72 in the embodiments. The two modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware components or software components stored in memory and processed by one or more processors. The above modules can also run in the server 10 provided in the embodiments.
[0241] According to an embodiment of this disclosure, an apparatus embodiment for implementing the above-described audio processing method is also provided. FIG14 is a schematic structural diagram of an audio processing apparatus according to an embodiment of this disclosure. As shown in FIG14, the apparatus includes:
[0242] The fifth acquisition module 1401 is configured to acquire an audio processing request through the first application programming interface, wherein the request data carried in the audio processing request includes: target text, prompt audio, and the transcribed prompt text corresponding to the prompt audio;
[0243] The first return module 1402 is configured to return an audio processing response through a second application programming interface. The response data carried in the audio processing response includes: target emotional speech, which is obtained by synthesizing emotional speech from the target text, the prompt audio, and the transcribed prompt text using a target emotional speech synthesis model. The target emotional speech synthesis model is obtained by using any of the above-mentioned model training methods.
[0244] In this embodiment of the present disclosure, an audio processing request is obtained through a first application programming interface (API). The request data carried in the audio processing request includes the target text, the prompt audio, and the prompt text transcription corresponding to the prompt audio. An audio processing response is then returned through a second API. The response data carried in the audio processing response includes the target emotional speech, which is obtained by performing emotional speech synthesis on the target text, the prompt audio, and the prompt text transcription using a target emotional speech synthesis model. The target emotional speech synthesis model is obtained using any of the above-mentioned model training methods. As a result, training the emotional dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data, and the emotional dimension prediction model improves the precision of emotional control and the naturalness with which the emotional speech synthesis model simulates complex emotions, so that high-quality and diverse emotional speech can be generated. This solves the technical problems of emotional speech synthesis systems in the related art, namely dependence on emotional speech data, high training cost, and poor precision in emotional control and complex emotion simulation.
[0245] It should be noted that the fifth acquisition module 1401 and the first return module 1402 correspond to steps S81 and S82 in the embodiments. The examples and application scenarios implemented by the two modules are the same as those of the corresponding steps, but are not limited to the content disclosed in the above embodiments. The above modules or units may be hardware components, or software components stored in memory and executed by one or more processors, and may also run in the server 10 provided in the embodiments.
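For illustration only, the roles of the fifth acquisition module 1401 and the first return module 1402 can be sketched in Python as follows. The field names, the function names, and the `fake_synthesize` stand-in are assumptions introduced for this sketch and are not defined by this disclosure; a real deployment would pass in the trained target emotional speech synthesis model instead.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed signature: (target text, prompt audio, prompt text transcription) -> speech bytes.
Synthesizer = Callable[[str, bytes, str], bytes]


@dataclass(frozen=True)
class AudioProcessingRequest:
    """Request data carried through the first API (field names are assumed)."""
    target_text: str
    prompt_audio: bytes
    prompt_text_transcription: str


@dataclass(frozen=True)
class AudioProcessingResponse:
    """Response data returned through the second API (field name is assumed)."""
    target_emotional_speech: bytes


REQUIRED_FIELDS = ("target_text", "prompt_audio", "prompt_text_transcription")


def acquire_request(payload: dict) -> AudioProcessingRequest:
    """Role of acquisition module 1401: check and unpack the request data."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing request fields: {missing}")
    return AudioProcessingRequest(**{name: payload[name] for name in REQUIRED_FIELDS})


def return_response(request: AudioProcessingRequest,
                    synthesize: Synthesizer) -> AudioProcessingResponse:
    """Role of return module 1402: synthesize speech and wrap it in a response."""
    speech = synthesize(request.target_text,
                        request.prompt_audio,
                        request.prompt_text_transcription)
    return AudioProcessingResponse(target_emotional_speech=speech)


def fake_synthesize(text: str, prompt_audio: bytes, transcription: str) -> bytes:
    """Hypothetical stand-in for the trained model; it only concatenates bytes."""
    return prompt_audio + text.encode("utf-8")
```

Keeping the synthesizer as an injected callable means the request and response handling can be exercised without loading a model.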
[0246] According to an embodiment of this disclosure, an apparatus embodiment for implementing the above-described audio processing method is also provided. FIG15 is a schematic structural diagram of an audio processing apparatus according to an embodiment of this disclosure. As shown in FIG15, the apparatus includes:
[0247] The sixth acquisition module 1501 is configured to acquire the currently input audio processing dialogue request, wherein the request data carried in the audio processing dialogue request includes: target text, prompt audio, and the transcribed prompt text corresponding to the prompt audio;
[0248] The second return module 1502 is configured to return an audio processing dialogue response in response to the audio processing dialogue request. The information carried in the audio processing dialogue response includes: target emotional speech, which is obtained by synthesizing emotional speech from the target text, the prompt audio, and the transcribed prompt text using a target emotional speech synthesis model. The target emotional speech synthesis model is obtained by using any of the above-mentioned model training methods;
[0249] The playback module 1503 is configured to play the target emotional voice through a speaker device.
[0250] In the embodiments of this disclosure, the currently input audio processing dialogue request is obtained. The request data carried in the audio processing dialogue request includes the target text, the prompt audio, and the prompt text transcription corresponding to the prompt audio. In response to the audio processing dialogue request, an audio processing dialogue response is then returned. The information carried in the audio processing dialogue response includes the target emotional speech, which is obtained by performing emotional speech synthesis on the target text, the prompt audio, and the prompt text transcription using a target emotional speech synthesis model. The target emotional speech synthesis model is obtained using any of the above-mentioned model training methods, which will not be elaborated here. Finally, the target emotional speech is played through a speaker device to provide feedback to the user. As a result, training the emotional dimension prediction model and the emotional speech synthesis model does not depend on emotional speech data, and the emotional dimension prediction model improves the precision of emotional control and the naturalness with which the emotional speech synthesis model simulates complex emotions, so that high-quality and diverse emotional speech can be generated. This solves the technical problems of emotional speech synthesis systems in the related art, namely dependence on emotional speech data, high training cost, and poor precision in emotional control and complex emotion simulation.
[0251] It should be noted that the sixth acquisition module 1501, the second return module 1502, and the playback module 1503 correspond to steps S91 to S93 in the embodiments. The examples and application scenarios implemented by the three modules are the same as those of the corresponding steps, but are not limited to the content disclosed in the above embodiments. The above modules or units may be hardware components, or software components stored in memory and executed by one or more processors, and may also run in the server 10 provided in the embodiments.
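As a non-limiting sketch, the acquire, respond, and play sequence of modules 1501 to 1503 can be written as a single Python function. The dictionary keys, the function name, and the `synthesize` and `play` callables are assumptions made for illustration; the disclosure does not prescribe a concrete interface, and the mapping of each comment to steps S91 to S93 assumes the steps occur in the order listed.

```python
from typing import Callable, Dict


def handle_dialogue_request(
    request: Dict[str, object],
    synthesize: Callable[[str, bytes, str], bytes],
    play: Callable[[bytes], None],
) -> Dict[str, bytes]:
    """Run one dialogue turn: acquire the request, build the response, play it."""
    # Acquisition (module 1501): read the three items of request data.
    target_text = str(request["target_text"])
    prompt_audio = bytes(request["prompt_audio"])
    transcription = str(request["prompt_text_transcription"])

    # Response (module 1502): synthesize the target emotional speech.
    speech = synthesize(target_text, prompt_audio, transcription)
    response = {"target_emotional_speech": speech}

    # Playback (module 1503): hand the speech to the speaker device.
    play(response["target_emotional_speech"])
    return response
```

Here `play` stands in for whatever audio output the device exposes, so it can be replaced with a recording function when no speaker is available.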
[0252] It should be noted that the preferred implementation schemes involved in the above embodiments of this disclosure are the same as the schemes, application scenarios and implementation processes provided in the embodiments, but are not limited to the schemes provided in the embodiments.
[0253] Embodiments of this disclosure can provide a computing device. FIG16 is a structural block diagram of a computing device according to an embodiment of this disclosure. As shown in FIG16, the computing device A may include: one or more (only one is shown in FIG16) processors 1602, memory 1604, memory controller, and peripheral interfaces, wherein the peripheral interfaces may connect to radio frequency modules, audio modules, and displays, etc., without limitation.
[0254] The aforementioned computing device A can be understood as an integrated smart terminal, including but not limited to servers, desktop computers, PCs (Personal Computers), all-in-one model machines, etc., and the computing device may have the model described in the above embodiments of this disclosure pre-installed.
[0255] Specifically, computing device A can pre-install various types of models, including but not limited to models in natural language processing, visual processing, speech processing, code processing, and multimodal task processing, thus providing diverse model selection. In different product forms, computing device A can support one or more model usage methods, including but not limited to model training, model invocation, model fine-tuning, model deployment, model inference, and application. In some product forms, computing device A also supports model management, including but not limited to multi-type model management (supporting the management of discriminative, generative, and other model types), model version control (supporting the control of different model versions), and model evaluation (evaluating model performance and effectiveness based on model evaluation tools). In other product forms, computing device A can also create applications based on models, providing API calling capabilities, allowing models to be called into created applications through API interfaces, and providing application management tools for application management and monitoring.
[0256] Furthermore, the computing device A can also include data management (supporting the creation and management of model tuning datasets), a training center (providing abundant training resources to help users learn and master AI technology), and basic control capabilities (providing enterprise-level basic control capabilities to ensure the security and efficient operation of the system). Through the above functions, it provides a comprehensive and integrated device for AI development, training, deployment, and application.
[0257] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the methods and apparatus in the embodiments of this disclosure. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the methods in the above embodiments. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0258] The processor can invoke an executable program stored in memory via a transmission device to execute the method described in any of the above embodiments.
[0259] It will be understood by those skilled in the art that the structure shown in FIG16 is merely illustrative, and computing device A may also be a smartphone, tablet computer, handheld computer, mobile internet device (MID), PAD, or other terminal device. FIG16 does not limit the structure of the aforementioned computing device. For example, computing device A may include more or fewer components (such as network interface, display device, etc.) than shown in FIG16, or may have a different configuration than shown in FIG16.
[0260] Embodiments of this disclosure can provide an electronic device. FIG17 is a structural block diagram of an electronic device according to an embodiment of this disclosure. As shown in FIG17, the electronic device may include: an input / output device 172; a memory 174; and a processor 176, wherein the processor 176 is connected to the input / output device 172 and the memory 174 via a bus 178.
[0261] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the methods and apparatus in the embodiments of this disclosure. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the methods in the above embodiments. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0262] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing the hardware related to the terminal device. The program can be stored in a computer-readable storage medium, which may include: flash drive, read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.
[0263] Embodiments of this disclosure also provide a computer-readable storage medium. Optionally, in this embodiment, the computer-readable storage medium can be used to store program code for executing the model training method or the audio processing method provided in the above embodiments.
[0264] Optionally, in this embodiment, the computer-readable storage medium may be located in any computer terminal in a group of computer terminals in a computer network, or in any mobile terminal in a group of mobile terminals.
[0265] Embodiments of this disclosure also provide a computer program product comprising a computer program that, when executed by a processor, implements any of the above-described model training methods or audio processing methods.
[0266] In the above embodiments of this disclosure, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0267] In the several embodiments provided in this disclosure, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units, or modules, and may be electrical or take other forms.
[0268] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0269] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0270] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard drive, magnetic disk, or optical disk.
[0271] The above description is only a preferred embodiment of this disclosure. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principles of this disclosure, and these improvements and modifications should also be considered within the scope of protection of this disclosure.
Claims
1. A model training method, comprising: obtaining an emotional speech dataset and a predefined anchor set, wherein the predefined anchor set is used to define multiple different emotion categories; and training an initial emotion dimension prediction model based on the emotional speech dataset and the predefined anchor set to generate a target emotion dimension prediction model, wherein the target emotion dimension prediction model is used to map the discrete emotion feature vectors corresponding to the emotional speech dataset to a multi-dimensional emotion space, so as to guide an emotional speech synthesis model in predicting the acoustic details corresponding to prompt audio.
2. The model training method according to claim 1, wherein the initial emotion dimension prediction model includes a feature transformation module, a classification module, and an anchoring dimensionality reduction module, and training the initial emotion dimension prediction model based on the emotional speech dataset and the predefined anchor set to generate the target emotion dimension prediction model comprises: performing feature transformation on the emotional speech dataset using the feature transformation module to obtain an emotion feature vector; classifying the emotional state of the emotion feature vector using the classification module to obtain an emotion category target tag; reducing the dimensionality of the emotion feature vector using the anchoring dimensionality reduction module, guided by the predefined anchor set, to obtain an emotion dimension vector; and training the initial emotion dimension prediction model based on the emotion feature vector, the emotion category target tag, and the emotion dimension vector to generate the target emotion dimension prediction model.
3. The model training method according to claim 2, wherein the feature transformation module includes a pre-trained acoustic model and a linear module, and performing feature transformation on the emotional speech dataset using the feature transformation module to obtain the emotion feature vector comprises: extracting features from the emotional speech dataset using the pre-trained acoustic model to obtain an acoustic feature vector; and linearly transforming the acoustic feature vector using the linear module to obtain the emotion feature vector, wherein the feature dimension of the emotion feature vector is a preset fixed value.
4. The model training method according to claim 3, wherein training the initial emotion dimension prediction model based on the emotion feature vector, the emotion category target tag, and the emotion dimension vector to generate the target emotion dimension prediction model comprises: classifying the emotion feature vector based on the emotion category target tag to obtain a classification feature vector, wherein the classification feature vector is a feature vector obtained by classifying according to different emotion categories; determining a target loss between the classification feature vector and the emotion feature vector; and updating the parameters of the linear module, the classification module, and the anchoring dimensionality reduction module based on the target loss to generate the target emotion dimension prediction model.
5. The model training method according to claim 4, wherein classifying the emotion feature vector based on the emotion category target tag to obtain the classification feature vector comprises: creating a nearest neighbor graph using the emotion feature vectors, wherein the nearest neighbor graph reflects the degree of similarity between the emotion feature vectors; and classifying, in the nearest neighbor graph, similar vectors among the emotion feature vectors according to the emotion category target tag to obtain the classification feature vector.
6. The model training method according to claim 2, wherein the initial value of the emotion feature vector is determined based on the predefined anchor set and preset Gaussian noise.
7. The model training method according to claim 2, wherein the emotion dimension vector is a representation vector in a three-dimensional emotion space whose dimensions are pleasure, arousal, and dominance.
8. A model training method, comprising: obtaining training text, training audio, and the training text transcription corresponding to the training audio; performing emotion dimension prediction on the training audio using a target emotion dimension prediction model to obtain the emotion dimension vector corresponding to the training audio, wherein the target emotion dimension prediction model is obtained by the model training method according to any one of claims 1 to 6; and training an initial emotional speech synthesis model based on the training text, the training audio, the training text transcription, and the emotion dimension vector to generate a target emotional speech synthesis model, wherein the target emotional speech synthesis model is used to perform emotional speech synthesis on target text, prompt audio, and the prompt text transcription corresponding to the prompt audio to obtain target emotional speech.
9. The model training method according to claim 8, wherein the initial emotional speech synthesis model includes an initial autoregressive language model and an initial non-autoregressive language model, and training the initial emotional speech synthesis model based on the training text, the training audio, the training text transcription, and the emotion dimension vector to generate the target emotional speech synthesis model comprises: obtaining a first phoneme input corresponding to the training text, a second phoneme input corresponding to the training text transcription, a phoneme feature vector corresponding to the training audio, and an acoustic feature vector corresponding to the training audio; training the initial autoregressive language model using the first phoneme input, the second phoneme input, and the phoneme feature vector to generate a target autoregressive language model; training the initial non-autoregressive language model using the first phoneme input, the second phoneme input, the predicted phoneme feature vector output by the initial autoregressive language model, the acoustic feature vector, and the emotion dimension vector to generate a target non-autoregressive language model; and constructing the target emotional speech synthesis model based on the target autoregressive language model and the target non-autoregressive language model.
10. The model training method according to claim 9, wherein obtaining the phoneme feature vector corresponding to the training audio comprises: extracting features from the training audio using a pre-trained acoustic model to obtain an acoustic feature vector; and transforming the acoustic feature vector using a linear module to obtain the phoneme feature vector.
11. An audio processing method, comprising: obtaining target text, prompt audio, and the prompt text transcription corresponding to the prompt audio; and performing emotional speech synthesis on the target text, the prompt audio, and the prompt text transcription using a target emotional speech synthesis model to obtain target emotional speech, wherein the target emotional speech synthesis model is obtained by the model training method according to any one of claims 7 to 9.
12. The audio processing method according to claim 11, wherein the target emotional speech synthesis model includes a target autoregressive language model, a target non-autoregressive language model, and a target audio decoder, and performing emotional speech synthesis on the target text, the prompt audio, and the prompt text transcription using the target emotional speech synthesis model to obtain the target emotional speech comprises: obtaining a third phoneme input corresponding to the target text, a fourth phoneme input corresponding to the prompt text transcription, an emotion dimension vector corresponding to the prompt audio, a phoneme feature vector corresponding to the prompt audio, and an acoustic feature vector corresponding to the prompt audio; inputting the third phoneme input, the fourth phoneme input, and the phoneme feature vector into the target autoregressive language model to output a predicted phoneme feature vector; inputting the third phoneme input, the fourth phoneme input, the predicted phoneme feature vector, the acoustic feature vector, and the emotion dimension vector into the target non-autoregressive language model to output a predicted acoustic feature vector; and decoding the predicted acoustic feature vector using the target audio decoder to obtain the target emotional speech.
13. An audio processing method, comprising: obtaining input text to be converted into virtual customer service speech, virtual customer service prompt audio, and the prompt text transcription corresponding to the virtual customer service prompt audio; and performing emotional speech synthesis on the input text, the virtual customer service prompt audio, and the prompt text transcription using a target emotional speech synthesis model to obtain virtual customer service emotional speech, wherein the target emotional speech synthesis model is obtained by the model training method according to any one of claims 7 to 9.
14. An audio processing method, comprising: obtaining an audio processing request through a first application programming interface, wherein the request data carried in the audio processing request includes: target text, prompt audio, and the prompt text transcription corresponding to the prompt audio; and returning an audio processing response through a second application programming interface, wherein the response data carried in the audio processing response includes: target emotional speech, which is obtained by performing emotional speech synthesis on the target text, the prompt audio, and the prompt text transcription using a target emotional speech synthesis model, and the target emotional speech synthesis model is obtained by the model training method according to any one of claims 7 to 9.
15. An audio processing method, comprising: obtaining a currently input audio processing dialogue request, wherein the request data carried in the audio processing dialogue request includes: target text, prompt audio, and the prompt text transcription corresponding to the prompt audio; returning an audio processing dialogue response in response to the audio processing dialogue request, wherein the information carried in the audio processing dialogue response includes: target emotional speech, which is obtained by performing emotional speech synthesis on the target text, the prompt audio, and the prompt text transcription using a target emotional speech synthesis model, and the target emotional speech synthesis model is obtained by the model training method according to any one of claims 7 to 9; and playing the target emotional speech through a speaker device.
16. An electronic device, comprising: a memory storing an executable program; and a processor configured to run the program, wherein the program, when run, performs the model training method according to any one of claims 1 to 10 or the audio processing method according to any one of claims 11 to 15.
17. A computer-readable storage medium comprising a stored executable program, wherein, when the executable program is run, a device in which the computer-readable storage medium is located is controlled to perform the model training method according to any one of claims 1 to 10 or the audio processing method according to any one of claims 11 to 15.
18. A computer program product comprising a computer program that, when executed by a processor, implements the model training method of any one of claims 1 to 10 or the audio processing method of any one of claims 11 to 15.
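For illustration only, and without limiting the claims, the inference data flow recited in claim 12 can be sketched in Python as follows. Every class name, field name, and callable signature here is an assumption made for this sketch; the disclosure does not specify a programming interface, and the five components are supplied as placeholders rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Phonemes = List[str]


@dataclass
class PromptFeatures:
    """Features derived from the prompt audio (names are assumed)."""
    emotion_dimension_vector: Vector
    phoneme_feature_vector: Vector
    acoustic_feature_vector: Vector


@dataclass
class EmotionalSpeechSynthesizer:
    """Wires together placeholder components in the order recited in claim 12."""
    to_phonemes: Callable[[str], Phonemes]
    extract_prompt_features: Callable[[bytes], PromptFeatures]
    autoregressive_lm: Callable[[Phonemes, Phonemes, Vector], Vector]
    non_autoregressive_lm: Callable[[Phonemes, Phonemes, Vector, Vector, Vector], Vector]
    audio_decoder: Callable[[Vector], bytes]

    def synthesize(self, target_text: str, prompt_audio: bytes,
                   prompt_transcription: str) -> bytes:
        # Third and fourth phoneme inputs, plus the prompt-audio features.
        third = self.to_phonemes(target_text)
        fourth = self.to_phonemes(prompt_transcription)
        feats = self.extract_prompt_features(prompt_audio)

        # Autoregressive stage: predict the phoneme feature vector.
        predicted_phoneme = self.autoregressive_lm(
            third, fourth, feats.phoneme_feature_vector)

        # Non-autoregressive stage: predict acoustic features, conditioned
        # on the emotion dimension vector of the prompt audio.
        predicted_acoustic = self.non_autoregressive_lm(
            third, fourth, predicted_phoneme,
            feats.acoustic_feature_vector, feats.emotion_dimension_vector)

        # Decode the predicted acoustic features into the output speech.
        return self.audio_decoder(predicted_acoustic)
```

In this sketch the emotion dimension vector enters only at the non-autoregressive stage, mirroring the inputs that claim 12 lists for each model.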