Method for training an audio generation model and audio generation method
By optimizing the audio generation model using a pre-trained model and multiple loss functions, the decoupled control of audio style and semantics is achieved, solving the problems of insufficient user-friendliness and flexibility of style-controllable TTS in existing technologies, and improving the accuracy and adaptability of generated audio.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING SILICON INTELLIGENCE TECH CO LTD
- Filing Date
- 2026-01-22
- Publication Date
- 2026-06-16
Abstract
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, and in particular to a training method for an audio generation model and an audio generation method.
Background Technology
[0002] In recent years, speech synthesis technology has achieved remarkable leaps in development, demonstrating outstanding results in mimicking the emotional nuances and prosodic control of human voices, while simultaneously reducing the difficulty of generation and the barrier to entry for use. Against this backdrop, text-guided text-to-speech (TTS) has become a research focus. Its core objective is to precisely control the synthesized speech's emotional tone, gender, speech rate, intonation, age group, and other diverse stylistic dimensions through natural language text commands, thereby meeting users' personalized needs.
[0003] Existing style-controllable TTS technologies mainly rely on two approaches: style transfer from reference audio or direct setting of style parameters. However, both have significant limitations: the former requires users to select matching samples from an audio library, while the latter requires users to have professional acoustic knowledge to set precise parameters. Both reduce the system's flexibility and user-friendliness, and it is difficult to find reference samples that perfectly match personalized needs. More importantly, existing technologies cannot effectively separate text content and style features in speech. Style expression is easily interfered with by the semantic tendencies of the text, making it difficult to generate stylized speech that meets expectations. Overcoming these difficulties has become an urgent problem.
Summary of the Invention
[0004] This application provides a training method for an audio generation model and an audio generation method that can decouple the text and style parts of the audio as much as possible, thereby enabling the trained audio generation model to more accurately learn the correspondence between style description and audio style.
[0005] To achieve the above objectives, the embodiments of this application adopt the following technical solutions:
[0006] In a first aspect, embodiments of this application provide a training method for an audio generation model. The method includes: encoding an audio sample set using a pre-trained model to obtain encoded features; the audio sample set includes audio samples, audio recognition text, and audio description text; the audio description text is used to describe the audio style of the audio samples; extracting features from the encoded features using a model to be trained to obtain audio features, recognition text features, and description text features; determining a first loss based on the audio features and the recognition text features, and determining a second loss based on the audio features and the description text features; and adjusting the parameters of the model to be trained based on the first loss, the second loss, and a third loss of the pre-trained model to obtain an audio generation model.
[0007] Based on this scheme, the semantic accuracy and style matching of the generated audio are ensured by jointly optimizing the features of audio, audio-recognized text, and audio-described text. The prior knowledge of the pre-trained model is used to reduce the training difficulty, improve the feature encoding quality, and shorten the training cycle. The joint constraint of multiple losses reduces model overfitting, enhances the adaptability to new samples, and improves the matching of the content and style of the generated audio.
[0008] In another possible implementation, the method further includes: encoding the audio sample using an initial audio encoder to obtain initial audio features, and encoding the audio recognition text using a prior encoder of an initial speech synthesis module to obtain initial recognition text features; performing feature transformation on the initial audio features using an initial style adapter to obtain feature transformation results; performing speech synthesis using the initial speech synthesis module based on the audio linear spectrum extracted from the audio sample, the feature transformation results, and the initial recognition text features to obtain initial audio; determining the third loss based on the initial audio and a reference audio, and adjusting the parameters of the initial audio encoder, the initial style adapter, and the initial speech synthesis module according to the third loss to obtain a pre-trained model containing a pre-trained audio encoder, a pre-trained style adapter, and a pre-trained speech synthesis module.
[0009] Based on this scheme, by separating audio encoding and text encoding branches and combining feature transformation of the style adapter, the decoupled control of audio style and semantic content is achieved, improving the style customization flexibility of speech synthesis. Synthesis is jointly driven by audio linear spectrum, transformation features and text features, and the sound quality and semantic accuracy of synthesized audio are ensured by optimizing the loss between the initial audio and the reference audio. Joint parameter adjustment of the encoder, style adapter and speech synthesis module is performed to form an integrated pre-trained model, reducing the training cost of subsequent tasks and improving the model's generalization ability.
[0010] In another possible implementation, the model to be trained includes the pre-trained model, a pre-trained text encoder, and multiple query transformers to be trained; wherein any query transformer includes a group attention unit, a cross attention unit, a feature filtering unit, and a normalization unit; correspondingly, the step of extracting features from the encoded features through the model to be trained to obtain audio features, recognition text features, and descriptive text features includes: calculating attention weights on the encoded features through the group attention unit to obtain a first feature sequence; calculating attention weights on the first feature sequence through the cross attention unit to obtain a second feature sequence; filtering the second feature sequence through the feature filtering unit to obtain a third feature sequence; and normalizing the third feature sequence through the normalization unit to obtain the audio features, the recognition text features, and the descriptive text features.
[0011] Based on this scheme, the model can more accurately locate key features related to audio generation through the two-stage weighting of group attention and cross attention. The targeted filtering of the feature filtering unit removes invalid features and retains core features that are strongly related to audio semantics and style, reducing interference with subsequent loss calculations and improving model training efficiency. Normalization eliminates the distribution differences between feature sequences by placing features in a uniform numerical range, which avoids the exploding or vanishing gradients caused by uneven feature distributions. It also improves the compatibility of multi-feature fusion and ensures the consistency of the final output audio features, recognition text features, and descriptive text features.
[0012] In another possible implementation, the above-mentioned calculation of attention weights for the encoded features by the group attention unit to obtain a first feature sequence includes: performing vector transformation on the query vector through the group attention unit to obtain a first group query matrix, a first group key matrix, and a first group value matrix; wherein the query vector is used to extract semantics related to text content from the audio features included in the encoded features; splitting the first group query matrix, the first group key matrix, and the first group value matrix to obtain a first sub-group query matrix, a first sub-group key matrix, and a first sub-group value matrix, and using the first sub-group query matrix as a first query group, and the first sub-group key matrix and the first sub-group value matrix as a first key-value group; wherein any first key-value group corresponds to multiple first query groups; calculating attention weights through the first query group to obtain a first weight matrix; assigning weights to the first sub-group value matrices in the first key-value groups corresponding to the first query group based on the first weight matrix to obtain weighted first sub-group value matrices; and concatenating the weighted first sub-group value matrices to obtain a first feature sequence corresponding to the audio encoded features.
[0013] Based on this scheme, group splitting together with the one-to-many mapping from a key-value group to query groups allows targeted attention calculations on the text-related semantic features of different dimensions in the audio features, avoiding the semantic ambiguity caused by a single-matrix calculation. Weighting the value matrices of the corresponding key-value group with the weight matrix calculated from the query group highlights audio semantic features related to the text content, weakens irrelevant background features, and improves the semantic directionality of the features. Concatenating the weighted sub-group value matrices preserves the local semantic details produced by the grouped calculation and restores a complete feature sequence matching the audio encoding features, providing high-quality input for the subsequent cross-attention calculation.
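The grouped attention of paragraphs [0012] and [0013] can be illustrated with a minimal Python sketch. It assumes that the query, key, and value matrices have already been produced by the vector transformation (the learned projections are omitted), that the weighting rule is scaled dot-product attention with a softmax, and that consecutive query groups share one key-value group. The function names and these choices are illustrative assumptions, not details taken from this application.

```python
import math

def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, k, v):
    """Scaled dot-product attention for one group; q, k, v are lists of rows."""
    d = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
                  for k_row in k]
        weights = softmax(scores)  # one row of the weight matrix
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

def split_columns(x, n_groups):
    """Split each row of x into n_groups equal column slices (sub-group matrices)."""
    width = len(x[0]) // n_groups
    return [[row[g * width:(g + 1) * width] for row in x]
            for g in range(n_groups)]

def grouped_query_attention(q, k, v, n_query_groups, n_kv_groups):
    """q: [T, n_query_groups * d]; k, v: [S, n_kv_groups * d] (assumed layout)."""
    assert n_query_groups % n_kv_groups == 0
    share = n_query_groups // n_kv_groups  # query groups per key-value group
    q_groups = split_columns(q, n_query_groups)
    k_groups = split_columns(k, n_kv_groups)
    v_groups = split_columns(v, n_kv_groups)
    heads = [attend(q_group, k_groups[i // share], v_groups[i // share])
             for i, q_group in enumerate(q_groups)]
    # Concatenate the weighted sub-group value matrices along the feature axis.
    return [sum((head[t] for head in heads), []) for t in range(len(q))]
```

With four query groups and two key-value groups, each key-value group serves two query groups, so the key and value matrices are half as wide as the query matrix while the output keeps the full query width.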
[0014] In another possible implementation, the above-mentioned calculation of attention weights for the encoded features by the group attention unit to obtain the first feature sequence includes: performing vector transformation on the recognition text features and / or descriptive text features through the group attention unit to obtain a second group query matrix, a second group key matrix, and a second group value matrix; splitting the second group query matrix, the second group key matrix, and the second group value matrix to obtain a second sub-group query matrix, a second sub-group key matrix, and a second sub-group value matrix, and using the second sub-group query matrix as a second query group, and the second sub-group key matrix and the second sub-group value matrix as a second key-value group; wherein any second key-value group corresponds to multiple second query groups; calculating attention weights through the second query group to obtain a second weight matrix; assigning weights to the second sub-group value matrices in the second key-value groups corresponding to the second query group based on the second weight matrix to obtain weighted second sub-group value matrices; and concatenating the weighted second sub-group value matrices to obtain the first feature sequence corresponding to the recognition text encoding features or the descriptive text encoding features.
[0015] Based on this scheme, matrix transformation, group splitting, the mapping of one key-value group to multiple query groups, and grouped attention weighting together achieve refined semantic mining and directionality enhancement of text features, providing high-quality text feature input for subsequent cross-modal feature matching.
[0016] In another possible implementation, the above-mentioned calculation of attention weights on the first feature sequence through the cross-attention unit to obtain the second feature sequence includes: converting the first feature sequence into a cross-query matrix and converting the encoded features into a cross-key matrix and a cross-value matrix through the cross-attention unit; splitting the cross-query matrix, the cross-key matrix, and the cross-value matrix to obtain multiple attention heads; wherein each attention head corresponds to a second sub-cross-query matrix, a second sub-cross-key matrix, and a second sub-cross-value matrix; performing feature filtering based on the attention heads to obtain a masking matrix, and calculating attention weights based on the masking matrix to obtain a third weight matrix; assigning weights to the second sub-cross-value matrix based on the third weight matrix to obtain a weighted second sub-cross-value matrix; and concatenating the weighted second sub-cross-value matrix to obtain a second feature sequence corresponding to the audio encoding features, the recognized text encoding features, or the descriptive text encoding features.
[0017] Based on this scheme, through matrix transformation, group splitting, the mapping of one key-value group to multiple query groups, and weighted concatenation, it is possible to finely mine local key information related to audio style and semantics in text features, avoid the feature generalization problems caused by single-matrix attention calculation, and improve both the semantic orientation of text features and the alignment accuracy of multimodal features.
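The masked cross attention of paragraph [0016] can be sketched as follows. The application does not state how the masking matrix is derived, so this sketch assumes a top-k rule: for each attention head, only the `k_keep` highest-scoring keys are kept for each query. The cross-query, cross-key, and cross-value matrices are assumed to be already projected, and all names here are hypothetical.

```python
import math

def top_k_mask(scores, k_keep):
    """Boolean mask keeping the k_keep largest scores (assumed sparsity rule)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k_keep])
    return [i in keep for i in range(len(scores))]

def masked_softmax(scores, mask):
    """Softmax over the unmasked positions; masked positions get weight 0."""
    top = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_attention(queries, keys, values, n_heads, k_keep):
    """queries: [T, n_heads * d]; keys, values: [S, n_heads * d]."""
    d = len(queries[0]) // n_heads
    out = []
    for q_row in queries:
        row = []
        for h in range(n_heads):
            cols = slice(h * d, (h + 1) * d)  # this head's sub-matrices
            scores = [sum(a * b for a, b in zip(q_row[cols], k_row[cols]))
                      / math.sqrt(d) for k_row in keys]
            mask = top_k_mask(scores, k_keep)
            weights = masked_softmax(scores, mask)
            row += [sum(w * v_row[cols][c] for w, v_row in zip(weights, values))
                    for c in range(d)]
        out.append(row)  # heads concatenated along the feature axis
    return out
```

Setting `k_keep` to the number of keys recovers ordinary dense multi-head cross attention, which makes the effect of the mask easy to compare.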
[0018] In another possible implementation, determining the first loss based on the audio features and the recognized text features includes: calculating conditional probabilities based on the audio features and the recognized text features; calculating a logarithmic ratio based on the conditional probabilities and summing the logarithmic ratios to obtain a ratio sum; and determining the first loss based on the ratio sum and the number of features of the audio features and the recognized text features.
[0019] Based on this scheme, the first loss is calculated from the sum of the logarithmic ratios of the conditional probabilities and the number of features. Through probability quantization and numerical normalization, the degree of matching between audio features and recognized text features is accurately measured, avoiding loss calculation deviations caused by differences in feature dimensions and strengthening the constraint on the semantic accuracy of generated audio.
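One plausible reading of the first loss in paragraphs [0018] and [0019] is a contrastive objective of the InfoNCE type, sketched below. The cosine similarity, the softmax form of the conditional probability, and the `temperature` parameter are assumptions introduced for illustration; the application only states that logarithmic ratios of conditional probabilities are summed and combined with the number of features.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def first_loss(audio_feats, text_feats, temperature=0.1):
    """audio_feats[i] is assumed to be paired with text_feats[i]."""
    n = len(audio_feats)
    ratio_sum = 0.0
    for i, audio in enumerate(audio_feats):
        logits = [cosine(audio, text) / temperature for text in text_feats]
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Log ratio: matched-pair score against the scores of all candidates.
        ratio_sum += logits[i] - log_denominator
    # Normalize the ratio sum by the number of features.
    return -ratio_sum / n
```

Under these assumptions the loss approaches zero when every audio feature is most similar to its own recognition text feature, and grows when the pairing is confused.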
[0020] In another possible implementation, the method further includes: acquiring dialogue audio from multiple users; the dialogue audio includes emotion tags; identifying a first style description text corresponding to the dialogue audio using an audio understanding model; using the dialogue audio as a real audio sample in the audio sample, and using the first style description text as the real audio description text in the audio description text.
[0021] Based on this solution, user dialogue audio with emotion tags is introduced, and corresponding style description text is automatically generated as real samples. Training data that fits real application scenarios is constructed, so that the audio style learned by the model is strongly correlated with the actual human dialogue emotions, thereby improving the naturalness and scene adaptability of the generated audio.
[0022] In another possible implementation, the method further includes: acquiring dialogue text containing emotion tags, and copying the dialogue audio to obtain copied audio; generating audio based on the dialogue text, the dialogue audio, the copied audio, and the emotion tags to obtain generated audio samples; identifying the second style description text corresponding to the generated audio samples through an audio understanding model, using the generated audio samples as virtual audio samples in the audio samples, and using the second style description text as virtual audio description text in the audio description text.
[0023] Based on this solution, virtual audio samples are generated by copying real dialogue audio and combining it with emotion tags. This expands the training sample size without collecting additional real data, while ensuring that the emotional style of the virtual samples is consistent with that of the real samples. This reduces data collection costs and solves the problem of model overfitting in small sample scenarios.
[0024] In another possible implementation, the method further includes: classifying the audio description text according to the emotion tag to obtain a first-level audio description text; classifying the first-level audio description text according to the emotion sub-tags under the emotion tag to obtain a second-level audio description text; generating a positive sample set, a first negative sample set, and a second negative sample set based on the emotion tag, the emotion sub-tags, the first-level audio description text, and the second-level audio description text; wherein, the positive sample set includes the second-level audio description text under the emotion sub-tag of any emotion tag, the first negative sample set includes the second-level audio description text under other emotion sub-tags of any emotion tag, and the second negative sample set includes the first-level audio description text under other emotion tags.
[0025] Based on this approach, the audio description text is classified according to the "emotion tag - emotion sub-tag" hierarchy, and a positive sample set and two negative sample sets are constructed. This allows the model to learn the hierarchical differences in style during training, enabling it to distinguish between major emotion categories such as "happy" and "sad" and to identify sub-styles under the same emotion such as "laughing" and "smiling", thereby improving the refinement of style generation.
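The construction of the three sample sets in paragraphs [0024] and [0025] can be sketched as a simple partition. The record layout (a tuple of emotion tag, emotion sub-tag, and description text) and the function name are hypothetical; for brevity the sketch treats every description under another emotion tag as first-level text.

```python
def build_sample_sets(descriptions, anchor_tag, anchor_sub_tag):
    """Partition description texts relative to one (tag, sub-tag) anchor.

    descriptions: list of (emotion_tag, emotion_sub_tag, text) tuples.
    """
    positives, first_negatives, second_negatives = [], [], []
    for tag, sub_tag, text in descriptions:
        if tag != anchor_tag:
            # Different emotion tag: second negative sample set.
            second_negatives.append(text)
        elif sub_tag != anchor_sub_tag:
            # Same emotion tag, different sub-tag: first negative sample set.
            first_negatives.append(text)
        else:
            # Same emotion tag and same sub-tag: positive sample set.
            positives.append(text)
    return positives, first_negatives, second_negatives
```

For an anchor of "happy" and "laughing", texts tagged "happy" and "smiling" fall into the first negative set, and texts tagged "sad" fall into the second negative set.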
[0026] In another possible implementation, determining the second loss based on the audio features and the descriptive text features includes: determining a positive sample loss based on the audio features and the descriptive text features of the second-level audio description text in the positive sample set; determining a first negative sample loss based on the audio features and the descriptive text features of the second-level audio description text in the first negative sample set; determining a second negative sample loss based on the audio features and the descriptive text features of the first-level audio description text in the second negative sample set; and using the sum of the positive sample loss, the first negative sample loss, and the second negative sample loss as the second loss.
[0027] Based on this scheme, a contrastive loss function is formed by combining positive sample loss, first negative sample loss, and second negative sample loss. By constraining the difference between positive and negative samples, the model strengthens the learning of effective associations (positive samples) and weakens ineffective associations (negative samples). The fusion of multi-sample losses allows the model to more clearly distinguish the feature differences between different samples, improves the discriminative power of audio and text association, and avoids problems such as style confusion and content deviation in generated audio. The contrastive loss design is adapted to multi-level classification sample sets, further strengthening the model's ability to capture style features and improving the style adaptability of generated audio.
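The summation described in paragraphs [0026] and [0027] can be sketched as below. The application fixes only that the second loss is the sum of three terms; the cosine similarity, the hinge form of the negative terms, and the two margins (`near_margin` for sub-style negatives, `far_margin` for other-emotion negatives) are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean(xs):
    return sum(xs) / len(xs) if xs else 0.0

def second_loss(audio, positives, first_negatives, second_negatives,
                near_margin=0.5, far_margin=0.0):
    """audio: one audio feature vector; the other arguments are lists of
    description text feature vectors from the three sample sets."""
    # Pull the audio feature towards matching descriptions.
    positive_loss = mean([1.0 - cosine(audio, p) for p in positives])
    # Penalize sub-style negatives only when they are too similar (assumed hinge).
    first_negative_loss = mean(
        [max(0.0, cosine(audio, n) - near_margin) for n in first_negatives])
    # Other-emotion negatives are pushed further away (stricter assumed margin).
    second_negative_loss = mean(
        [max(0.0, cosine(audio, n) - far_margin) for n in second_negatives])
    return positive_loss + first_negative_loss + second_negative_loss
```

The looser margin for the first negative set reflects that sub-styles of the same emotion are expected to stay closer to the anchor than descriptions of a different emotion; this design choice is part of the sketch, not of the application.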
[0028] In another possible implementation, the step of encoding the audio sample set through a pre-trained model to obtain encoding features includes: encoding the audio samples through a pre-trained audio encoder of the pre-trained model to obtain audio encoding features; encoding the audio description text through a pre-trained text encoder to obtain the description text encoding features; and encoding the audio recognition text through a pre-trained speech synthesis module of the pre-trained model to obtain the recognition text encoding features.
[0029] Based on this scheme, three dedicated encoders of the pre-trained model are used to process audio samples, descriptive text, and recognition text respectively, so as to realize independent encoding and differentiated representation of multimodal features, give full play to the prior advantages of the pre-trained model on each modal data, improve the quality of encoded features, and lay a high-quality foundation for subsequent feature extraction.
[0030] In another possible implementation, the step of extracting features from the encoded features using the model to be trained to obtain audio features, recognition text features, and descriptive text features includes: extracting features from the audio encoded features using a first query transformer of the model to be trained to obtain the audio features; extracting features from the recognition text encoded features using a second query transformer of the model to be trained to obtain the recognition text features; and extracting features from the descriptive text encoded features using a third query transformer of the model to be trained to obtain the descriptive text features.
[0031] Based on this scheme, three independent query transformers to be trained are configured to extract three types of encoded features respectively, thereby achieving dedicated optimization of feature extraction, avoiding mutual interference between different modal features during the extraction process, improving the purity of audio features, recognition text features, and description text features, and enhancing the matching accuracy between multimodal features.
[0032] In a second aspect, embodiments of this application provide an audio generation method, comprising: obtaining audio content text and an audio style description; generating audio using an audio generation model based on the audio content text and the audio style description to obtain a target audio; wherein the audio style of the target audio is consistent with the descriptive style of the audio style description; wherein the audio generation model is trained using the training method of the audio generation model described in the first aspect above.
[0033] In a third aspect, embodiments of this application provide a training apparatus for an audio generation model. The apparatus includes: an encoding module for encoding an audio sample set using a pre-trained model to obtain encoded features; the audio sample set includes audio samples, audio recognition text, and audio description text; the audio description text describes the audio style of the audio samples; a feature extraction module for extracting features from the encoded features using a model to be trained to obtain audio features, recognition text features, and description text features; a loss calculation module for determining a first loss based on the audio features and the recognition text features, and determining a second loss based on the audio features and the description text features; and a parameter adjustment module for adjusting the parameters of the model to be trained based on the first loss, the second loss, and a third loss of the pre-trained model to obtain an audio generation model.
[0034] In a fourth aspect, embodiments of this application provide an audio generation apparatus, comprising: a text acquisition module configured to acquire audio content text and an audio style description; and an audio generation module configured to generate audio based on the audio content text and the audio style description using an audio generation model to obtain target audio; wherein the audio style of the target audio is consistent with the descriptive style of the audio style description; and wherein the audio generation model is trained using the training method of the audio generation model in any of the first aspects described above.
[0035] In a fifth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program for executing the training method of the audio generation model provided in the first aspect or the audio generation method provided in the second aspect.
[0036] In a sixth aspect, embodiments of this application also provide an electronic device, including: one or more processors; and a memory configured to store one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement a training method for an audio generation model as described in any of the first aspects or an audio generation method provided in the second aspect.
[0037] In a seventh aspect, embodiments of this application provide a computer program product that, when instructions in the computer program product are executed by a processor, executes the training method of the audio generation model provided in the first aspect or the audio generation method provided in the second aspect.
Attached Figure Description
[0038] Figure 1 is a schematic diagram of a model architecture provided in an embodiment of this application.
[0039] Figure 2 is a flowchart of a training method for an audio generation model provided in an embodiment of this application.
[0040] Figure 3 is a schematic diagram of the structure of a Q-former provided in an embodiment of this application.
[0041] Figure 4 is a flowchart of an audio generation method provided in an embodiment of this application.
[0042] Figure 5 is a schematic diagram illustrating the workflow of an audio generation model provided in an embodiment of this application.
[0043] Figure 6 is a schematic diagram of a training device for an audio generation model provided in an embodiment of this application.
[0044] Figure 7 is a schematic diagram of an audio generation device provided in an embodiment of this application.
[0045] Figure 8 is a schematic diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0046] The technical solutions of the embodiments of this application will now be described with reference to the accompanying drawings. To facilitate a clear description of the technical solutions of the embodiments of this application, the use of terms such as "first," "second," etc., in the embodiments of this application is merely for illustration and to distinguish the objects being described. There is no particular order between them, nor does it indicate a specific limitation on the number of devices in the embodiments of this application, and they cannot constitute any limitation on the embodiments of this application.
[0047] In this embodiment, the initial training model, pre-trained model, model to be trained, and audio generation model share the same core model architecture. As shown in the schematic diagram of the model architecture in Figure 1, the architecture includes an audio encoder, an audio understanding model, a text encoder, a query transformer, a style adapter, and a speech synthesis module.
[0048] The audio encoder can be a Hidden-Unit BERT (HuBERT) model, which is used to encode audio samples to obtain audio encoding features.
[0049] The query transformer can be a bridging network (Querying Transformer, Q-former); the query transformer is used to extract features from the encoded features to obtain audio features, recognition text features, and description text features.
[0050] The text encoder may include a first text encoder and a second text encoder. The first text encoder may reuse the text encoder in the speech synthesis module. The first text encoder is used to encode the audio recognition text to obtain the recognition text encoding features. The second text encoder may be a bidirectional encoder representation model (Robustly Optimized BERT Pretraining Approach, RoBERTa). The second text encoder is used to encode the audio description text to obtain the description text encoding features.
[0051] The style adapter can be composed of a linear layer and a Rectified Linear Unit (ReLU) activation function. The style adapter connects audio features, recognized text features, and descriptive text features to the speech synthesis module, allowing the features to better adapt to the speech synthesis module. The speech synthesis module can employ an end-to-end model (Variational Inference with adversarial learning for end-to-end Text-to-Speech, VITS) to generate natural, fluent audio that conforms to the required style.
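The style adapter described in paragraph [0051], a linear layer followed by a ReLU activation, can be sketched in plain Python. The shapes, the single-layer depth, and the argument names are assumptions; the weights shown in any usage would be learned parameters in practice.

```python
def style_adapter(features, weight, bias):
    """Apply a linear layer followed by ReLU to each feature vector.

    features: [T, d_in] rows; weight: [d_out, d_in]; bias: [d_out].
    """
    adapted = []
    for row in features:
        linear = [sum(w * x for w, x in zip(w_row, row)) + b
                  for w_row, b in zip(weight, bias)]
        adapted.append([max(0.0, value) for value in linear])  # ReLU
    return adapted
```

Choosing `d_out` to match the conditioning width expected by the speech synthesis module is what lets the extracted features plug into that module.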
[0052] Furthermore, the bridging network Q-former may include a group attention unit (i.e., a grouped query attention unit), a cross attention unit (i.e., a dynamic sparse cross attention unit), a feature filtering unit (i.e., a hybrid gating mechanism unit), and a normalization unit (i.e., an adaptive normalization layer); among these, the feature filtering unit is used to capture local features of speech.
[0053] The speech synthesis module may contain a prior encoder, a posterior encoder, a stochastic duration predictor, a decoder, and a discriminator; the prior encoder may also include a text encoder, a normalizing flow, and a temporal alignment layer based on monotonic alignment search (MAS).
[0054] Furthermore, the grouped query attention unit comprises multiple stacked grouped query attention blocks. The dynamic sparse cross attention unit comprises multiple stacked dynamic sparse attention blocks. The hybrid gating mechanism unit comprises a gated linear unit (GLU) and a depthwise separable convolution.
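The two components of the hybrid gating mechanism unit named in paragraph [0054] can be sketched as follows. The application does not say how the gated linear unit and the depthwise separable convolution are composed, so the ordering in `hybrid_gate` (depthwise convolution, then a pointwise projection that doubles the channels, then the GLU) is an assumption, as are the zero padding and all names.

```python
import math

def glu(x):
    """Gated linear unit: first half of the channels gated by sigmoid of the second."""
    half = len(x[0]) // 2
    return [[a / (1.0 + math.exp(-b)) for a, b in zip(row[:half], row[half:])]
            for row in x]

def depthwise_conv1d(x, kernels):
    """x: [T, C]; kernels: one odd-length kernel per channel; zero 'same' padding."""
    t_len, pad = len(x), len(kernels[0]) // 2
    out = []
    for t in range(t_len):
        row = []
        for c, kernel in enumerate(kernels):
            acc = 0.0
            for j, w in enumerate(kernel):
                s = t + j - pad
                if 0 <= s < t_len:
                    acc += w * x[s][c]  # each channel sees only itself
            row.append(acc)
        out.append(row)
    return out

def pointwise_conv(x, weight):
    """1x1 convolution mixing channels; weight: [C_out, C_in]."""
    return [[sum(w * v for w, v in zip(w_row, row)) for w_row in weight]
            for row in x]

def hybrid_gate(x, kernels, pointwise_weight):
    """Assumed composition: depthwise conv -> pointwise conv -> GLU."""
    return glu(pointwise_conv(depthwise_conv1d(x, kernels), pointwise_weight))
```

The depthwise kernel looks at neighbouring frames within one channel, which is one way such a unit could capture the local speech features mentioned for the hybrid gating mechanism.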
[0055] The data flow in the model architecture is as follows. On one hand, audio samples are input into the audio encoder to obtain audio encoding features. On the other hand, audio samples are input into the audio understanding model to obtain audio recognition text, which is then encoded by the first text encoder to obtain recognition text encoding features. The audio description text is encoded by the second text encoder to obtain description text encoding features. The audio encoding features, recognition text encoding features, and description text encoding features are input into the query transformer for feature extraction. The extracted audio features, recognition text features, and description text features are input into the style adapter, which adjusts them to suit the speech synthesis module. The adjusted features are then input into the posterior encoder and the normalizing flow. Combined with the user identifier, the audio linear spectrum, and the input text, the speech synthesis module generates audio waveforms that match the user's input text, achieving accurate audio generation according to user requirements. The noise output from the stochastic duration predictor is used for model loss calculation.
[0056] It should be noted that the essential difference between the four models mentioned above lies in their parameter configurations. Specifically, the audio encoder, text encoder, query transformer, style adapter, and speech synthesis module included in the initial training model, pre-trained model, model to be trained, and audio generation model each have independent parameter sets. Different model functions are achieved by adjusting these parameters. For example, the parameters of the pre-trained model are determined through the first stage of training based on the initial training model, while the parameters of the model to be trained are iteratively optimized through the second stage of training on the pre-trained model.
[0057] The initial training model includes an initial training audio encoder, an initial training style adapter, and an initial training speech synthesis module. The pre-trained model includes a pre-trained audio encoder, a pre-trained style adapter, and a pre-trained speech synthesis module; the pre-trained model is trained based on the initial training model. During the training of the pre-trained model, the query transformer and the second text encoder do not participate in the training. The model to be trained includes the pre-trained model, a pre-trained text encoder, and multiple query transformers to be trained; the model to be trained is trained based on the pre-trained model. The audio generation model includes an audio encoder, a text encoder, multiple query transformers, a style adapter, and a speech synthesis module; the audio generation model is trained based on the model to be trained.
[0058] In practical applications, an audio clip consists of two abstract parts: one part focuses on textual expression, that is, the content corresponding to the audio, and the other part focuses on stylistic expression, that is, the audio's own speaking speed, tone, pitch, and emotion.
[0059] For example, in an audio clip, the speaker reads in a clear and loud voice with great enthusiasm, "Fighting day and night, we broke through thorns and overcame obstacles, and our dreams blossomed brightly at the finish line. Every drop of sweat shone with glory." In this case, features related to the text content can be used as the textual expression, while features related to the speaking style (such as passionate emotion) can be used as the style expression.
[0060] In the process of extracting features for style expression, features can be extracted cleanly only for quantifiable parameters such as speech rate and pitch. For overall style expression, especially emotional expression, corresponding features can be extracted, but these features are inevitably affected by text features.
[0061] For example, the words "breaking through thorns and blooming brightly" in the above example have significant stylistic tendencies. After feature extraction based on this, the style represented by the feature will be more or less affected by the text.
[0062] The aforementioned issues make it difficult to generate ideal speech when the natural language text used for speech synthesis deviates from the training data during the user's audio generation process.
[0063] For example, a user wants to synthesize an audio recording of a sad text being read aloud in an indignant style. Typically, similar texts in the training data correspond to a sad style, so the model cannot ignore the influence of the text when processing the style, resulting in the actual synthesis effect failing to meet the user's needs.
[0064] To address the aforementioned issues, this embodiment provides a training method for an audio generation model. During model training, an audio sample set is encoded using a pre-trained model to obtain encoded features. The audio sample set includes audio samples, audio recognition text, and audio description text. The audio description text describes the audio style of the audio samples. The model to be trained extracts features from the encoded features to obtain audio features, recognition text features, and description text features. A first loss is determined based on the audio features and recognition text features, and a second loss is determined based on the audio features and description text features. The parameters of the model to be trained are adjusted based on the first loss, the second loss, and the third loss of the pre-trained model to obtain the audio generation model. The audio generation model trained in this way can decouple the text and style parts of the audio as much as possible, thereby enabling the model to more accurately learn the correspondence between style description and audio style.
[0065] Figure 2 is a flowchart illustrating a training method for an audio generation model provided in an embodiment of this application. As shown in Figure 2, the method includes steps 201 to 204.
[0066] Step 201: Encode the audio sample set using a pre-trained model to obtain encoded features.
[0067] For example, the audio sample set includes audio samples, audio recognition text, and audio description text; the audio description text describes the audio style of the audio samples. Audio samples include audio recordings of conversations between users of different genders, ages, and emotions, such as an audio recording of a middle-aged woman expressing displeasure or a young man expressing excitement. Audio recognition text is the text obtained by recognizing the content of the audio samples and is consistent with that content; for example, if the content of the audio sample is "We have won," the audio recognition text is "We have won." Audio description text is the text that describes the style of the audio samples. The audio style covered by the audio description text may include pitch, speech rate, volume, and the gender, emotion, and / or age of the speaking user, and may be expressed using the keywords themselves or their synonyms (such as "tone" instead of "pitch"). Continuing with the example above, the audio content is "We have won," the audio recognition text is "We have won," and the audio description text is "Young man, high-pitched, fast-paced, full of excitement."
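As an illustration only, one entry of the audio sample set described above could be represented by a small record type. The class name, field names, and file path below are hypothetical and are not taken from this application; the sketch merely shows how the three parts of a sample stay together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioSample:
    """One entry of the audio sample set (all field names are illustrative)."""
    audio_path: str            # location of the waveform file (hypothetical path)
    recognition_text: str      # transcript, consistent with the spoken content
    description_text: str      # natural-language description of the audio style
    is_virtual: bool = False   # True for synthesized (virtual) audio samples


# The example from the paragraph above, stored as a real audio sample.
sample = AudioSample(
    audio_path="samples/0001.wav",
    recognition_text="We have won",
    description_text="Young man, high-pitched, fast-paced, full of excitement.",
)
```

Keeping the recognition text and the description text as separate fields mirrors the separation between content and style that the later training steps rely on.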
[0068] For example, the audio sample set includes a real audio sample set and a virtual audio sample set. The real audio sample set includes real audio samples, real audio recognition text, and real audio description text; the virtual audio sample set includes virtual audio samples, virtual audio recognition text, and virtual audio description text.
[0069] In some examples, during the construction of a real audio sample set, dialogue audio from multiple users can be obtained; the dialogue audio includes emotion tags; the first style description text corresponding to the dialogue audio is identified through an audio understanding model; the dialogue audio is used as a real audio sample in the audio sample set, and the first style description text is used as a real audio description text in the audio description text.
[0070] For example, real audio samples may include dialogue audio from multiple users. Real audio samples can be obtained based on public datasets, such as the Multi-modal Multi-scene Multi-label Emotional Dialogue Database (M3ED) or the Chinese Academy of Sciences Institute of Automation (CASIA).
[0071] Furthermore, based on an audio understanding model, such as the Speech Audio Language Music Open Neural Network (SALMONN) or the audio language model Qwen2-Audio, the corresponding real audio description text can be generated for the aforementioned dialogue audio. For example, known gender, emotion, and age information can be added using rule templates to constrain the large-scale audio understanding model and guide it to describe the speaker's six style attributes (pitch, speech rate, volume, gender, emotion, and age) in a single sentence.
[0072] For example, a sample rule template is set up as follows: "Please describe the speaker's style in the input audio in one sentence using Chinese. The style description includes six keywords: pitch, speech rate, volume, gender, emotion, and age. You can use the keywords themselves or replace them with their synonyms. For example, you can use 'tone' to describe 'pitch'. The description should be natural and concise." Based on the above template, if an audio clip of a middle-aged woman expressing dissatisfaction and resentment is input, the generated style description text would be: "Her tone sounds like that of a middle-aged person. Her voice is very sharp, her speech is fast and loud, giving a feeling of impatience and displeasure." In this way, the real audio description text (i.e., the first style description text) can be obtained for each real audio sample in the real audio sample set.
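A minimal sketch of how such a rule template might be assembled in code is given below. The function name `build_style_prompt`, the "Known information" suffix, and the way known labels are appended are assumptions made for illustration; the application only states that known gender, emotion, and age information can be added to constrain the audio understanding model.

```python
STYLE_KEYWORDS = ("pitch", "speech rate", "volume", "gender", "emotion", "age")


def build_style_prompt(gender=None, emotion=None, age=None):
    """Assemble the instruction text; known labels are appended as constraints."""
    prompt = (
        "Please describe the speaker's style in the input audio in one sentence "
        "using Chinese. The style description includes six keywords: "
        + ", ".join(STYLE_KEYWORDS)
        + ". You can use the keywords themselves or replace them with their "
        "synonyms. The description should be natural and concise."
    )
    # Only labels that are actually known are added to the prompt.
    known = {"gender": gender, "emotion": emotion, "age": age}
    hints = [f"{name}: {value}" for name, value in known.items() if value]
    if hints:
        prompt += " Known information: " + "; ".join(hints) + "."
    return prompt


prompt = build_style_prompt(gender="female", emotion="displeased", age="middle-aged")
```

The resulting string would be passed to the audio understanding model together with the audio clip; how that call is made depends on the model used and is not shown here.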
[0073] Furthermore, to compensate for the shortage of real audio samples and to provide sufficient, diverse, and high-quality training samples covering complex styles for model training, a virtual audio sample set can be generated based on the real audio sample set. In some examples, dialogue text containing emotion tags is obtained, and the dialogue audio is copied to obtain copied audio; audio is generated based on the dialogue text, dialogue audio, copied audio, and emotion tags to obtain generated audio samples; the second style description text corresponding to the generated audio samples is identified through an audio understanding model; the generated audio samples are used as virtual audio samples in the audio sample set, and the second style description text is used as virtual audio description text in the audio description text.
[0074] For example, constructing a virtual audio sample set involves two parts. On the one hand, audio is acquired: some real audio samples are selected, audio synthesis is performed using a zero-shot method, and the synthesized audio is filtered to obtain the copied audio. For example, audio with obvious emotions can be selected during filtering; the main emotions include happiness, surprise, sadness, disgust, anger, fear, and neutrality, as well as other possible emotions. On the other hand, dialogue text is acquired: a public text dataset is used as the text input for audio synthesis, for example the SMP2020 general dataset, which contains text corresponding to six emotions: positive, angry, sad, fear, surprise, and no emotion.
[0075] In the process of audio generation based on dialogue text, dialogue audio, copied audio, and emotion tags, virtual audio samples can be generated using a speech synthesis model based on a seed dataset. For example, the speech synthesis model could be fish-speech, which uses the aforementioned text and audio datasets to synthesize virtual audio samples with different emotions. During synthesis, the emotion tags in the text must correspond to the emotion tags in the speech. Furthermore, by adjusting different parameters of the speech synthesis model, including speech rate and pauses, similar contextual speech can be generated.
[0076] For example, the virtual audio description text is generated in the same way as the real audio description text of the real audio samples: the virtual audio samples obtained above are input into the audio understanding model to generate the virtual audio description text (i.e., the second style description text) corresponding to the virtual audio samples.
[0077] In some examples, audio sample sets are encoded using pre-trained models to obtain encoded features, including: encoding audio samples using a pre-trained audio encoder of the pre-trained model to obtain audio encoded features; encoding audio descriptive text using a pre-trained text encoder to obtain descriptive text encoded features; and encoding audio recognition text using a pre-trained speech synthesis module of the pre-trained model to obtain recognition text encoded features.
[0078] For example, an audio sample set can be encoded using a pre-trained encoder of a pre-trained model. After pre-training, the parameters of the pre-trained encoder are fixed, and the parameters of the pre-trained encoder do not change during the secondary training process based on the pre-trained encoder.
[0079] Since the audio sample set contains three types of samples—audio samples, audio recognition text, and / or audio description text—three types of pre-trained encoders can be configured to process the samples more effectively and improve the efficiency of encoding. These pre-trained encoders include an audio encoder, a first text encoder, and / or a second text encoder. Correspondingly, the encoding features include audio encoding features obtained by encoding based on audio samples using the audio encoder, recognition text encoding features obtained by encoding based on audio recognition text using the first text encoder, and description text encoding features obtained by encoding based on audio description text using the second text encoder.
[0080] The audio encoder encodes audio samples and extracts audio-related encoding features (i.e., speech embeddings), which cover various aspects such as speech content, style, and speaker timbre. The audio encoder directly outputs audio encoding features containing both textual and stylistic information. The first text encoder encodes the audio recognition text (i.e., transcribed text) corresponding to the audio sample to obtain the recognition text encoding features (i.e., text embeddings). The first text encoder can reuse the text encoder module in VITS, meaning that the text encoder module of the VITS model is used directly to encode the audio recognition text. The second text encoder encodes the audio description text and extracts description text encoding features (i.e., style embeddings). The second text encoder can be implemented based on the bidirectional encoder representation model (RoBERTa).
[0081] In some examples, the pre-trained model can be trained as follows: initial audio features are obtained by encoding audio samples using an initial audio encoder, and initial recognition text features are obtained by encoding audio recognition text using a prior encoder of the initial speech synthesis module; the initial audio features are transformed using an initial style adapter to obtain the feature transformation result; based on the audio linear spectrum extracted from the audio samples, the feature transformation result, and the initial recognition text features, speech is synthesized using the initial speech synthesis module to obtain the initial audio; a third loss is determined based on the initial audio and the reference audio, and the parameters of the initial audio encoder, initial style adapter, and initial speech synthesis module are adjusted according to the third loss to obtain a pre-trained model containing a pre-trained audio encoder, a pre-trained style adapter, and a pre-trained speech synthesis module.
[0082] For example, the process of training the initial training model to obtain a pre-trained model can be considered as the first training phase (i.e., the pre-training phase). In the first training phase, input audio samples are used to train the initial audio encoder, the initial style adaptor, and the initial speech synthesis module VITS. The first text encoder is the text encoder part of VITS. In this phase, the Q-former and the second text encoder do not participate in the training. The purpose of this training phase is to initialize the above modules and guide the model to converge better.
[0083] The training process is as follows: Audio samples are processed by the audio encoder to output speech embeddings, which are then fed into the style adapter module. The speech features output by the style adapter module are fed into the VITS stochastic duration predictor, flow, and posterior encoder, respectively. The audio recognition text is encoded by the VITS text encoder and then sequentially fed into the Monotonic Alignment Search (MAS) alignment layer, flow, and posterior encoder. The pre-extracted audio linear spectrum is input into the posterior encoder, and the result is finally fed into the decoder layer to output the initial audio. Then, a third loss is calculated based on the initial audio and the reference audio. The parameters of the initial training model are adjusted according to the third loss to obtain the pre-trained model. The third loss can be calculated in the following way:
[0084] $$L_{3} = L_{recon} + L_{kl} + L_{dur} + L_{adv}(G) + L_{fm}(G)$$
[0085] where $L_{3}$ is the third loss, $L_{recon}$ is the Mel spectrum reconstruction loss, $L_{kl}$ is the relative entropy (Kullback-Leibler, KL) loss, $L_{dur}$ is the loss of the stochastic duration predictor, $L_{adv}(G)$ is the least squares loss function of adversarial training, and $L_{fm}(G)$ is the feature matching loss for the generator. During pre-training, the parameters of the style adapter, VITS, and audio encoder are adjusted. After training is complete, the parameters of each of the above modules are frozen for subsequent second-stage model training.
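The following is a minimal NumPy sketch of how the five loss terms listed above might be combined. It assumes the formulations commonly used with VITS (L1 Mel reconstruction, least squares generator loss, per-layer L1 feature matching) and an unweighted sum; the function names and the unit weights are assumptions rather than details stated in this application, and the KL and duration terms are taken as precomputed values.

```python
import numpy as np


def mel_reconstruction_loss(mel_ref, mel_gen):
    """Mean L1 distance between reference and generated Mel spectrograms."""
    return float(np.mean(np.abs(mel_ref - mel_gen)))


def lsgan_generator_loss(disc_scores_fake):
    """Least squares adversarial loss for the generator: push D(G(z)) toward 1."""
    return float(np.mean((disc_scores_fake - 1.0) ** 2))


def feature_matching_loss(feats_real, feats_fake):
    """Sum over discriminator layers of the mean L1 distance between feature maps."""
    return float(sum(np.mean(np.abs(r - f)) for r, f in zip(feats_real, feats_fake)))


def third_loss(l_recon, l_kl, l_dur, l_adv, l_fm):
    """Unweighted sum of the five terms (unit weights are an assumption)."""
    return l_recon + l_kl + l_dur + l_adv + l_fm


# Toy values: identical spectrograms and a fully fooled discriminator give zero
# for the reconstruction and adversarial terms.
mel = np.ones((80, 10))
total = third_loss(
    mel_reconstruction_loss(mel, mel),
    l_kl=0.2,
    l_dur=0.1,
    l_adv=lsgan_generator_loss(np.ones(4)),
    l_fm=feature_matching_loss([np.zeros(3)], [np.ones(3)]),
)
```

In an actual training loop these terms would be computed with an automatic differentiation framework so that gradients reach the audio encoder, style adapter, and speech synthesis module.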
[0086] Step 202: Extract features from the encoded features using the model to be trained to obtain audio features, recognition text features, and description text features.
[0087] In some examples, the encoded features can be extracted using the Q-former of the model to be trained, resulting in audio features, recognized text features, and descriptive text features. Specifically, the Q-former is used to perform self-attention calculations on the outputs of the audio encoder, the first text encoder, and the second text encoder, and to perform modal alignment and relevance calculations between the outputs of different encoders.
[0088] For example, the model to be trained includes a pre-trained model, a pre-trained text encoder, and multiple query transformers to be trained, wherein any query transformer includes a grouped attention unit, a cross attention unit, a feature filtering unit, and a normalization unit. There may be multiple query transformers to be trained, for example, a first query transformer to be trained that processes audio encoding features, a second query transformer to be trained that processes description text encoding features, and a third query transformer to be trained that processes recognition text encoding features.
[0089] In some examples, feature extraction is performed on the encoded features through a query transformer to obtain audio features, recognition text features, and descriptive text features. This includes: extracting audio features from the encoded audio features using a first query transformer of the model to be trained; extracting descriptive text features from the encoded descriptive text features using a second query transformer of the model to be trained; and extracting recognition text features from the encoded recognition text features using a third query transformer of the model to be trained.
[0090] In some examples, the model to be trained extracts features from the encoded features to obtain audio features, recognition text features, and descriptive text features. This includes: calculating attention weights on the encoded features through a grouped attention unit to obtain a first feature sequence; calculating attention weights on the first feature sequence through a cross attention unit to obtain a second feature sequence; filtering the second feature sequence through a feature filtering unit to obtain a third feature sequence; and normalizing the third feature sequence through a normalization unit to obtain audio features, recognition text features, and descriptive text features.
[0091] Further, attention weights are calculated on the audio encoded features using grouped attention units to obtain a first feature sequence, including: performing vector transformation on the query vector using grouped attention units to obtain a first grouped query matrix, a first grouped key matrix, and a first grouped value matrix; wherein the query vector is used to extract semantics related to the text content from the audio features included in the encoded features; splitting the first grouped query matrix, the first grouped key matrix, and the first grouped value matrix to obtain a first sub-grouped query matrix, a first sub-grouped key matrix, and a first sub-grouped value matrix, and using the first sub-grouped query matrix as the first query group, and the first sub-grouped key matrix and the first sub-grouped value matrix as the first key-value group; wherein any first key-value group corresponds to multiple first query groups; calculating attention weights on the first query groups to obtain a first weight matrix; assigning weights to the first sub-grouped value matrices in the first key-value groups corresponding to the first query groups based on the first weight matrix to obtain weighted first sub-grouped value matrices; and concatenating the weighted first sub-grouped value matrices to obtain a first feature sequence corresponding to the audio encoded features.
[0092] Furthermore, the second feature sequence is obtained by calculating attention weights on the first feature sequence through a cross-attention unit, including: converting the first feature sequence into a cross-query matrix and converting the encoded features into a cross-key matrix and a cross-value matrix through the cross-attention unit; splitting the cross-query matrix, cross-key matrix, and cross-value matrix to obtain multiple attention heads; wherein each attention head corresponds to a second sub-cross-query matrix, a second sub-cross-key matrix, and a second sub-cross-value matrix; performing feature filtering based on the attention heads to obtain a masking matrix, and calculating attention weights based on the masking matrix to obtain a third weight matrix; assigning weights to the second sub-cross-value matrix based on the third weight matrix to obtain a weighted second sub-cross-value matrix; and concatenating the weighted second sub-cross-value matrices to obtain the second feature sequence corresponding to the audio encoding features, the recognized text encoding features, or the descriptive text encoding features.
[0093] Furthermore, the first feature sequence is obtained by calculating attention weights on the recognition text encoding features and / or descriptive text encoding features through a grouped attention unit, including: performing vector transformation on the recognition text encoding features and / or descriptive text encoding features through the grouped attention unit to obtain a second grouped query matrix, a second grouped key matrix, and a second grouped value matrix; splitting the second grouped query matrix, the second grouped key matrix, and the second grouped value matrix to obtain a second sub-grouped query matrix, a second sub-grouped key matrix, and a second sub-grouped value matrix, and using the second sub-grouped query matrix as a second query group, and the second sub-grouped key matrix and the second sub-grouped value matrix as a second key-value group; wherein any second key-value group corresponds to multiple second query groups; calculating attention weights on the second query groups to obtain a second weight matrix; assigning weights to the second sub-grouped value matrices in the second key-value groups corresponding to the second query groups based on the second weight matrix to obtain weighted second sub-grouped value matrices; and concatenating the weighted second sub-grouped value matrices to obtain the first feature sequence corresponding to the recognition text encoding features or descriptive text encoding features.
[0094] For example, as shown in the structural diagram of the Q-former in Figure 3, the first Q-former extracts the corresponding Q-Embedding from the speech embedding by training a set of learnable query vectors (Q-queries). Here, the Q-queries are the set of vectors configured in the Q-former and used to extract abstract semantics related to the text from the speech embedding features. The Q-queries are initially a set of random learnable parameters (or parameters of a pre-trained model similar to BERT base), and through model training they dynamically adapt to the semantic structure of the speech features.
[0095] After the audio encoding features are obtained as described above, attention is calculated using the first Q-former to obtain the audio features. The operations performed by the grouped query attention unit, dynamic sparse cross-attention unit, hybrid gating mechanism unit, and adaptive normalization layer in the first Q-former are as follows:
[0096] (1) Grouped query attention unit:
[0097] As shown above, the grouped query attention unit consists of multiple stacked grouped query attention blocks. The working process of each grouped query attention block is as follows:
[0098] Using Q-queries as input to the grouped query attention unit, the Q-queries are first transformed into corresponding query (Q) matrices, key (K) matrices, and value (V) matrices through multiple linear projection layers. Then, each of these matrices is further divided into m Q-submatrices, n K-submatrices, and n V-submatrices, where m is greater than n and is an integer multiple of n. Thus, the m Q-submatrices form m query groups, and the n K-submatrices and V-submatrices form n key-value groups. Within each key-value group, there is a one-to-one correspondence between the K-submatrices and V-submatrices, and one key-value group corresponds to multiple (m / n) query groups.
[0099] Each query group independently performs attention weight calculation (multiple query groups corresponding to the same key-value group share K and V within that key-value group during the calculation process), outputting m weight matrices. These weight matrices are then used to assign weights to the V submatrices within the key-value group corresponding to that query group, resulting in m weighted V submatrices.
[0100] The V sub-matrices after the above m weight assignments are concatenated to obtain the output sequence of the grouped query attention block. This output sequence is then input into the next layer of grouped query attention blocks, and the above operation is repeated. After iterative calculation through multiple layers of grouped query attention blocks, the final output feature sequence is the first feature sequence output by the grouped query attention unit. This method significantly reduces the KV cache memory usage during attention calculation, thereby accelerating model training.
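The working process of a single grouped query attention block can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the projection matrices `w_q`, `w_k`, and `w_v` stand in for the linear projection layers, scaling by the square root of the sub-matrix width is the usual convention rather than something stated above, and residual connections or stacking of several blocks are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(queries, w_q, w_k, w_v, m, n):
    """queries: (L, d); w_q: (d, m*dh); w_k, w_v: (d, n*dh).

    m query groups share n key-value groups, with m > n and m a multiple of n.
    """
    if m <= n or m % n != 0:
        raise ValueError("m must be greater than n and an integer multiple of n")
    length = queries.shape[0]
    dh = w_q.shape[1] // m
    # Project, then split into m Q sub-matrices and n K / V sub-matrices.
    q = (queries @ w_q).reshape(length, m, dh).transpose(1, 0, 2)  # (m, L, dh)
    k = (queries @ w_k).reshape(length, n, dh).transpose(1, 0, 2)  # (n, L, dh)
    v = (queries @ w_v).reshape(length, n, dh).transpose(1, 0, 2)  # (n, L, dh)
    outputs = []
    for g in range(m):
        kv = g // (m // n)  # key-value group shared by this query group
        weights = softmax(q[g] @ k[kv].T / np.sqrt(dh))  # (L, L) weight matrix
        outputs.append(weights @ v[kv])                  # weighted V sub-matrix
    # Concatenate the m weighted V sub-matrices into the block output.
    return np.concatenate(outputs, axis=-1)              # (L, m*dh)


rng = np.random.default_rng(0)
q_queries = rng.normal(size=(4, 8))
w_q = rng.normal(size=(8, 8))
w_k = rng.normal(size=(8, 4))
w_v = rng.normal(size=(8, 4))
first_sequence = grouped_query_attention(q_queries, w_q, w_k, w_v, m=4, n=2)
```

With m = 4 and n = 2, only two K and two V sub-matrices are stored for four query groups, which is where the memory saving mentioned above comes from.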
[0101] (2) Dynamic sparse cross-attention unit:
[0102] Similarly, the dynamic sparse cross-attention unit comprises multiple stacked dynamic sparse attention blocks, and the working process of each dynamic sparse attention block is as follows:
[0103] After the first feature sequence is input, it undergoes self-attention processing through a first self-attention layer. Following this processing, it is converted into the corresponding Q matrix through a linear projection layer. Simultaneously, the speech embedding output from the audio encoder is converted into K and V matrices through linear projection layers. The Q, K, and V matrices are then split into multiple attention heads according to the feature dimension, with each attention head corresponding to a Q-submatrix, a K-submatrix, and a V-submatrix. The attention weights are calculated independently by each attention head.
[0104] Within each attention head, a masking matrix is first calculated. This masking matrix determines whether the i-th token meets preset conditions relative to the j-th token during subsequent attention calculations. Tokens meeting these conditions are given attention, while those not meeting the conditions are ignored. The preset conditions include: 1) Calculating the similarity between the Q-submatrix and K-submatrix within each attention head. The masking matrices for multiple attention heads are calculated independently. Specifically, the dot product between the Q-submatrix and K-submatrix is calculated, and the result represents the similarity between corresponding tokens in the two submatrices. For a given token, tokens with similarity values greater than or equal to a preset threshold are marked as attention tokens, and those with similarity values less than the preset threshold are marked as ignore tokens. 2) Scoring each token using a preset multilayer perceptron (MLP) network. For a given token, tokens with scores greater than or equal to a preset threshold are marked as attention tokens, and those with scores less than the preset threshold are marked as ignore tokens. The token scores depend on the initial training of the MLP network, during which the MLP network learns the importance of a particular token to other tokens in the training samples. 3) Clustering all tokens using a preset BERT network to form a predetermined number of token classes, each containing several tokens. For a given token, only tokens belonging to the same token class are marked as attention tokens, while other tokens are marked as ignore tokens. It should be noted that the calculations in methods 2) and 3) above do not involve the Q and K submatrices; they can be calculated directly using the original tokens (i.e., the speech embedding or the output of the previous attention block).
[0105] Based on the above calculations, a masking matrix can be output. This masking matrix represents the importance of the i-th token relative to the j-th token in the token sequence, i.e., whether it is attended to or ignored. Attention tokens are included in subsequent attention calculations, while ignore tokens are not. In the masking matrix, attention tokens are marked as 1, and ignore tokens are marked as 0.
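As a hedged illustration, the masking matrix for rules 1) and 3) could be built as follows. The function names are hypothetical, the threshold value is arbitrary, and the token class labels are assumed to have been produced beforehand by whatever clustering network is used; rule 2) is omitted because it depends on a trained scoring network.

```python
import numpy as np


def similarity_mask(q_sub, k_sub, threshold):
    """Rule 1: entry (i, j) is 1 when the dot product q_i . k_j >= threshold."""
    return (q_sub @ k_sub.T >= threshold).astype(np.int64)


def cluster_mask(query_labels, key_labels):
    """Rule 3: entry (i, j) is 1 when tokens i and j fall in the same token class."""
    query_labels = np.asarray(query_labels)
    key_labels = np.asarray(key_labels)
    return (query_labels[:, None] == key_labels[None, :]).astype(np.int64)


# Two query tokens and three key tokens for a single attention head.
q_sub = np.array([[1.0, 0.0], [0.0, 1.0]])
k_sub = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])
mask_by_similarity = similarity_mask(q_sub, k_sub, threshold=0.5)
mask_by_cluster = cluster_mask([0, 1], [0, 0, 1])
```

Each row of the resulting matrix lists, for one query token, which key tokens are attention tokens (1) and which are ignore tokens (0).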
[0106] After the masking matrix is calculated, dynamic sparse attention can be calculated, with multiple attention heads independently calculating attention weights. During the calculation, based on the masking matrix, attention weights are calculated only for the attention tokens, and the corresponding weight matrix is output. The weight matrix is then used to assign weights to the corresponding V sub-matrices to obtain the weighted V sub-matrices. Multiple attention heads output multiple weighted V sub-matrices respectively.
[0107] The weighted V sub-matrices obtained above are concatenated to obtain the output sequence of the dynamic sparse attention block. The output sequence is then input into the next layer of dynamic sparse attention block, and the above operation is repeated.
[0108] It should be noted that in the next layer of dynamic sparse attention blocks, the first feature sequence (or the corresponding sequence in this layer) after self-attention processing is used as the query sequence. This process is repeated, and after iterative calculations through multiple layers of dynamic sparse attention blocks, the final output feature sequence is the second feature sequence finally output by the dynamic sparse cross-attention unit.
[0109] By employing the above method, each query focuses only on the key-value pairs with the highest similarity, further reducing computational complexity. From the computational results, the application of this cross-attention mechanism enables the retrieval of features related to Q-queries from the speech embedding, ensuring that the final output sequence exhibits a strong correlation with the text.
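The masked attention step of one dynamic sparse attention block can be sketched as below. This is a simplified sketch, not the application's implementation: `q`, `k`, and `v` are assumed to be already projected, the masking matrix is supplied by the caller, scaling by the square root of the head width is a conventional assumption, and a query row with no attention tokens is given an all-zero output.

```python
import numpy as np


def split_heads(x, num_heads):
    """(L, d) -> (num_heads, L, d // num_heads), split along the feature dimension."""
    length, d = x.shape
    return x.reshape(length, num_heads, d // num_heads).transpose(1, 0, 2)


def sparse_cross_attention(q, k, v, mask, num_heads):
    """q: (Lq, d); k, v: (Lk, d); mask: (num_heads, Lq, Lk), 1 = attend, 0 = ignore."""
    lq, d = q.shape
    dh = d // num_heads
    qh, kh, vh = (split_heads(x, num_heads) for x in (q, k, v))
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(dh)        # (H, Lq, Lk)
    scores = np.where(mask.astype(bool), scores, -np.inf)    # drop ignore tokens
    # Stable softmax over attention tokens only; fully masked rows stay at zero.
    row_max = scores.max(axis=-1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    e = np.exp(scores - row_max)
    denom = e.sum(axis=-1, keepdims=True)
    weights = np.divide(e, denom, out=np.zeros_like(e), where=denom > 0)
    out = weights @ vh                                       # weighted V sub-matrices
    # Concatenate the heads back into one sequence of width d.
    return out.transpose(1, 0, 2).reshape(lq, d), weights


v = np.arange(12, dtype=float).reshape(3, 4)
k = np.ones((3, 4))
q = np.ones((2, 4))
mask = np.zeros((2, 2, 3), dtype=int)
mask[:, 0, 1] = 1  # query 0 attends only to key 1
mask[:, 1, 2] = 1  # query 1 attends only to key 2
second_sequence, attn = sparse_cross_attention(q, k, v, mask, num_heads=2)
```

In this toy case each query attends to exactly one key, so each output row is simply the corresponding row of `v`, which makes the effect of the masking matrix easy to verify.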
[0110] (3) Hybrid gating mechanism unit:
[0111] The hybrid gating mechanism unit includes gated linear units (GLU) and depthwise separable convolutions. It performs forward inference on the second feature sequence through the hybrid gating mechanism to obtain the third feature sequence, which captures local features of the audio samples.
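One possible arrangement of these two components is sketched below in NumPy. The ordering (depthwise convolution along the time axis, then a pointwise projection to twice the channel width, then the GLU gate), the zero padding, and all parameter names are assumptions made for illustration; the application does not specify how the GLU and the depthwise separable convolution are wired together.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def depthwise_conv1d(x, kernels):
    """x: (L, d); kernels: (d, k) with odd k. One filter per channel, zero padded."""
    length, _ = x.shape
    k = kernels.shape[1]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for t in range(k):
        # Each channel is filtered independently along the time axis.
        out += xp[t:t + length] * kernels[:, t]
    return out


def hybrid_gating(x, dw_kernels, w_point, b_point):
    """Depthwise conv -> pointwise projection to 2*d channels -> GLU gate."""
    h = depthwise_conv1d(x, dw_kernels) @ w_point + b_point   # (L, 2d)
    value, gate = np.split(h, 2, axis=-1)
    return value * sigmoid(gate)                              # (L, d)


x = np.arange(12, dtype=float).reshape(4, 3)
identity_kernels = np.zeros((3, 3))
identity_kernels[:, 1] = 1.0                                  # centre tap only
w_point = np.hstack([np.eye(3), np.zeros((3, 3))])            # value = x, gate = 0
third_sequence = hybrid_gating(x, identity_kernels, w_point, np.zeros(6))
```

The depthwise filter mixes neighbouring time steps within each channel, which is how the unit can pick up local patterns, while the gate decides how much of each value passes through.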
[0112] (4) Adaptive normalization layer:
[0113] The third feature sequence is then adaptively normalized to output the aforementioned speech feature Q-Embedding. Here, Q-Embedding represents the features in the speech embedding related to the query vector Q-queries.
[0114] It should be noted that a second Q-Former and a third Q-Former are provided for the first text encoder and the second text encoder, respectively. The overall structure and principle of the second and third Q-Formers are basically the same as those of the first Q-Former. The difference is that in the second and third Q-Formers, the input objects are the text embedding output by the first text encoder and the style embedding output by the second text encoder, respectively, and the outputs are the T-Embedding representing the transcribed text features and the S-Embedding representing the style text features, respectively. In the grouped query attention units of the second and third Q-Formers, attention is calculated directly using the text embedding / style embedding as the query, key, and value, instead of using the query vectors. Correspondingly, the subsequent dynamic sparse attention unit adopts a self-attention mechanism, also directly using the text embedding / style embedding as the query, key, and value for attention calculation, and the query vectors are no longer used as the query. For the remaining principles, reference may be made to the first Q-Former; they are not repeated in this embodiment.
[0115] Step 203: Determine the first loss based on the audio features and the recognized text features, and determine the second loss based on the audio features and the descriptive text features.
[0116] For example, after outputting the above Q-Embedding, T-Embedding, and S-Embedding, in order to achieve the aforementioned decoupling of audio text and style, this embodiment minimizes the correlation between Q-Embedding and T-Embedding and maximizes the correlation between Q-Embedding and S-Embedding. By doing so, the correlation between audio and transcribed text is weakened, while the correlation between audio and style text is strengthened. This decouples transcribed text and style text during training, thereby enabling a more ideal speech synthesis effect when style is described using natural language in the subsequent model inference process.
[0117] In some examples, determining the first loss based on audio features and recognized text features includes: calculating conditional probabilities based on audio features and recognized text features; calculating log ratios based on conditional probabilities and summing the log ratios to obtain a ratio sum; and determining the first loss based on the ratio sum and the number of features of the audio features and recognized text features.
[0118] For example, the mutual information upper bound $U(Q;T)$ is used as the first loss to evaluate the correlation between the Q-Embedding (denoted $Q$ in the calculation) and the T-Embedding (denoted $T$ in the calculation), using the following formula:
[0119] $$U(Q;T) = \frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ \log p(t_i \mid q_i) - \log p(t_j \mid q_i) \right]$$
[0120] where $p(t_i \mid q_i)$ and $p(t_j \mid q_i)$ denote the conditional probabilities of the $i$-th sample $t_i$ and the $j$-th sample $t_j$, respectively, given the $i$-th sample $q_i$, and $N$ is the number of features in $Q$ and $T$. Each log ratio captures the difference between the two conditional probabilities, and summing over all log ratios yields the upper bound $U(Q;T)$ on the mutual information between $Q$ and $T$. During model training, $U(Q;T)$ is included in the model loss function and minimized to reduce the correlation between the two. The principle behind this calculation is as follows: what actually needs to be minimized is the mutual information $I(Q;T)$ between $Q$ and $T$, but this mutual information is difficult to calculate, so the corresponding upper bound $U(Q;T)$ is calculated instead. Since the two satisfy $I(Q;T) \le U(Q;T)$, the smaller the upper bound $U(Q;T)$, the smaller the mutual information $I(Q;T)$.
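The sampled upper bound described above (sum the log ratios of conditional probabilities, then divide by the number of pairs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the conditional probability is modelled as a unit-variance Gaussian, the mean predictor defaults to the identity, and the names `gaussian_log_prob`, `mi_upper_bound`, and `predict_mean` are hypothetical. In practice the conditional distribution would typically be approximated by a learned network.

```python
import math


def gaussian_log_prob(t, mu, var=1.0):
    """Log density of vector t under an isotropic Gaussian N(mu, var * I)."""
    dim = len(t)
    squared_dist = sum((a - b) ** 2 for a, b in zip(t, mu))
    return -0.5 * (dim * math.log(2 * math.pi * var) + squared_dist / var)


def mi_upper_bound(q_feats, t_feats, predict_mean=lambda q: q):
    """Average over all (i, j) of log p(t_i | q_i) - log p(t_j | q_i).

    predict_mean maps an audio feature q_i to the mean of the assumed
    Gaussian over text features (identity here, purely for illustration).
    """
    n = len(q_feats)
    ratio_sum = 0.0
    for i in range(n):
        mu_i = predict_mean(q_feats[i])
        log_p_matched = gaussian_log_prob(t_feats[i], mu_i)
        for j in range(n):
            ratio_sum += log_p_matched - gaussian_log_prob(t_feats[j], mu_i)
    return ratio_sum / (n * n)


# Strongly coupled features give a larger bound than uninformative ones.
coupled = mi_upper_bound([[0.0, 0.0], [2.0, 0.0]], [[0.0, 0.0], [2.0, 0.0]])
uninformative = mi_upper_bound([[0.0, 0.0], [2.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]])
```

Minimizing this quantity during training pushes the audio features toward carrying less information about the transcribed text, which is the decoupling goal stated above.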
[0121] In some examples, based on the above audio sample set including real audio sample set and / or virtual audio sample set, the audio description text can be classified according to emotion tag to obtain first-level audio description text; the first-level audio description text can be classified according to emotion sub-tag under emotion tag to obtain second-level audio description text; and a positive sample set, a first negative sample set and a second negative sample set can be generated based on emotion tag, emotion sub-tag, first-level audio description text and second-level audio description text.
[0122] For example, the positive sample set includes the secondary audio description text under the emotion sub-label of any emotion label, the first negative sample set includes the secondary audio description text under other emotion sub-labels of any emotion label, and the second negative sample set includes the primary audio description text under other emotion labels.
[0123] For example, based on the emotion tags mentioned above, the audio sample set is divided into N independent primary categories, each of which includes K secondary categories. The primary and secondary categories correspond to different style categories.
[0124] It should be noted that the primary and secondary categories mentioned above do not have a strict hierarchical relationship. They can be combined based on the relevance of emotions. Taking happiness as an example, happiness can form compound emotions with other emotions, such as happiness and excitement, happiness but anxiety, happiness but bittersweetness, etc. In this embodiment, happiness can be used as the primary category, and happiness and excitement, happiness but anxiety, and happiness but bittersweetness as secondary categories. In each training step, one training sample is selected from each secondary category of each primary category, i.e., N×K groups of training samples are selected for training. Among the N×K groups, there is one positive sample, K-1 style-approximate negative samples (from the other secondary categories in the same primary category as the positive sample), and (N-1)×K irrelevant negative samples of different styles (from the other primary categories).
[0125] In some examples, the positive sample loss is determined based on the audio features and the descriptive text features of the secondary audio description text in the positive sample set; the first negative sample loss is determined based on the audio features and the descriptive text features of the secondary audio description text in the first negative sample set; and the second negative sample loss is determined based on the audio features and the descriptive text features of the primary audio description text in the second negative sample set. The sum of the positive sample loss, the first negative sample loss, and the second negative sample loss is used as the second loss.
[0126] In the actual training process, the training audio of positive samples and the corresponding style text are used as the input of positive samples, while the training audio of positive samples and the style text of negative samples are used as the input of negative samples. For the latter, the model is instructed to learn the mismatch between audio and style text in order to enhance the model's ability to distinguish between different styles.
[0127] For example, assuming there are 800 training samples, they are first divided into 8 primary categories based on labels, i.e., N=8. Then, considering possible compound emotions, each primary category is further divided into 4 secondary categories, i.e., K=4. In each subsequent training step, 8×4 = 32 groups of training data are selected, of which 1 group is the positive sample, 3 groups are style-approximate negative samples, and the remaining 28 groups are irrelevant negative samples.
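The grouping in this example can be checked with a short sketch. The category labels and the function name `build_groups` are hypothetical and serve only to reproduce the 1 / 3 / 28 split for N=8 and K=4.

```python
def build_groups(anchor, categories):
    """Split all (primary, secondary) pairs relative to the anchor pair.

    categories maps each primary category to its list of secondary categories.
    Returns (positive, similar_negatives, unrelated_negatives).
    """
    positive, similar_negatives, unrelated_negatives = [], [], []
    for primary, secondaries in categories.items():
        for secondary in secondaries:
            pair = (primary, secondary)
            if pair == anchor:
                positive.append(pair)             # the matching style
            elif primary == anchor[0]:
                similar_negatives.append(pair)    # same primary, other secondary
            else:
                unrelated_negatives.append(pair)  # different primary category
    return positive, similar_negatives, unrelated_negatives


# N = 8 primary categories, K = 4 secondary categories each (placeholder labels).
categories = {
    f"primary_{n}": [f"primary_{n}/secondary_{k}" for k in range(4)]
    for n in range(8)
}
pos, similar, unrelated = build_groups(("primary_0", "primary_0/secondary_0"), categories)
counts = (len(pos), len(similar), len(unrelated))
```

With these inputs the counts are 1, K-1 = 3, and (N-1)×K = 28, matching the figures in the example above.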
[0128] Cosine similarity is used to measure the distance between the Q-Embedding and the S-Embedding, and the second loss is composed of the terms described below.
[0129] The weighting coefficients control the contribution of each term in the loss function. The threshold m controls the margin between the speech features and the style features that are unrelated to the speech's text description. The positive term is the cosine similarity between the Q-Embedding and the style-matched positive-sample style text; the larger the value, the more similar they are. The first negative term is the cosine similarity between the Q-Embedding and the style-approximate negative-sample style text; the smaller the value, the less similar they are. The second negative term is the cosine similarity between the Q-Embedding and the negative-sample style text of a different style; again, the smaller the value, the less similar they are. ReLU applies a penalty when this similarity is high, and m is added to indirectly increase the distance between positive and negative samples. The goal is to push negative samples of different styles as far from the positive samples as possible so that they can be clearly distinguished. Overall, the smaller the loss, the better.
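One possible shape for such a loss is sketched below. The exact formula is not recoverable from the text, so everything here is an assumption made for illustration: the positive term is taken as one minus the cosine similarity, the style-approximate negatives are penalized through ReLU of their similarity, the different-style negatives are penalized through ReLU of their similarity plus the margin `m`, and `alpha`, `beta`, and `style_loss` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relu(x):
    return max(0.0, x)


def style_loss(q, s_pos, s_similar, s_unrelated, alpha=1.0, beta=1.0, m=0.2):
    """Assumed form: pull q toward s_pos, push it away from both negative sets."""
    positive_term = 1.0 - cosine(q, s_pos)
    similar_term = sum(relu(cosine(q, s)) for s in s_similar) / len(s_similar)
    # The margin m keeps penalizing different-style negatives until their
    # similarity to q drops below -m.
    unrelated_term = sum(relu(cosine(q, s) + m) for s in s_unrelated) / len(s_unrelated)
    return positive_term + alpha * similar_term + beta * unrelated_term


q = [1.0, 0.0]
well_separated = style_loss(q, [1.0, 0.0], [[0.0, 1.0]], [[-1.0, 0.0]])
confused = style_loss(q, [1.0, 0.0], [[0.0, 1.0]], [[1.0, 0.0]])
```

In this sketch the loss is zero when the positive style text aligns with the audio feature and the different-style negative points away from it, and it grows when a different-style negative is as similar to the audio feature as the positive is.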
[0130] In this way, on the one hand, the model's ability to identify and distinguish styles is further enhanced by contrastive learning over positive and negative samples. On the other hand, by setting up and calculating style-approximate negative samples and irrelevant negative samples separately, the adverse effect of similar styles carried by negative samples in contrastive learning is effectively reduced, ensuring that the speech styles of different categories show significant and substantial differences, thereby significantly enhancing the final style generation effect of the model.
[0131] Building on this, style is influenced not only by the text but also by the speaker's own vocal characteristics, so the speaker's characteristics also affect the final result when stylized speech is generated from a natural language description. Therefore, this embodiment further decouples the audio from the speaker's vocal characteristics during training. The process is as follows: the training audio samples input to the audio encoder are simultaneously input to the speaker encoder for encoding, extracting the corresponding speaker embedding. The speaker embedding represents features in the audio related to the speaker's voice, such as whether the voice is male or female, the speaker's age, etc. The speaker encoder consists of multiple residual convolutional blocks (Residual Blocks, Res Blocks) as the front end and a time-delay neural network (TDNN) as the backbone. The speaker embedding and the speech embedding output by the aforementioned VITS posterior encoder are fed into a discriminator, which determines whether the speech embedding corresponding to the current audio is related to the speaker embedding. The model is adjusted through adversarial training to gradually reduce the correlation between the speech embedding and the speaker embedding, thereby achieving decoupling from the speaker.
[0132] Step 204: Adjust the parameters of the model to be trained based on the first loss, the second loss, and the third loss of the pre-trained model to obtain the audio generation model.
[0133] The audio generation model trained in this embodiment can be used to generate target audio based on the input audio content text and audio style description, wherein the audio style of the target audio is consistent with the descriptive style of the audio style description.
[0134] In some examples, during the parameter tuning of the pre-trained model based on the first loss and the second loss, the sum of the first loss and the second loss can be calculated, and the parameters of the pre-trained model can be tuned based on the sum of the first loss and the second loss to obtain the audio generation model.
[0135] For example, the first loss and the second loss are, respectively, the aforementioned mutual information upper bound, denoted $L_1$, and the aforementioned cosine-similarity-based loss, denoted $L_2$. The two are combined into $L_{12}$:
[0136] $$L_{12} = \lambda_1 L_1 + \lambda_2 L_2$$
[0137] where the weighting coefficients $\lambda_1$ and $\lambda_2$ control the contributions of $L_1$ and $L_2$.
[0138] Furthermore, the parameters of the pre-trained model can be adjusted based on the sum of the third loss, the first loss, and the second loss from the pre-training stage to obtain an audio generation model.
[0139] For example, on the basis of the above $L_{12}$, the total loss is calculated according to the following formula:
[0140] $$L_{\text{total}} = L_{12} + L_{\text{VITS}}$$
[0141] where $L_{\text{VITS}}$ is the VITS loss, i.e., the third loss from the pre-training stage.
[0142] In this embodiment, during the training process of the audio generation model, an audio sample set is encoded using a pre-trained model to obtain encoded features. The audio sample set includes audio samples, audio recognition text, and audio description text. The audio description text describes the audio style of the audio samples. The encoded features are extracted using the model to be trained to obtain audio features, recognition text features, and description text features. A first loss is determined based on the audio features and recognition text features, and a second loss is determined based on the audio features and description text features. The parameters of the model to be trained are adjusted based on the first loss, the second loss, and the third loss of the pre-trained model to obtain the audio generation model. The audio generation model trained in this way can decouple the text and style parts of the audio as much as possible, thereby enabling the model to more accurately learn the correspondence between style description and audio style.
[0143] Figure 4 is a flowchart illustrating an audio generation method provided in an embodiment of this application. As shown in Figure 4, the method includes steps 401 to 402.
[0144] Step 401: Obtain the audio content text and audio style description.
[0145] Step 402: Based on the audio content text and audio style description, generate the target audio using an audio generation model.
[0146] For example, the audio style of the target audio is consistent with the descriptive style of the audio style description; wherein, the audio generation model is trained using the training method of the audio generation model described in any of the preceding embodiments.
[0147] After training an audio generation model, audio can be generated using this model. For example, audio content text and an audio style description can be obtained. These can be input into the trained audio generation model (at this point, the HuBERT audio encoder is no longer needed), allowing the model to directly output audio with a style consistent with the natural language description.
[0148] The workflow of the audio generation model is shown in Figure 5. The natural language description refers to the descriptive text used to generate the audio, and the text to be synthesized refers to the content text used to generate the audio. The natural language description is encoded by a text encoder to obtain the text encoding, and the obtained text encoding is sent to the query transformer, style adapter, and speech synthesis module to generate the output audio that matches the user's description.
[0149] Corresponding to the aforementioned embodiments of the training method for the audio generation model, this application also provides embodiments of a training apparatus for the audio generation model. Figure 6 shows a task processing apparatus provided in the embodiments of this application. As shown in Figure 6, the task processing device 600 includes an encoding module 601, a feature extraction module 602, a loss calculation module 603, and a parameter adjustment module 604.
[0150] The encoding module 601 is used to encode the audio sample set through a pre-trained model to obtain encoded features; the audio sample set includes audio samples, audio recognition text, and audio description text; the audio description text is used to describe the audio style of the audio samples.
[0151] The feature extraction module 602 is used to extract features from the encoded features through the model to be trained, thereby obtaining audio features, recognition text features, and description text features;
[0152] The loss calculation module 603 is used to determine a first loss based on audio features and recognized text features, and to determine a second loss based on audio features and descriptive text features;
[0153] The parameter adjustment module 604 is used to adjust the parameters of the model to be trained based on the first loss, the second loss, and the third loss of the pre-trained model to obtain the audio generation model.
[0154] In another possible implementation, the feature extraction module 602 is used to encode audio samples using an initial audio encoder to obtain initial audio features, and to encode audio recognition text using a prior encoder of an initial speech synthesis module to obtain initial recognition text features; to perform feature transformation on the initial audio features using an initial style adapter to obtain feature transformation results; to perform speech synthesis using an initial speech synthesis module based on the audio linear spectrum extracted from the audio samples, the feature transformation results, and the initial recognition text features to obtain initial audio; to determine a third loss based on the initial audio and the reference audio, and to adjust the parameters of the initial encoder, the initial style adapter, and the initial speech synthesis module according to the third loss to obtain a pre-trained model containing a pre-trained audio encoder, a pre-trained style adapter, and a pre-trained speech synthesis module.
[0155] In another possible implementation, the model to be trained includes a pre-trained model, a pre-trained text encoder, and multiple query transformers to be trained; wherein each query transformer includes a group attention unit, a cross attention unit, a feature filtering unit, and a normalization unit; the feature extraction module 602 is used to calculate the attention weights of the encoded features through the group attention unit to obtain a first feature sequence; calculate the attention weights of the first feature sequence through the cross attention unit to obtain a second feature sequence; filter the second feature sequence through the feature filtering unit to obtain a third feature sequence; and normalize the third feature sequence through the normalization unit to obtain audio features, recognized text features, and descriptive text features.
[0156] In another possible implementation, the feature extraction module 602 is used to perform vector transformation on the query vector through a grouped attention unit to obtain a first grouped query matrix, a first grouped key matrix, and a first grouped value matrix; wherein the query vector is used to extract semantics related to the text content from the audio features included in the encoded features; the first grouped query matrix, the first grouped key matrix, and the first grouped value matrix are split to obtain a first sub-grouped query matrix, a first sub-grouped key matrix, and a first sub-grouped value matrix, and the first sub-grouped query matrix is used as a first query group, and the first sub-grouped key matrix and the first sub-grouped value matrix are used as a first key-value group; wherein any first key-value group corresponds to multiple first query groups; attention weights are calculated through the first query groups to obtain a first weight matrix, and the first sub-grouped value matrices in the first key-value groups corresponding to the first query groups are weighted based on the first weight matrix to obtain weighted first sub-grouped value matrices; the weighted first sub-grouped value matrices are concatenated to obtain a first feature sequence corresponding to the audio encoded features.
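The grouping described above, in which one key-value group serves several query groups, can be sketched as follows. This is a minimal pure-Python sketch rather than the embodiment's implementation: it assumes the query, key, and value matrices have already been produced by the vector transformation, it uses scaled dot-product softmax weights as an assumed weighting scheme, and the names `split_heads`, `attend`, and `grouped_query_attention` are hypothetical.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(matrix, num_heads):
    """Split every row into num_heads equal chunks; returns one matrix per head."""
    width = len(matrix[0]) // num_heads
    return [[row[h * width:(h + 1) * width] for row in matrix] for h in range(num_heads)]


def attend(q_head, k_head, v_head):
    """Weight the value rows by softmax of scaled query-key dot products."""
    scale = 1.0 / math.sqrt(len(q_head[0]))
    out = []
    for q in q_head:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in k_head])
        out.append([sum(w * v[c] for w, v in zip(weights, v_head))
                    for c in range(len(v_head[0]))])
    return out


def grouped_query_attention(Q, K, V, num_query_heads, num_kv_groups):
    """Each key-value group is shared by num_query_heads // num_kv_groups query heads."""
    assert num_query_heads % num_kv_groups == 0
    q_heads = split_heads(Q, num_query_heads)
    k_groups = split_heads(K, num_kv_groups)
    v_groups = split_heads(V, num_kv_groups)
    per_group = num_query_heads // num_kv_groups
    head_outputs = [
        attend(q_heads[h], k_groups[h // per_group], v_groups[h // per_group])
        for h in range(num_query_heads)
    ]
    # Concatenate the per-head outputs along the feature axis for every position.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(Q))]


# 2 query positions, 4 query heads of width 1; 3 key/value positions, 2 groups.
Q = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
features = grouped_query_attention(Q, K, V, num_query_heads=4, num_kv_groups=2)
```

Because every value row is identical in this toy input, each query head simply returns the value slice of its own group, which makes the sharing pattern (heads 0 and 1 on group 0, heads 2 and 3 on group 1) easy to see in the output.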
[0157] In another possible implementation, the feature extraction module 602 is used to perform vector transformation on the identified text features and / or descriptive text features through a grouped attention unit to obtain a second grouped query matrix, a second grouped key matrix, and a second grouped value matrix; split the second grouped query matrix, the second grouped key matrix, and the second grouped value matrix to obtain a second sub-grouped query matrix, a second sub-grouped key matrix, and a second sub-grouped value matrix, and use the second sub-grouped query matrix as a second query group, and the second sub-grouped key matrix and the second sub-grouped value matrix as a second key-value group; wherein any second key-value group corresponds to multiple second query groups; perform attention weight calculation on the second query groups to obtain a second weight matrix, and perform weight allocation on the second sub-grouped value matrices in the second key-value groups corresponding to the second query groups based on the second weight matrix to obtain weighted second sub-grouped value matrices; and concatenate the weighted second sub-grouped value matrices to obtain a first feature sequence corresponding to the identified text encoding features or the descriptive text encoding features.
[0158] In another possible implementation, the feature extraction module 602 is used to convert the first feature sequence into a cross-query matrix and the encoded features into a cross-key matrix and a cross-value matrix through a cross-attention unit; split the cross-query matrix, cross-key matrix, and cross-value matrix to obtain multiple attention heads; wherein each attention head corresponds to a second sub-cross-query matrix, a second sub-cross-key matrix, and a second sub-cross-value matrix; perform feature filtering based on the attention heads to obtain a masking matrix, and calculate attention weights based on the masking matrix to obtain a third weight matrix; perform weight allocation on the second sub-cross-value matrix based on the third weight matrix to obtain a weighted second sub-cross-value matrix, and concatenate the weighted second sub-cross-value matrices to obtain a second feature sequence corresponding to the audio encoding features, the recognized text encoding features, or the descriptive text encoding features.
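The masking step for a single attention head can be sketched as follows. The text does not specify how the masking matrix is derived, so this sketch assumes a top-k rule: for each query only the `keep_top_k` highest-scoring keys are retained, and the rest receive zero weight. The function name `sparse_masked_attention` and the parameter `keep_top_k` are hypothetical.

```python
import math


def sparse_masked_attention(queries, keys, values, keep_top_k):
    """Single-head attention where each query attends only to its top-k keys.

    keep_top_k must be between 1 and len(keys). Returns the weighted value
    rows and the boolean masking matrix (True = key kept for that query).
    Keys tied with the k-th score are also kept.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, mask_matrix = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        threshold = sorted(scores, reverse=True)[keep_top_k - 1]
        mask = [s >= threshold for s in scores]
        peak = max(scores)
        # Masked-out positions contribute nothing to the softmax.
        exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask)]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
        mask_matrix.append(mask)
    return outputs, mask_matrix


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
sparse_out, sparse_mask = sparse_masked_attention(queries, keys, values, keep_top_k=1)
dense_out, dense_mask = sparse_masked_attention(queries, keys, values, keep_top_k=3)
```

In a multi-head setting the same routine would be applied to each head's sub-matrices and the weighted value matrices concatenated afterwards, as the paragraph above describes.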
[0159] In another possible implementation, the loss calculation module 603 is used to calculate the conditional probability based on the audio features and the recognized text features; calculate the log ratio based on the conditional probability, and sum the log ratios to obtain the ratio sum; and determine the first loss based on the ratio sum and the number of features of the audio features and the recognized text features.
[0160] In another possible implementation, the method further includes: acquiring dialogue audio from multiple users; the dialogue audio includes emotion tags; identifying the first style description text corresponding to the dialogue audio through an audio understanding model; using the dialogue audio as a real audio sample in the audio sample, and using the first style description text as the real audio description text in the audio description text.
[0161] In another possible implementation, the method further includes: acquiring dialogue text containing emotion tags and copying the dialogue audio to obtain copied audio; generating audio based on the dialogue text, dialogue audio, copied audio, and emotion tags to obtain generated audio samples; identifying the second style description text corresponding to the generated audio samples through an audio understanding model, using the generated audio samples as virtual audio samples in the audio samples, and using the second style description text as virtual audio description text in the audio description text.
[0162] In another possible implementation, the method further includes classifying the audio description text according to the text type to obtain the first-level audio description text; classifying the first-level audio description text according to the text subtype of the text type to obtain the second-level audio description text; and generating a positive sample set, a first negative sample set, and a second negative sample set based on the text type, text subtype, first-level audio description text, and second-level audio description text. The positive sample set includes the second-level audio description text under any text subtype of the text type, the first negative sample set includes the second-level audio description text under other text subtypes of any text type, and the second negative sample set includes the first-level audio description text under other text types.
[0163] In another possible implementation, the method further includes determining a positive sample loss based on audio features and the descriptive text features of the secondary audio description text in the positive sample set; determining a first negative sample loss based on audio features and the descriptive text features of the secondary audio description text in the first negative sample set; and determining a second negative sample loss based on audio features and the descriptive text features of the primary audio description text in the second negative sample set; and using the sum of the positive sample loss, the first negative sample loss, and the second negative sample loss as the second loss.
[0164] In another possible implementation, the encoding module 601 is used to encode audio samples using a pre-trained audio encoder of a pre-trained model to obtain audio encoding features, to encode audio recognition text using a pre-trained speech synthesis module of the pre-trained model to obtain recognition text encoding features, and to encode audio description text using a pre-trained text encoder to obtain description text encoding features.
[0165] In another possible implementation, the feature extraction module 602 is used to extract audio encoding features through the first query transformer of the model to be trained to obtain audio features, extract recognition text encoding features through the second query transformer of the model to be trained to obtain recognized text features, and extract description text encoding features through the third query transformer of the model to be trained to obtain descriptive text features.
[0166] The beneficial technical effects corresponding to the above-described exemplary embodiment of the task processing device 600 can be found in the corresponding beneficial technical effects in the above-described method embodiment section, and will not be repeated here.
[0167] Corresponding to the aforementioned embodiments of the training method for the audio generation model, this application also provides embodiments of the audio generation apparatus. Figure 7 shows an audio generation apparatus provided in the embodiments of this application. As shown in Figure 7, the audio generation device 700 includes a text acquisition module 701 and an audio generation module 702.
[0168] The text acquisition module 701 is configured to acquire audio content text and audio style description;
[0169] The audio generation module 702 is configured to generate audio based on the audio content text and the audio style description, and obtain the target audio through the audio generation model; the audio style of the target audio is consistent with the descriptive style of the audio style description; wherein, the audio generation model is trained using the training method of any of the above-mentioned audio generation models.
[0170] The beneficial technical effects corresponding to the exemplary embodiment of the audio generation apparatus 700 described above can be found in the corresponding beneficial technical effects in the above method embodiment section, and will not be repeated here.
[0171] Figure 8 is a schematic diagram of an electronic device provided in some embodiments of this application. In some embodiments, the electronic device may be a server, a terminal device, etc. The electronic device includes a multi-core processor and a memory. The multi-core processor includes multiple processor cores; the memory is configured to store one or more programs. When the one or more programs are executed by the multi-core processor, the multi-core processor implements the training method for the audio generation model or the audio generation method in the above embodiments.
[0172] As shown in Figure 8, the electronic device 800 includes a multi-core processor 801 and a memory 802. Exemplarily, the electronic device 800 may also include a communication interface 803 and a communication bus 804.
[0173] The multi-core processor 801, memory 802, and communication interface 803 communicate with each other via communication bus 804. Communication interface 803 is used to communicate with other network elements such as clients or other servers.
[0174] In some examples, the multi-core processor 801 is used to execute program 805, specifically performing the relevant steps in the above-described embodiments of the training method for the audio generation model or the audio generation method. Specifically, program 805 may include program code, which includes computer-executable instructions.
[0175] For example, the multi-core processor 801 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement some embodiments of this application. The electronic device 800 may include multi-core processors of the same type, such as one or more CPUs; or multi-core processors of different types, such as one or more CPUs and one or more ASICs.
[0176] In some examples, memory 802 is used to store program 805. Memory 802 may include high-speed RAM memory, and may also include non-volatile memory (NVM), such as at least one disk storage device.
[0177] Specifically, program 805 can be called by the multi-core processor 801 to enable the electronic device 800 to execute the training method for the audio generation model or the audio generation method.
[0178] Some embodiments of this application provide a computer-readable storage medium storing at least one executable instruction that, when executed on an electronic device 800, causes the electronic device 800 to perform the training method for the audio generation model or the audio generation method described above.
[0179] Specifically, the executable instructions can be used to enable the electronic device 800 to perform the training method for the audio generation model or the audio generation method.
[0180] For example, the computer-readable storage medium can be read-only memory (ROM), random access memory (RAM), compact disc read-only memory (CD-ROM), magnetic tape, floppy disk, and optical data storage device, etc.
[0181] For the beneficial effects that the readable storage medium provided in some embodiments of this application can achieve, refer to the beneficial effects of the corresponding training method for the audio generation model or audio generation method provided above; they are not repeated here.
[0182] In addition to the methods, apparatus, and devices described above, embodiments of this application may also provide a computer program product, including computer program instructions, which, when executed by a processor, cause the processor to perform the steps in the training methods of the audio generation models of the various embodiments of this application described in the above method embodiment section.
[0183] Computer program products can be written in any combination of one or more programming languages to perform the operations of the embodiments of this application. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on the user's computing device, partially on the user's computing device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.
[0184] Furthermore, embodiments of this application may also be computer-readable storage media storing computer program instructions thereon, which, when executed by a processor, cause the processor to perform the steps in the training methods of the audio generation models of the various embodiments of this application described in the above-described method embodiment section.
[0185] Computer-readable storage media may take the form of any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, but is not limited to, systems, apparatuses, or devices that are electrical, magnetic, optical, electromagnetic, infrared, or semiconductor, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0186] The basic principles of this application have been described above with reference to specific embodiments. However, the advantages, benefits, and effects mentioned in this application are merely examples and not limitations, and should not be considered as essential features of each embodiment of this application. Furthermore, the specific details of the above embodiments are for illustrative and facilitative purposes only, and are not limitations. These details do not restrict this application from being implemented using the aforementioned specific details.
[0187] Those skilled in the art can make various modifications and variations to this application without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims and their equivalents, this application also intends to include these modifications and variations. Furthermore, the above embodiments are merely specific embodiments of this application and are not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, improvements, etc., made based on the technical solution of this application should be included within the scope of protection of this application.
Claims
1. A training method for an audio generation model, characterized in that, The method includes: The audio sample set is encoded using a pre-trained model to obtain encoded features; the audio sample set includes audio samples, audio recognition text, and audio description text; the audio description text is used to describe the audio style of the audio samples. Features are extracted from the encoded features using the model to be trained to obtain audio features, recognized text features, and descriptive text features. A first loss is determined based on the audio features and the recognized text features, and a second loss is determined based on the audio features and the descriptive text features; The parameters of the model to be trained are adjusted based on the first loss, the second loss, and the third loss of the pre-trained model to obtain an audio generation model; wherein the third loss is determined based on the initial audio and the benchmark audio. The step of determining the first loss based on the audio features and the recognized text features includes: Calculate the conditional probability based on the audio features and the recognized text features; Calculate the logarithmic ratio based on the conditional probability, and sum the logarithmic ratios to obtain the ratio sum; The first loss is determined based on the ratio sum and the number of features of the audio features and the recognized text features; The step of determining the second loss based on the audio features and the descriptive text features includes: The positive sample loss is determined based on the audio features and the descriptive text features of the secondary audio description text in the positive sample set; the first negative sample loss is determined based on the audio features and the descriptive text features of the secondary audio description text in the first negative sample set; and the second negative sample loss is determined based on the audio features and the descriptive text features of the primary audio description text in the second negative sample set. The sum of the positive sample loss, the first negative sample loss, and the second negative sample loss is taken as the second loss.
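The two loss computations recited in claim 1 can be illustrated with a minimal sketch. The claim does not fix the formulas, so everything concrete here is an assumption made for illustration: the conditional probability is modelled as a softmax over dot-product similarities, the first loss is the ratio sum divided by the squared number of features (similar in form to a sampled log-ratio estimate of mutual information), and the second loss uses cosine similarity with hypothetical margins `near_margin` and `far_margin`. The function names are also hypothetical.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def log_softmax(scores):
    # Numerically stable log-softmax over a list of scores.
    top = max(scores)
    log_total = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_total for s in scores]

def first_loss(audio_feats, text_feats):
    """Assumed form: (1/N^2) * sum_i sum_j [log p(t_i|a_i) - log p(t_j|a_i)]."""
    n = len(audio_feats)
    ratio_sum = 0.0
    for i, audio in enumerate(audio_feats):
        # Assumed conditional probability p(t_j | a_i): softmax of dot products.
        log_p = log_softmax([dot(audio, text) for text in text_feats])
        # Log ratio of the paired text feature against every text feature.
        ratio_sum += sum(log_p[i] - log_p[j] for j in range(n))
    return ratio_sum / (n * n)

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def second_loss(audio, positives, first_negs, second_negs,
                near_margin=0.5, far_margin=0.0):
    """Sum of a positive loss and two negative losses (margins are assumptions)."""
    # Positive set: pull the audio feature toward matching descriptions.
    pos = sum(1.0 - cosine(audio, p) for p in positives) / len(positives)
    # First negative set (same emotion tag, other sub-tag): looser margin.
    neg1 = sum(max(0.0, cosine(audio, n) - near_margin)
               for n in first_negs) / len(first_negs)
    # Second negative set (other emotion tags): stricter margin.
    neg2 = sum(max(0.0, cosine(audio, n) - far_margin)
               for n in second_negs) / len(second_negs)
    return pos + neg1 + neg2

loss_1 = first_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
loss_2 = second_loss([1.0, 0.0], [[1.0, 0.0]], [[1.0, 1.0]], [[0.0, 1.0]])
```

Under this reading, the first loss falls to zero when the text features carry no information about which audio feature they are paired with, which is the direction a decoupling objective would push. The two margins merely encode the idea that descriptions under a sibling sub-tag may stay closer to the audio feature than descriptions under a different emotion tag.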
2. The training method for the audio generation model according to claim 1, characterized in that, The method further includes: The audio samples are encoded by an initial audio encoder to obtain initial audio features, and the audio recognition text is encoded by a prior encoder of the initial speech synthesis module to obtain initial recognition text features. The initial audio features are transformed using an initial style adapter to obtain the feature transformation result; Based on the audio linear spectrum extracted from the audio samples, the feature transformation results, and the initial recognized text features, speech synthesis is performed through the initial speech synthesis module to obtain the initial audio. The third loss is determined based on the initial audio and the reference audio, and the parameters of the initial audio encoder, the initial style adapter, and the initial speech synthesis module are adjusted according to the third loss to obtain a pre-trained model containing a pre-trained audio encoder, a pre-trained style adapter, and a pre-trained speech synthesis module.
3. The training method for the audio generation model according to claim 1, characterized in that, The model to be trained includes the pre-trained model, the pre-trained text encoder, and multiple query transformers to be trained; wherein, any query transformer includes a grouped attention unit, a cross-attention unit, a feature filtering unit, and a normalization unit; Accordingly, the step of extracting features from the encoded features using the model to be trained to obtain audio features, recognized text features, and descriptive text features includes: The first feature sequence is obtained by calculating the attention weights of the encoded features through the grouped attention unit; The second feature sequence is obtained by calculating the attention weights of the first feature sequence through the cross-attention unit; The second feature sequence is filtered by the feature filtering unit to obtain the third feature sequence; The normalization unit performs normalization processing on the third feature sequence to obtain the audio features, the recognized text features, and the descriptive text features.
4. The training method for the audio generation model according to claim 3, characterized in that, The step of calculating attention weights for the encoded features using the grouped attention unit to obtain the first feature sequence includes: The query vector is transformed using the grouped attention unit to obtain a first group query matrix, a first group key matrix, and a first group value matrix; wherein the query vector is used to extract semantics related to the text content from the audio features included in the encoded features; The first group query matrix, the first group key matrix, and the first group value matrix are split to obtain a first subgroup query matrix, a first subgroup key matrix, and a first subgroup value matrix. The first subgroup query matrix is taken as a first query group, and the first subgroup key matrix and the first subgroup value matrix are taken as a first key-value group. Each first key-value group corresponds to multiple first query groups. Attention weights are calculated using the first query group to obtain a first weight matrix. Based on the first weight matrix, weights are assigned to the first subgroup value matrices in the first key-value group corresponding to the first query group to obtain the weighted first subgroup value matrices. The weighted first subgroup value matrices are concatenated to obtain the first feature sequence corresponding to the audio encoding features.
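The grouped attention steps of claim 4 (transform, split into subgroups, share each key-value group among several query groups, weight, concatenate) can be sketched as follows. This is an illustrative sketch only: the projection matrices `wq`, `wk`, `wv`, the group counts, and the use of scaled dot-product scores with a softmax are assumptions, since the claim does not state how the attention weights are computed.

```python
import math

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def split_cols(mat, n_groups):
    # Split a matrix column-wise into n_groups equal sub-matrices.
    width = len(mat[0]) // n_groups
    return [[row[g * width:(g + 1) * width] for row in mat]
            for g in range(n_groups)]

def grouped_attention(x, wq, wk, wv, n_query_groups, n_kv_groups):
    # Transform the input into group query, key, and value matrices.
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    # Split into subgroup matrices; there are fewer key-value groups than
    # query groups, so each key-value group serves several query groups.
    q_groups = split_cols(q, n_query_groups)
    k_groups = split_cols(k, n_kv_groups)
    v_groups = split_cols(v, n_kv_groups)
    share = n_query_groups // n_kv_groups
    outputs = []
    for g, qg in enumerate(q_groups):
        kg, vg = k_groups[g // share], v_groups[g // share]
        scale = math.sqrt(len(qg[0]))
        # Weight matrix: softmax over scaled dot-product scores (assumed).
        weights = [softmax([sum(a * b for a, b in zip(qr, kr)) / scale
                            for kr in kg]) for qr in qg]
        # Assign the weights to the shared subgroup value matrix.
        outputs.append(matmul(weights, vg))
    # Concatenate the weighted subgroup value matrices along the feature axis.
    return [sum((out[t] for out in outputs), []) for t in range(len(x))]

# Two identical time steps, four query groups sharing two key-value groups.
x = [[1.0, 2.0], [1.0, 2.0]]
wq = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
wk = [[1.0, 0.0], [0.0, 1.0]]
wv = [[1.0, 0.0], [0.0, 1.0]]
first_sequence = grouped_attention(x, wq, wk, wv, 4, 2)
```

Sharing one key-value group among several query groups reduces the number of key and value projections relative to ordinary multi-head attention, which is the usual motivation for this layout.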
5. The training method for the audio generation model according to claim 3, characterized in that, The step of calculating attention weights for the encoded features using the grouped attention unit to obtain the first feature sequence includes: The grouped attention unit performs vector transformation on the recognized text features and/or the descriptive text features to obtain a second group query matrix, a second group key matrix, and a second group value matrix. The second group query matrix, the second group key matrix, and the second group value matrix are split to obtain a second subgroup query matrix, a second subgroup key matrix, and a second subgroup value matrix. The second subgroup query matrix is taken as a second query group, and the second subgroup key matrix and the second subgroup value matrix are taken as a second key-value group. Each second key-value group corresponds to multiple second query groups. Attention weights are calculated using the second query group to obtain a second weight matrix. Based on the second weight matrix, weights are assigned to the second subgroup value matrices in the second key-value group corresponding to the second query group to obtain the weighted second subgroup value matrices. The weighted second subgroup value matrices are concatenated to obtain the first feature sequence corresponding to the recognition text encoding features or the description text encoding features.
6. The training method for the audio generation model according to claim 5, characterized in that, The step of calculating attention weights on the first feature sequence through the cross-attention unit to obtain the second feature sequence includes: The first feature sequence is converted into a cross-query matrix through the cross-attention unit, and the encoded features are converted into a cross-key matrix and a cross-value matrix. The cross query matrix, the cross key matrix, and the cross value matrix are split to obtain multiple attention heads; wherein each attention head corresponds to a second sub-cross query matrix, a second sub-cross key matrix, and a second sub-cross value matrix. Feature filtering is performed based on the attention head to obtain a masking matrix, and attention weights are calculated based on the masking matrix to obtain a third weight matrix; The second sub-cross value matrix is weighted based on the third weight matrix to obtain the weighted second sub-cross value matrix. The weighted second sub-cross value matrix is then concatenated to obtain the second feature sequence corresponding to the audio coding feature, the recognized text coding feature, or the descriptive text coding feature.
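The masked cross-attention of claim 6 can be sketched as below. The claim does not say how the masking matrix is derived from each attention head, so the top-k rule in `topk_mask` is a hypothetical filtering criterion chosen only for illustration; the inputs are also assumed to be already projected into cross-query, cross-key, and cross-value matrices.

```python
import math

def split_cols(mat, n_heads):
    # Split a matrix column-wise into one sub-matrix per attention head.
    width = len(mat[0]) // n_heads
    return [[row[h * width:(h + 1) * width] for row in mat]
            for h in range(n_heads)]

def topk_mask(scores, k):
    # Hypothetical filtering rule: keep the k highest-scoring positions.
    # Ties at the threshold may keep more than k positions.
    threshold = sorted(scores, reverse=True)[k - 1]
    return [s >= threshold for s in scores]

def masked_softmax(scores, keep):
    # Masked positions get -inf, so they receive zero attention weight.
    masked = [s if m else float("-inf") for s, m in zip(scores, keep)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

def masked_cross_attention(q, k, v, n_heads, top_k):
    head_outputs = []
    for qh, kh, vh in zip(split_cols(q, n_heads),
                          split_cols(k, n_heads),
                          split_cols(v, n_heads)):
        scale = math.sqrt(len(qh[0]))
        rows = []
        for q_row in qh:
            scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale
                      for k_row in kh]
            # Masking matrix row, then the weight matrix row.
            weights = masked_softmax(scores, topk_mask(scores, top_k))
            # Weight the sub-cross value matrix.
            rows.append([sum(w * v_row[c] for w, v_row in zip(weights, vh))
                         for c in range(len(vh[0]))])
        head_outputs.append(rows)
    # Concatenate the per-head results along the feature axis.
    return [sum((head[t] for head in head_outputs), [])
            for t in range(len(q))]

# One query position, three key/value positions, two heads of width 2.
q = [[1.0, 0.0, 0.0, 1.0]]
k = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [-1.0, 0.0, 0.0, -1.0]]
v = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
second_sequence = masked_cross_attention(q, k, v, n_heads=2, top_k=1)
```

With `top_k=1` each head attends to a single position, so the first head copies its slice of the first value row and the second head copies its slice of the second value row. Larger values of `top_k` blend the retained positions by their softmax weights.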
7. The training method for the audio generation model according to claim 1, characterized in that, The method further includes: Acquire dialogue audio of conversations between multiple users; the dialogue audio includes emotion tags. The first style description text corresponding to the dialogue audio is identified by the audio understanding model; The dialogue audio is used as the real audio sample in the audio samples, and the first style description text is used as the real audio description text in the audio description text.
8. The training method for the audio generation model according to claim 7, characterized in that, The method further includes: Obtain the dialogue text containing emotion tags, and copy the dialogue audio to obtain the copied audio; Audio is generated based on the dialogue text, the dialogue audio, the copied audio, and the emotion tag to obtain a generated audio sample; The second style description text corresponding to the generated audio sample is identified by the audio understanding model, and the generated audio sample is used as a virtual audio sample in the audio sample, and the second style description text is used as the virtual audio description text in the audio description text.
9. The training method for the audio generation model according to claim 8, characterized in that, The method further includes: The audio description text is classified according to the emotion tags to obtain the primary audio description text; The primary audio description text is classified according to the emotion sub-tags under the emotion tag to obtain the secondary audio description text; The positive sample set, the first negative sample set, and the second negative sample set are generated based on the emotion tag, the emotion sub-tag, the primary audio description text, and the secondary audio description text. The positive sample set includes secondary audio description text under an emotion sub-tag of any given emotion tag, the first negative sample set includes secondary audio description text under the other emotion sub-tags of that emotion tag, and the second negative sample set includes primary audio description text under the other emotion tags.
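The construction of the three sample sets in claim 9 can be sketched as a partition of the description texts by emotion tag and emotion sub-tag. The record layout (`tag`, `sub_tag`, `text`), the function name, and the example tags and descriptions are all hypothetical and serve only to illustrate the grouping.

```python
def build_sample_sets(descriptions, anchor_tag, anchor_sub_tag):
    """Partition description texts relative to one (tag, sub-tag) anchor."""
    positive, first_negative, second_negative = [], [], []
    for item in descriptions:
        if item["tag"] == anchor_tag and item["sub_tag"] == anchor_sub_tag:
            # Same emotion tag and same sub-tag: positive sample.
            positive.append(item["text"])
        elif item["tag"] == anchor_tag:
            # Same emotion tag, different sub-tag: first negative sample.
            first_negative.append(item["text"])
        else:
            # Different emotion tag: second negative sample.
            second_negative.append(item["text"])
    return positive, first_negative, second_negative

# Hypothetical description texts with a two-level emotion hierarchy.
descriptions = [
    {"tag": "happy", "sub_tag": "excited", "text": "fast, bright, energetic"},
    {"tag": "happy", "sub_tag": "content", "text": "warm, relaxed, steady"},
    {"tag": "sad", "sub_tag": "grieving", "text": "slow, low, trembling"},
]
positive_set, first_negative_set, second_negative_set = build_sample_sets(
    descriptions, anchor_tag="happy", anchor_sub_tag="excited")
```

The first negative set acts as a set of hard negatives, since its descriptions share the anchor's emotion tag and differ only at the sub-tag level, while the second negative set differs already at the top-level tag.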
10. The training method for the audio generation model according to claim 1, characterized in that, The process of encoding the audio sample set using a pre-trained model to obtain encoded features includes: The audio samples are encoded using the pre-trained audio encoder of the pre-trained model to obtain audio encoding features. The audio description text is encoded using the pre-trained text encoder to obtain description text encoding features. The audio recognition text is encoded using the pre-trained speech synthesis module of the pre-trained model to obtain recognition text encoding features.
11. The training method for the audio generation model according to claim 10, characterized in that, The step of extracting features from the encoded features using the model to be trained to obtain audio features, recognized text features, and descriptive text features includes: Features are extracted from the audio encoding features using the first query transformer of the model to be trained to obtain the audio features. Features are extracted from the description text encoding features using the second query transformer of the model to be trained to obtain the descriptive text features. Features are extracted from the recognition text encoding features using the third query transformer of the model to be trained to obtain the recognized text features.
12. An audio generation method, characterized in that, The method includes: Retrieve audio content text and audio style description; Based on the audio content text and the audio style description, audio is generated using an audio generation model to obtain the target audio; the audio style of the target audio is consistent with the descriptive style of the audio style description. The audio generation model is trained using the training method for the audio generation model according to any one of claims 1-11.
13. A training device for an audio generation model, characterized in that, The device includes: The encoding module is used to encode the audio sample set through a pre-trained model to obtain encoding features; the audio sample set includes audio samples, audio recognition text, and audio description text; the audio description text is used to describe the audio style of the audio samples; The feature extraction module is used to extract features from the encoded features through the model to be trained, thereby obtaining audio features, recognition text features, and description text features; The loss calculation module is used to determine a first loss based on the audio features and the recognized text features, and to determine a second loss based on the audio features and the descriptive text features; The parameter adjustment module is used to adjust the parameters of the model to be trained based on the first loss, the second loss, and the third loss of the pre-trained model to obtain an audio generation model; wherein the third loss is determined based on the initial audio and the benchmark audio. The loss calculation module is specifically used to calculate the conditional probability based on the audio features and the recognized text features; calculate the log ratio based on the conditional probability, and sum the log ratios to obtain the ratio sum; and determine the first loss based on the ratio sum and the number of features of the audio features and the recognized text features. The loss calculation module is specifically used to determine the positive sample loss based on the audio features and the descriptive text features of the secondary audio description text in the positive sample set, to determine the first negative sample loss based on the audio features and the descriptive text features of the secondary audio description text in the first negative sample set, and to determine the second negative sample loss based on the audio features and the descriptive text features of the primary audio description text in the second negative sample set; and to use the sum of the positive sample loss, the first negative sample loss and the second negative sample loss as the second loss.
14. An audio generation apparatus, characterized in that, The device includes: The text acquisition module is configured to acquire audio content text and audio style description; An audio generation module is configured to generate audio based on the audio content text and the audio style description using an audio generation model to obtain target audio; the audio style of the target audio is consistent with the descriptive style of the audio style description; wherein the audio generation model is trained using the training method of the audio generation model according to any one of claims 1-11.
15. An electronic device, characterized in that, It includes: One or more processors; and a memory configured to store one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the training method of the audio generation model according to any one of claims 1-11, or implement the audio generation method according to claim 12.
16. A computer-readable storage medium, characterized in that, It stores a computer program that, when executed by a processor, implements the training method of the audio generation model according to any one of claims 1-11, or implements the audio generation method according to claim 12.