Dialogue generation method and apparatus, device, medium, and product
By extracting the emotional, prosodic, and acoustic features of user speech and combining them with audio and text context data from multi-turn dialogues, a voice response adapted to the current context is generated. This solves the problems of emotional discontinuity and semantic bias in human-computer interaction in existing technologies, and achieves more humanized emotional feedback and voice dialogue effects.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING XIAOMI MOBILE SOFTWARE CO LTD
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-23
Figure CN122266352A_ABST
Abstract
Description
Technical Field
[0001] This specification relates to the field of artificial intelligence speech technology, specifically to a dialogue generation method, apparatus, device, medium, and product.
Background Technology
[0002] With the development of artificial intelligence (AI) technology, intelligent voice dialogue systems are widely used in scenarios such as in-vehicle voice assistants, smart speakers, and virtual human-computer interaction. Among them, multi-turn voice dialogue is the core interaction form of intelligent voice dialogue systems.
[0003] Currently, intelligent voice assistants have gradually evolved from simple command-based question-and-answer dialogues to a more human-like chat-style interaction mode. User needs are no longer limited to basic demands such as information retrieval and function response, but now extend to higher-level experience requirements such as emotional guidance and emotional companionship. Users expect voice replies to fit the communication scenario and their own state, achieving a more natural and warmer human-computer voice interaction.
Summary of the Invention
[0004] To improve the voice quality of human-computer dialogue, this specification provides a dialogue generation method and apparatus, electronic device, storage medium, and computer program product.
[0005] In a first aspect, the embodiments of this specification provide a dialogue generation method, including: in the current round of a multi-turn voice dialogue, generating the first text data to be replied to; acquiring user input data of a preset round dialogue, where the user input data includes the original audio data input by the user and the input text data obtained by converting the original audio data into text, and the preset round dialogue includes the current round and several rounds of dialogue before it; extracting features from the user input data of the preset round dialogue to obtain target speech features, and generating target speech data based on the target speech features and the first text data; and playing the target speech data.
[0006] In the embodiments described in this specification, user voice features are extracted by combining the original audio data of multi-turn dialogues, thereby accurately analyzing the current user's emotions, speech rhythm, acoustic environment and other features, preserving the refined paralinguistic information that cannot be represented by text semantics, generating voice responses that are adapted to the current context, providing users with more humanized emotional feedback, and thus improving the effect of voice dialogue.
[0007] In some possible implementations, in the current round of a multi-turn voice dialogue, first text data to be replied to is generated, including: In the current round of a multi-round voice dialogue, the first text data to be replied to in the current round is generated based on the input text data of the current round and the previous rounds of dialogue.
[0008] The first text data is predicted based on pre-set multi-turn input text data. It can accurately capture the semantic logic of multi-turn dialogue by combining text context, ensuring semantic coherence and contextual relevance. Moreover, the model has a single input dimension, high computational efficiency, and is suitable for electronic devices with low computing power or high response speed requirements.
[0009] In some possible implementations, in the current round of a multi-round voice dialogue, the first text data to be replied to in the current round is generated based on the user input data of the preset round dialogue.
[0010] The first text data is predicted by combining the input text data from multiple rounds and the original audio data. By fusing audio context information, non-semantic paralinguistic information can be extracted, so that the generated data content not only conforms to semantic logic, but also adapts to the current dialogue context, providing a data foundation for the subsequent accurate extraction of speech features.
[0011] In some possible implementations, the target speech features include the user's emotional features, prosodic features, and acoustic environment features.
[0012] In some possible implementations, performing feature extraction on the user input data of the preset round dialogue to obtain the target speech features, and generating the target speech data based on the target speech features and the first text data, includes: performing feature encoding on the user input data of the preset round dialogue and on the first text data to obtain feature sequences, and concatenating the encoded feature sequences to obtain an input feature sequence; and inputting the input feature sequence into a pre-trained dialogue model, which extracts features from the input feature sequence to obtain the target speech features and predicts the target speech data based on the target speech features.
[0013] In some possible implementations, performing feature encoding on the user input data of the preset round dialogue and on the first text data to obtain feature sequences, and concatenating the encoded feature sequences to obtain the input feature sequence, includes: encoding, by an audio encoding module, the original audio data included in the user input data to obtain an audio feature sequence; encoding, by a text encoding module, the input text data included in the user input data together with the first text data to obtain a text feature sequence; and concatenating, by a sequence concatenation module, the audio feature sequence and the text feature sequence to obtain the input feature sequence.
[0014] By encoding and concatenating the text and audio data of the pre-set round dialogue, the orderly fusion of audio and text features is achieved. The dialogue model can accurately extract the target speech features that integrate text semantics and audio paralinguistic features, accurately capture key features such as user emotions and acoustic environment, make the target speech data output by the model more in line with the current context, and reduce model complexity and training cost.
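The encoding and concatenation described above can be illustrated with a minimal sketch. The specification does not disclose the encoders, so the toy `encode_audio` and `encode_text` functions, the shared dimension `DIM`, and the audio-before-text ordering are all assumptions made for illustration; a real system would use learned encoders.

```python
from typing import List

Vector = List[float]
DIM = 4  # hypothetical shared feature dimension for both modalities


def encode_audio(samples: List[float], frame_size: int = 4) -> List[Vector]:
    """Toy audio encoder: one vector per frame (mean, peak, energy, frame length)."""
    features = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        mean = sum(frame) / len(frame)
        peak = max(abs(s) for s in frame)
        energy = sum(s * s for s in frame) / len(frame)
        features.append([mean, peak, energy, float(len(frame))])
    return features


def encode_text(text: str) -> List[Vector]:
    """Toy text encoder: one vector per whitespace-separated token."""
    features = []
    for token in text.split():
        codes = [ord(c) for c in token]
        features.append([float(len(token)), float(codes[0]),
                         float(codes[-1]), sum(codes) / len(codes)])
    return features


def concatenate(audio_seq: List[Vector], text_seq: List[Vector]) -> List[Vector]:
    """Sequence concatenation: audio features first, then text features (assumed order)."""
    if any(len(v) != DIM for v in audio_seq + text_seq):
        raise ValueError("all feature vectors must share the same dimension")
    return audio_seq + text_seq


audio_features = encode_audio([0.0, 0.1, -0.2, 0.3, 0.05, -0.05])
text_features = encode_text("weather in Nanjing tomorrow")
input_sequence = concatenate(audio_features, text_features)
```

Concatenating along the sequence axis, rather than merging the vectors element-wise, keeps each modality's features intact so that a downstream model can attend to either part.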
[0015] In some possible implementations, inputting the input feature sequence into the pre-trained dialogue model, which extracts features from the input feature sequence to obtain the target speech features and predicts the target speech data based on the target speech features, includes: inputting the input feature sequence into the pre-trained dialogue model, which extracts features from the input feature sequence to obtain the target speech features and second text data; and predicting the target speech data based on the second text data and the target speech features.
[0016] The dialogue model simultaneously extracts the target speech features and the second text data from the input feature sequence, and uses the second text data to guide the generation of the target speech data. The second text data forms a secondary constraint and guidance at the semantic level for speech generation. This not only ensures that the target speech features fit the user's emotions and acoustic environment, but also keeps the generated target speech data semantically coherent and avoids semantic deviation.
[0017] In some possible implementations, the training process of the dialogue model includes: obtaining a sample dataset that includes multiple pieces of sample data, each of which includes the text data and speech data of a multi-turn dialogue; for each piece of sample data, inputting the text data and speech data of the preset round dialogue into the dialogue model to be trained to obtain the predicted text of the current round output by the dialogue model; adjusting the network parameters of the dialogue model based on the loss between the predicted text and the text label of the current round included in the sample data, to obtain an intermediate model; inputting the text data and speech data of the preset round dialogue into the intermediate model to obtain the predicted text and predicted audio of the current round output by the intermediate model; and adjusting the network parameters of the dialogue model based on a first loss between the predicted text and the text label and a second loss between the predicted audio and the speech data of the preset round dialogue, until the model converges, to obtain the trained dialogue model.
[0018] In the implementation method described in this specification, a two-stage progressive model training process is employed. First, based on text prediction loss, the model fully understands the speech input and establishes a relational representation between audio and text, improving the model's semantic understanding ability and ensuring logical consistency in multi-turn dialogues. Then, the text and audio prediction losses are jointly calculated, allowing the model to learn speech generation capabilities while retaining strong semantic understanding capabilities. This enables the model to accurately capture the semantic relationship between speech and text, generate context-appropriate response speech, and thus improve the effectiveness of voice dialogue.
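The two-stage schedule can be sketched with a deliberately tiny stand-in model. Everything in this example is an assumption made for illustration: the scalar parameters `w_text` and `w_audio` replace the real network, squared error replaces the unspecified loss functions, and the two losses are simply added in the second stage.

```python
from typing import Dict, List, Tuple

# (input, text label, audio label); scalars stand in for real tensors.
Sample = Tuple[float, float, float]
Params = Dict[str, float]


def train_stage(params: Params, data: List[Sample], use_audio_loss: bool,
                lr: float = 0.05, epochs: int = 200) -> Params:
    """Run gradient descent on the text loss, optionally adding the audio loss."""
    p = dict(params)
    for _ in range(epochs):
        for x, y_text, y_audio in data:
            # First loss: squared error between predicted text and text label.
            text_error = p["w_text"] * x - y_text
            p["w_text"] -= lr * 2.0 * text_error * x
            if use_audio_loss:
                # Second loss: squared error between predicted audio and audio label.
                audio_error = p["w_audio"] * x - y_audio
                p["w_audio"] -= lr * 2.0 * audio_error * x
    return p


data = [(1.0, 3.0, -2.0), (2.0, 6.0, -4.0)]
initial = {"w_text": 0.0, "w_audio": 0.0}

# Stage 1: text prediction loss only, yielding the intermediate model.
intermediate = train_stage(initial, data, use_audio_loss=False)
# Stage 2: text and audio losses together, starting from the intermediate model.
trained = train_stage(intermediate, data, use_audio_loss=True)
```

In this sketch the audio parameter is untouched during the first stage, and the text parameter stays at its fitted value during the second, which mirrors the stated goal of adding speech generation without losing semantic understanding.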
[0019] In some possible implementations, playing the target voice data includes: The target speech data is decoded to obtain the target audio, and the target audio is waveform converted to obtain an audio waveform signal. The speech is then played according to the audio waveform signal.
[0020] By using audio decoding and vocoder audio restoration, an end-to-end conversion from target speech data generated by the model to speech playback is achieved, preserving the emotion, rhythm and other features of the target speech data, and improving the human-computer interaction experience.
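As a rough illustration of the last step only, the sketch below packs waveform samples into a 16-bit PCM WAV container that an audio player could consume. The decoder and vocoder are outside its scope; the sine tone is a synthetic placeholder for vocoder output, and the function name `to_wav_bytes` and the 16 kHz sample rate are assumptions.

```python
import io
import math
import struct
import wave
from typing import List


def to_wav_bytes(samples: List[float], sample_rate: int = 16000) -> bytes:
    """Pack float samples in [-1.0, 1.0] into a mono 16-bit PCM WAV file."""
    pcm = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against out-of-range samples
        pcm += struct.pack("<h", int(clipped * 32767))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buffer.getvalue()


# 0.1 s of a 220 Hz tone standing in for the vocoder's waveform output.
tone = [0.3 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
wav_bytes = to_wav_bytes(tone)
```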
[0021] In a second aspect, embodiments of this specification provide a dialogue generation apparatus, including: a text generation module configured to generate, in the current round of a multi-round voice dialogue, the first text data to be replied to; a data acquisition module configured to acquire user input data of a preset round dialogue, where the user input data includes the original audio data input by the user and the input text data obtained by converting the original audio data into text, and the preset round dialogue includes the current round and several rounds of dialogue before it; a dialogue model module configured to extract target speech features from the user input data of the preset round dialogue and to generate target speech data based on the target speech features and the first text data; and a voice output module configured to play the target speech data.
[0022] In some possible implementations, the text generation module is configured as follows: In the current round of a multi-round voice dialogue, the first text data to be replied to in the current round is generated based on the input text data of the current round and the previous several rounds of dialogue; or, In the current round of a multi-round voice dialogue, the first text data to be replied to in the current round is generated based on the user input data of the preset round dialogue.
[0023] In some possible implementations, the target speech features include the user's emotional features, prosodic features, and acoustic environment features.
[0024] In some possible implementations, the dialogue model module is configured to: perform feature encoding on the user input data of the preset round dialogue and on the first text data to obtain feature sequences, and concatenate the encoded feature sequences to obtain an input feature sequence; and input the input feature sequence into a pre-trained dialogue model, which extracts features from the input feature sequence to obtain the target speech features and predicts the target speech data based on the target speech features.
[0025] In some possible implementations, the dialogue model module is configured to: encode, by an audio encoding module, the original audio data included in the user input data to obtain an audio feature sequence; encode, by a text encoding module, the input text data included in the user input data together with the first text data to obtain a text feature sequence; and concatenate, by a sequence concatenation module, the audio feature sequence and the text feature sequence to obtain the input feature sequence.
[0026] In some possible implementations, the dialogue model module is configured to: input the input feature sequence into the pre-trained dialogue model, which extracts features from the input feature sequence to obtain the target speech features and second text data; and predict the target speech data based on the second text data and the target speech features.
[0027] In some possible implementations, the apparatus further includes a training module configured to: obtain a sample dataset that includes multiple pieces of sample data, each of which includes the text data and speech data of a multi-turn dialogue; for each piece of sample data, input the text data and speech data of the preset round dialogue into the dialogue model to be trained to obtain the predicted text of the current round output by the dialogue model; adjust the network parameters of the dialogue model based on the loss between the predicted text and the text label of the current round included in the sample data, to obtain an intermediate model; input the text data and speech data of the preset round dialogue into the intermediate model to obtain the predicted text and predicted audio of the current round output by the intermediate model; and adjust the network parameters of the dialogue model based on a first loss between the predicted text and the text label and a second loss between the predicted audio and the speech data of the preset round dialogue, until the model converges, to obtain the trained dialogue model.
[0028] In some possible implementations, the voice output module is configured as follows: The target speech data is decoded to obtain the target audio, and the target audio is waveform converted to obtain an audio waveform signal. The speech is then played according to the audio waveform signal.
[0029] In a third aspect, embodiments of this specification provide an electronic device, including: a processor; and a memory storing computer instructions that cause the processor to perform the method described in any of the above embodiments.
[0030] In a fourth aspect, embodiments of this specification provide a storage medium storing computer instructions for implementing the method described in any of the above embodiments.
[0031] In a fifth aspect, embodiments of this specification provide a computer program product for implementing the method described in any of the above embodiments.
[0032] The dialogue generation method described in this specification includes generating first text data for the current round of dialogue and acquiring user input data for the current round and several previous rounds of dialogue. Target speech features are obtained by extracting features from the user input data, and target speech data is generated based on the target speech features and the first text data. In this embodiment, by combining the original audio data from multiple rounds of dialogue to extract user speech features, the method accurately analyzes features such as the current user's emotion, speech rhythm, and acoustic environment. It retains refined paralinguistic information that cannot be represented by text semantics, generating a speech response adapted to the current context, providing users with more human-like emotional feedback, and thus improving the effectiveness of voice dialogue.
Description of the Drawings
[0033] To more clearly illustrate the technical solutions in the specific embodiments or related technologies of this specification, the drawings used in the description of the specific embodiments or related technologies will be briefly introduced below. Obviously, the drawings described below are some embodiments of this specification. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0034] Figure 1 is a flowchart of a dialogue generation method in some embodiments of this specification.
[0035] Figure 2 is a schematic diagram illustrating the generation principle of the first text data in some embodiments of this specification.
[0036] Figure 3 is an architecture diagram of the dialogue system in some embodiments of this specification.
[0037] Figure 4 is a flowchart of a dialogue generation method in some embodiments of this specification.
[0038] Figure 5 is a structural block diagram of the dialogue generation apparatus in some embodiments of this specification.
[0039] Figure 6 is a structural block diagram of an electronic device in some embodiments of this specification.
Detailed Implementation
[0040] The technical solutions of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, not all of them. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this specification. Furthermore, the technical features involved in the different embodiments of this specification described below can be combined with each other as long as they do not conflict with each other.
[0041] With the development of artificial intelligence (AI) technology, intelligent voice dialogue systems are widely used in scenarios such as in-vehicle voice assistants, smart speakers, and virtual human-computer interaction. Among them, multi-turn voice dialogue is the core interaction form of intelligent voice dialogue systems, and the core requirement of multi-turn dialogue is to maintain semantic consistency and logical coherence during the dialogue process.
[0042] In related technologies, solutions for consistency in multi-turn dialogues primarily rely on text semantic vectors. By extracting contextual semantic vectors from multiple turns of dialogue, the logical coherence of the dialogue content is ensured. At the speech level, auditory coherence mainly depends on global speech vectors. Models typically use fixed voiceprint features as generation conditions to ensure consistent timbre in machine responses. For example, users can manually select voiceprint features such as "deep male voice" or "sweet female voice," or the speech system can extract voiceprint features based on a reference speech input by the user, allowing the dialogue system to mimic those voiceprint features in its speech responses.
[0043] However, in related technical solutions, text semantics cannot express the subtle paralinguistic information such as tone, emotion, intonation, speech rate, and rhythm of a user's speech. Furthermore, information such as environmental noise background, user emotions, and breathing rhythm in historical audio is completely lost after conversion to text. This results in the dialogue system being unable to express subtle emotions such as "turning tears into laughter" or "awkward / grieved laughter," and can only mechanically broadcast voice responses. Moreover, fixed voiceprints can only ensure the consistency of the machine's tone in responses, and cannot provide emotional feedback that adapts to the user's emotions or context.
[0044] For example, in one scenario, a user asks in a low, tired voice late at night, "What will the weather be like tomorrow?" In related technical solutions, the dialogue system cannot adapt to the user's tired and depressed state, and can only reply with a standard broadcast tone, resulting in an emotional disconnect in human-computer interaction.
[0045] Furthermore, in related technical solutions, due to the loss of paralinguistic information, the model struggles to recognize dialogue content expressed by users in the form of rhetorical questions, sarcasm, or self-deprecation. For example, when a user says "What are you doing?" in a normal tone, it expresses a normal inquiry. However, if a user says "What are you doing!" in a stern, high-pitched tone, it often expresses anger and a desire to stop a certain behavior. But after speech is converted to text, the semantic meaning of the two is the same, and the model cannot recognize the changes in the user's emotions, thus outputting a response that deviates from the user's original context.
[0046] Based on this, the embodiments of this specification provide a dialogue generation method and apparatus, electronic device, storage medium, and computer program product, which aim to extract speech features such as emotion, rhythm, and acoustic environment of the user's speech and output a speech response that is adapted to the user's emotion and environmental background.
[0047] For example, in the scenario described above, if a user asks about the weather late at night in a low, tired voice, the dialogue system can respond with a comforting and gentle tone, rather than a cold, standard broadcast, thus providing emotional feedback to the user and improving the intelligence and humanization of the dialogue system.
[0048] The solution described in this specification is mainly applied to intelligent voice dialogue systems, which can be used in various human-computer dialogue scenarios.
[0049] In one exemplary scenario, the dialogue system can be applied to electronic devices such as smart TVs, smart speakers, mobile phones, in-vehicle terminals, tablets, wearable devices, and all-in-one machines, serving as an intelligent voice assistant for these devices to enable human-computer voice interaction with users.
[0050] In another exemplary scenario, the dialogue system can be applied to applications such as video games to provide real-time voice-over or dialogue for player characters or NPCs (Non-Player Characters) in the game.
[0051] Of course, the application scenarios of the solution in this specification are not limited to the examples above; the solution can be applied to any multi-turn human-computer voice dialogue scenario. The scenarios cannot be listed exhaustively and are not elaborated further here.
[0052] This specification addresses multi-turn dialogue in human-computer voice interaction. One turn of dialogue can be understood as one user statement and one system response. For example, if the user asks, "What will the weather be like tomorrow?", the system might reply, "Tomorrow, Beijing will be sunny with temperatures ranging from 2°C to 6°C, which will be a bit chilly." Thus, one turn of dialogue can be represented as:
{
Q: What will the weather be like tomorrow?
A: Tomorrow, Beijing will be sunny with temperatures ranging from 2°C to 6°C, which will be a bit chilly.
}
[0053] Multi-turn dialogue refers to a series of consecutive user statements and system responses, as shown in the example below.
{
Q: What will the weather be like tomorrow?
A: Tomorrow will be sunny with temperatures ranging from 2°C to 6°C, which will be a bit chilly.
[0054] Q: I'm going to Nanjing on a business trip tomorrow. Could you check the weather in Nanjing for me?
[0055] A: Nanjing will be cloudy tomorrow morning, with light rain expected around 2 pm. Temperatures will range from 4°C to 10°C. Remember to bring an umbrella.
}
[0056] This example dialogue consists of two rounds, each round including one user statement and one system response, and the rounds are logically coherent with one another.
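As a minimal sketch of this structure, a round can be modeled as a question and answer pair and a multi-turn dialogue as an ordered list of rounds. The class and field names (`Turn`, `question`, `answer`) are illustrative assumptions, not terms defined by this specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One round of dialogue: one user statement and one system response."""
    question: str  # user statement (Q)
    answer: str    # system response (A)


# The two-round example dialogue, ordered oldest to newest.
dialogue: List[Turn] = [
    Turn("What will the weather be like tomorrow?",
         "Tomorrow will be sunny with temperatures ranging from 2°C to 6°C, "
         "which will be a bit chilly."),
    Turn("I'm going to Nanjing on a business trip tomorrow. "
         "Could you check the weather in Nanjing for me?",
         "Nanjing will be cloudy tomorrow morning, with light rain expected "
         "around 2 pm. Temperatures will range from 4°C to 10°C. "
         "Remember to bring an umbrella."),
]

num_rounds = len(dialogue)
```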
[0057] As can be seen from the foregoing, in the relevant technical solutions, after the dialogue system collects the user's voice, it converts the audio data input by the user into text data through speech-to-text conversion. Then, the text data is input into the dialogue model, which predicts and outputs the system's response text. Finally, the vocoder broadcasts the system's response text according to a fixed timbre and tone, forming a voice response.
[0058] The approach described in this specification differs from this one. The input to the dialogue system in this specification includes not only the text data after audio-to-text conversion, but also the original audio data of the user's speech. Furthermore, it combines audio and text context data (such as audio and text data from historical multi-turn dialogues) to extract the speech features of the user during the current dialogue through the original audio data. Speech features may include, for example, the emotional features of the user's speech, prosodic features, and acoustic environmental features of the current environment. Based on the speech features, a speech response adapted to the current dialogue context and environment is generated.
[0059] To facilitate understanding, an example of a multi-turn human-computer voice dialogue is provided here. The solution of this specification is explained below in conjunction with this example.
[0060] Example Scenario 1: In Example Scenario 1, the user and the dialogue system engage in three rounds of voice dialogue. Rounds 1 and 2 are previous rounds, and round 3 is the current round. That is, in this example, after the user finishes speaking Q3, the dialogue system generates a response speech A3. The dialogue generation method of this specification is explained below with reference to Figure 1, using Example Scenario 1 as an example.
[0061] As shown in Figure 1, in some embodiments, the dialogue generation method exemplified in this specification includes:
S110. In the current round of a multi-round voice dialogue, generate the first text data to be replied to.
[0062] Referring to Example Scenario 1 above, the current round is the 3rd round of dialogue. First, it is necessary to generate the first text data T{curr_0} to be replied to in the current round (i.e., the 3rd round). The first text data T{curr_0} is the text data that the dialogue system will reply to the user. For example, in Example Scenario 1, in the 3rd round of dialogue, the user asks about the weather in Nanjing tomorrow, and the first text data T{curr_0} is the reply text generated for the user's Q3 voice.
[0063] It can be understood that in multi-turn dialogues there are many ways for a dialogue system to generate text responses, such as generative large language models (LLMs) and various text dialogue models. Therefore, this specification places no restriction on how the first text data T{curr_0} to be replied to in the current round is generated, and any suitable implementation can be adopted. Several implementation examples are given later in this specification and are not described in detail here.
[0064] S120. Acquire user input data of a preset round dialogue.
[0065] It is worth noting that in this specification, after obtaining the first text data to be replied to in the current round, the system does not directly broadcast the first text data via voice as in related technical solutions. As mentioned above, if the first text data is broadcast directly via voice, the dialogue system can only output a voice reply with fixed voiceprint features, producing a voice broadcast with a specific timbre and style, without taking into account the user's current emotion and the context of the dialogue.
[0066] Therefore, in the scheme described in this specification, after obtaining the first text data to be replied to in the current round, it is necessary to further obtain the user input data of the preset round dialogue.
[0067] The preset round dialogue can include the current round and several rounds of dialogue before it. For example, the preset round dialogue can include the current round and N rounds of dialogue before it, where N≥1. In this specification, there is no restriction on the specific value of N. For example, in the aforementioned example scenario 1, assuming that N is 2, the preset round dialogue includes the current round (round 3) dialogue and the two rounds of dialogue before the current round (i.e., round 2 and round 1).
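Selecting the preset round dialogue amounts to taking a window over the dialogue history. The sketch below assumes the history is ordered oldest to newest with the current round last; the function name `select_preset_rounds` and the choice to return fewer rounds when the history is shorter than N + 1 are assumptions, since the specification does not address that case.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_preset_rounds(rounds: Sequence[T], n_previous: int) -> List[T]:
    """Return the current round plus up to `n_previous` rounds before it.

    `rounds` is ordered oldest to newest; the last element is the current round.
    """
    if n_previous < 1:
        raise ValueError("n_previous must be at least 1 (N >= 1)")
    if not rounds:
        raise ValueError("at least the current round is required")
    return list(rounds[-(n_previous + 1):])


# Example Scenario 1: three rounds and N = 2, so all three rounds are selected.
window = select_preset_rounds(["round 1", "round 2", "round 3"], n_previous=2)
```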
[0068] User input data includes raw audio data input by the user, which refers to the user's voice data captured by the microphone, i.e., the raw voice data used for text conversion. Simultaneously, user input data also includes input text data, representing the text data obtained after text conversion of the raw audio data.
[0069] For example, in Example Scenario 1, taking the current round and the previous two rounds as the preset round dialogue, the user input data of the preset round dialogue can be as shown in the table below:
In this specification, A{} represents audio data, T{} represents text data, prev represents a previous round of dialogue, and curr represents the current round of dialogue. Taking the original audio data A{prev_1} as an example, it represents the speech data of the user's speech collected by the microphone of the dialogue system. The original audio data includes not only the speaker's (i.e., the user's) audio waveform but also the ambient audio waveform of the current environment. Similarly, taking the input text data T{prev_1} as an example, it represents the text data obtained by performing text conversion on the original audio data A{prev_1}, extracting only the semantic part of the audio data.
[0070] In other words, in this example, the user input data obtained by the dialogue system for the preset rounds of dialogue includes the original audio data A{prev_1} and input text data T{prev_1} for the first round, the original audio data A{prev_2} and input text data T{prev_2} for the second round, and the original audio data A{curr_3} and input text data T{curr_3} for the current round, represented as: {A{prev_1}, A{prev_2}, A{curr_3}, T{prev_1}, T{prev_2}, T{curr_3}}. The output target of the dialogue system is the response voice data A{curr_0} for the current round (i.e., the third round), which is the target voice data A{curr_0} in this specification.
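The grouping {A{prev_1}, A{prev_2}, A{curr_3}, T{prev_1}, T{prev_2}, T{curr_3}} can be sketched as follows. The `UserInput` class, its field names, and the placeholder byte strings are assumptions for illustration; real audio would be microphone samples rather than labels.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UserInput:
    """User input for one round: raw audio plus its transcript."""
    audio: bytes  # original audio data (placeholder bytes in this sketch)
    text: str     # input text data obtained by speech-to-text conversion


def assemble_user_input(rounds: List[UserInput]) -> Tuple[List[bytes], List[str]]:
    """Group the audio context and the text context, each oldest to newest."""
    audio_context = [r.audio for r in rounds]
    text_context = [r.text for r in rounds]
    return audio_context, text_context


rounds = [
    UserInput(b"<A_prev_1>", "T_prev_1"),
    UserInput(b"<A_prev_2>", "T_prev_2"),
    UserInput(b"<A_curr_3>", "T_curr_3"),
]
audio_context, text_context = assemble_user_input(rounds)
```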
[0071] In the embodiments described in this specification, the process of converting the original audio data into text to obtain the input text data can be referred to the speech-to-text algorithms of related technologies, and will not be described in detail here.
[0072] S130. Extract the target speech features from the user input data of the preset round dialogue, and generate the target speech data based on the target speech features and the first text data.
[0073] It can be understood that the user input data of the preset round dialogue includes both audio context data and text context data. In the solution described in this specification, the dialogue system can perform feature extraction on the original audio data included in the user input data. The purpose of feature extraction is to extract the target speech features contained in the original audio data.
[0074] As mentioned above, the original audio data is the user's speech waveform data collected by the microphone, which includes the speaker's speech data and ambient sound data. In the scheme of this specification, the target speech features may include emotional features used to characterize the user's emotions or feelings, prosodic features used to characterize the user's speech rate / intonation / stress habits, and acoustic environmental features used to characterize the background audio of the environment.
[0075] For example, in the aforementioned example, a user asks about the weather late at night in a low, tired tone. In this case, the original audio data includes the user's emotional state as drowsy and tired, the speech rhythm as slow and low, and the background environment as quiet. In the solution described in this specification, the dialogue system can extract user emotional features, speech rhythm features, and acoustic environment features from the original audio data of preset rounds of dialogue. These features are collectively referred to as target speech features, which reflect the non-textual semantic features carried in the audio data.
[0076] In some implementations, an end-to-end (E2E) dialogue model is employed. That is, the input to the dialogue model consists of the user input data of the preset rounds of dialogue and the first text data T{curr_0}, and the output is the target speech data. The network layers of the dialogue model extract the target speech features by combining the audio and text contexts of the previous rounds of dialogue, and then fuse these features with the first text data to generate the target speech data. The model processing flow is described below.
[0077] It is understandable that by combining audio and text contexts to obtain target speech features, the emotions and context of the current voice dialogue can be integrated into the generated target speech data. The target speech data output by the dialogue system can automatically adapt tone, speech rate, volume, and emotion according to the user's emotions and environment, achieving more humanized emotional feedback.
[0078] For example, in the aforementioned scenario 1, the text content corresponding to the target speech data A{curr_0} generated by the dialogue system in the current round is as follows: It is worth noting that in the embodiments of this specification, the text content corresponding to the target speech data A{curr_0} can be the same as or different from the first text data T{curr_0}. For example, in one example, the text content corresponding to the target speech data A{curr_0} can be the same as the first text data T{curr_0}, but the speech data of the first text data T{curr_0} is broadcast in a tone that adapts to the current dialogue context, rather than a standard broadcast tone. In another example, the text content corresponding to the target speech data A{curr_0} can be the text content after adjusting the first text data T{curr_0} according to the current context. For example, the dialogue system adds comforting text content to the first text data T{curr_0} based on the target speech characteristics, and then broadcasts the modified text content in a tone that adapts to the current dialogue context. Those skilled in the art will understand this, and this specification will not elaborate further.
[0079] S140, Play the target voice data.
[0080] In the embodiments described in this specification, after obtaining the target voice data, the target voice data can be played through a vocoder to output the voice response for the current round to the user.
[0081] For example, in the aforementioned example scenario 1, in the current round (3rd round) of dialogue, although the content of the dialogue system's reply is about the weather, the dialogue system will still broadcast the target voice data A{curr_0} in a comforting and cheerful tone to give the user emotional comfort. Moreover, the speech rate, tone, emphasis, and volume of the voice broadcast are more in line with the current dialogue context and background environment, achieving humanized emotional feedback.
[0082] As can be seen from the above, in the embodiments of this specification, by combining the original audio data of multi-turn dialogues to extract user voice features, the current user's emotions, speech rhythm, acoustic environment and other features are accurately analyzed. This preserves the refined paralinguistic information that cannot be represented by text semantics, generates voice responses that are adapted to the current context, provides users with more humanized emotional feedback, and thus improves the effect of voice dialogue.
[0083] In some implementations, during each round of dialogue, the first text data T{curr_0} to be replied to in the current round can be generated based on the input text data of the current round and the previous several rounds of dialogue.
[0084] For example, as shown in Figure 2(a), a text model can be used to generate the first text data T{curr_0} for the current round. The input of the text model includes the input text data of the current round and several previous rounds of dialogue. For example, in Example Scenario 1, the input of the text model is {T{prev_1}, T{prev_2}, T{curr_3}}, and its output is the first text data T{curr_0} to be replied to in the current round (round 3). In this example, the text model generates the reply text data for the current round, i.e., the first text data T{curr_0}, based on the text context data of the current round and several previous rounds.
[0085] In this example implementation, the text model predicts the first text data from the input text data of the preset rounds of dialogue. By drawing on the text context, it can capture the semantic logic of the multi-round dialogue and keep the reply semantically coherent and relevant to the context. Furthermore, the model takes only text as input, which keeps computation light and makes it suitable for electronic devices with low computing power or strict response-time requirements. In other implementations, during each round of dialogue, the first text data T{curr_0} to be replied to in the current round can be generated based on the user input data of the current round and several previous rounds of dialogue.
[0086] For example, as shown in Figure 2(b), the difference from Figure 2(a) is that the input of the text model includes not only the input text data {T{prev_1}, T{prev_2}, T{curr_3}} of the current round and several previous rounds of dialogue, but also the original audio data {A{prev_1}, A{prev_2}, A{curr_3}} of those rounds. The text model generates the reply text data for the current round, i.e., the first text data T{curr_0}, based on the text context data and audio context data of the current round and several previous rounds of dialogue.
[0087] In this example implementation, the text model combines the input text data and the original audio data of the preset rounds of dialogue to predict the first text data. By fusing audio context information, non-semantic paralinguistic information can be extracted, so that the generated text content not only conforms to semantic logic but also adapts to the current dialogue context, providing a data foundation for the subsequent extraction of speech features.
[0088] In the examples of Figure 2, the text model can adopt any suitable model architecture from the related technologies, and this specification does not limit it.
[0089] In some implementations, after obtaining the first text data T{curr_0} of the current round of dialogue, the dialogue system can employ a speech dialogue model with an end-to-end architecture. The user input data of the preset rounds of dialogue and the first text data serve as the input to the dialogue model, which predicts and outputs the target speech data A{curr_0} for the current round of dialogue. For example, Figure 3 shows the architecture of the dialogue system in some embodiments of this specification, which is described below with reference to Figure 3.
[0090] As shown in Figure 3, the dialogue system includes an audio encoding module, a text encoding module, a sequence concatenation module, a dialogue model, a text guidance generation module, an audio decoder, and a vocoder. Taking the aforementioned example scenario 1 as an example, the generation process of the target speech data A{curr_0} for the current round (round 3) of the dialogue is as follows: First, through the aforementioned processes S110 and S120, the first text data T{curr_0} of the current round of dialogue, the original audio data {A{prev_1}, A{prev_2}, A{curr_3}} of the preset rounds of dialogue, and the input text data {T{prev_1}, T{prev_2}, T{curr_3}} of the preset rounds of dialogue are obtained respectively.
[0091] The audio encoding module encodes the original audio data {A{prev_1}, A{prev_2}, A{curr_3}} to obtain an audio feature sequence. The audio feature sequence is the sequence of tokens obtained when the audio encoding module discretizes the original audio waveform signal. In the field of artificial intelligence, a token is the smallest unit that a model can process. In this example, the audio feature sequence is therefore a sequence of discrete audio tokens.
[0092] The text encoding module encodes the input text data {T{prev_1}, T{prev_2}, T{curr_3}} and the first text data T{curr_0} to obtain a text feature sequence. The text feature sequence is the sequence of text tokens obtained when the text encoding module segments the text. It is worth noting that in Figure 3, for ease of distinction, the text encoding processes of the input text data and the first text data are shown separately. In practice, the two can be handled by the same text encoding module, which will be understood by those skilled in the art and will not be described in detail here.
[0093] Subsequently, the sequence concatenation module concatenates the audio feature sequence output by the audio encoding module and the text feature sequence output by the text encoding module to obtain a continuous and unified input sequence, which is the input feature sequence described in this specification.
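A minimal sketch of such a concatenation is given below. This specification does not fix the ordering or the token-id layout, so two assumptions are made purely for illustration: each round contributes its audio tokens followed by its text tokens, with the tokens of the first text data appended last, and audio token ids are shifted past a hypothetical text vocabulary size so that the two kinds of tokens do not collide. `TEXT_VOCAB_SIZE` and `concat_sequences` are hypothetical names.

```python
from typing import List

TEXT_VOCAB_SIZE = 1000  # hypothetical size of the text token vocabulary


def concat_sequences(audio_tokens: List[List[int]],
                     text_tokens: List[List[int]],
                     first_text_tokens: List[int]) -> List[int]:
    """Build one input feature sequence from per-round audio and text tokens.

    Assumed layout: [A_1, T_1, A_2, T_2, ..., A_curr, T_curr, T_curr_0].
    """
    if len(audio_tokens) != len(text_tokens):
        raise ValueError("each round needs both audio tokens and text tokens")
    sequence: List[int] = []
    for round_audio, round_text in zip(audio_tokens, text_tokens):
        # Shift audio ids into their own range above the text vocabulary.
        sequence.extend(token + TEXT_VOCAB_SIZE for token in round_audio)
        sequence.extend(round_text)
    sequence.extend(first_text_tokens)
    return sequence
```

For two rounds with audio tokens [[1, 2], [3]], text tokens [[10], [11, 12]], and first text tokens [20], the result is [1001, 1002, 10, 1003, 11, 12, 20].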
[0094] In some implementations, the audio encoding module can employ a Transformer-based audio encoder, combined with discretization techniques such as residual vector quantization, to preserve speech features such as prosody and timbre while maintaining semantic information. The text encoding module can employ a tokenizer.
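To make the idea of residual vector quantization concrete, the following toy sketch quantizes one feature vector with a stack of codebooks: each stage picks the nearest codeword for whatever the previous stages left unexplained. It is a generic illustration of the technique, not the encoder of this specification; the function names are hypothetical, and real codebooks are learned rather than written by hand.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, v: Vector) -> int:
    """Index of the codeword with the smallest squared distance to v."""
    def dist2(c: Vector) -> float:
        return sum((ci - vi) ** 2 for ci, vi in zip(c, v))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def rvq_encode(v: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize v stage by stage; each stage encodes the remaining residual."""
    residual = list(v)
    indices: List[int] = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct the vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out
```

The list of indices returned by `rvq_encode` plays the role of the discrete audio tokens for one frame; later codebooks add finer detail to the coarse approximation made by the first one.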
[0095] Then, the input feature sequence is fed into the pre-trained dialogue model. In some implementations, the dialogue model can employ a large language model with an autoregressive decoder-only architecture. The conditional probability of the dialogue model generating the target speech data is represented as: P(A{curr_0} ∣ T{curr_0}, A{hist}, T{hist}) ∝ Decoder(Encoder(T{curr_0}), Context(A{hist}, T{hist})). Here, A{hist} represents the original audio data of the preset rounds of dialogue, T{hist} represents the input text data of the preset rounds of dialogue, P represents the probability, Decoder() represents decoding, and Encoder() represents encoding. In some implementations of this specification, the dialogue model uses a standard self-attention mechanism to attend to the correlations between tokens at different positions in the input feature sequence, modeling audio tokens and text tokens jointly. Since the original audio data of the preset rounds of dialogue is concatenated directly into the input feature sequence, the model can extract the target speech features of the previous rounds of dialogue through the attention mechanism, without the need for an explicit feature extraction or injection module.
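Because the dialogue model is described as autoregressive, the conditional probability above can also be read token by token. Writing the target speech data as a sequence of audio tokens a_1, ..., a_N (the symbols a_n and N are introduced here for illustration and do not appear in this specification), the chain rule of probability gives:

```latex
P\left(A_{\mathrm{curr\_0}} \mid T_{\mathrm{curr\_0}}, A_{\mathrm{hist}}, T_{\mathrm{hist}}\right)
  = \prod_{n=1}^{N} P\left(a_n \mid a_{<n},\; T_{\mathrm{curr\_0}},\; A_{\mathrm{hist}},\; T_{\mathrm{hist}}\right)
```

Each factor corresponds to one decoding step, in which the model attends to the concatenated audio and text context together with the audio tokens already generated.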
[0096] In the above implementation, by encoding and sequentially concatenating the text and audio data of the preset rounds of dialogue, the audio and text features are fused in an orderly manner. The dialogue model can extract target speech features that integrate text semantics and audio paralinguistic features, capture key features such as user emotions and the acoustic environment, and make the target speech data output by the model better fit the current context, while reducing model complexity and training cost.
[0097] In some implementations, during the process of predicting and generating the target speech data A{curr_0}, the dialogue model can simultaneously generate second text data based on the input feature sequence. The second text data refers to the text data corresponding to the target speech data A{curr_0}. Then, the text guidance generation module uses the second text data to guide the speech features of the output result, thereby obtaining the target speech data A{curr_0} output by the dialogue model.
[0098] In the above implementation, the dialogue model simultaneously extracts target speech features and second text data from the input feature sequence, and uses the second text data to guide the generation of target speech data. The second text data can form a secondary constraint and guidance at the semantic level for speech generation, which can not only ensure that the target speech features fit the user's emotions and acoustic environment, but also make the generated target speech data maintain logical coherence in semantics and avoid semantic deviation.
[0099] It is understandable that the target speech data A{curr_0} output by the dialogue model is first decoded by the audio decoder to obtain the target audio, which is a sequence of audio tokens. The vocoder can restore the audio token sequence to a complete audio waveform signal. Then the speaker plays the speech according to the audio waveform signal, and the dialogue system realizes the speech response of the current round of dialogue.
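The order of the output stages can be sketched as a simple composition of three callables. The function name `render_reply` and the parameter types are hypothetical; the audio decoder, vocoder, and playback function are passed in as placeholders because this specification does not prescribe their implementations.

```python
from typing import Callable, List, Sequence


def render_reply(target_speech_data: Sequence[int],
                 audio_decoder: Callable[[Sequence[int]], List[int]],
                 vocoder: Callable[[List[int]], List[float]],
                 play: Callable[[List[float]], None]) -> List[float]:
    """Run the output stages in order: decode, restore the waveform, play."""
    # Audio decoder: model output -> target audio (a sequence of audio tokens).
    target_audio = audio_decoder(target_speech_data)
    # Vocoder: audio token sequence -> audio waveform signal.
    waveform = vocoder(target_audio)
    # Speaker: play the waveform as the reply for the current round.
    play(waveform)
    return waveform
```

Keeping the stages as separate callables mirrors the module boundaries of Figure 3, so any one stage could be swapped without touching the others.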
[0100] By using audio decoding and vocoder audio restoration, an end-to-end conversion from target speech data generated by the model to speech playback is achieved, preserving the emotion, rhythm and other features of the target speech data, and improving the human-computer interaction experience.
[0101] As described above, the implementation method in this specification, by combining the audio and text contexts of previous dialogue rounds, can ensure the logical consistency of multi-turn dialogues while enabling the dialogue system to adapt to user emotions and contexts, providing more human-like emotional feedback. Furthermore, through an end-to-end dialogue model architecture, the dialogue model directly outputs the voice response data for each round of dialogue, avoiding the disconnect between the model output and the user input in acoustic environment modeling, thus better ensuring the coherence of dialogue speech and semantics. Moreover, by uniformly concatenating the audio and text contexts into a complete input sequence as model input, model complexity and training costs are reduced.
[0102] In some implementations, this specification provides a method for training the above-described dialogue model, which is described below with reference to Figure 4.
[0103] As shown in Figure 4, in some implementations, the dialogue generation method exemplified in this specification includes the following model training process:
S410. Obtain the sample dataset.
[0104] A sample dataset is the collection of sample data used to train the dialogue model. It includes a large amount of sample data, which can originate from multi-turn human-to-human dialogues in real-world scenarios. Each sample data item contains multiple turns of dialogue, and each turn includes both text data and audio data. In addition, the reply in each turn can be annotated with a text label.
[0105] In the embodiments described in this specification, the sample data can contain rich paralinguistic information, such as breathing, sighs, laughter, interjections, pauses, etc. The scale of the sample data can reach a massive level, thereby ensuring that the dialogue model exhibits context-based few-shot learning capabilities.
[0106] S420. For each sample data, input the text data and speech data of the preset round dialogue into the dialogue model to be trained, and obtain the predicted text of the current round dialogue output by the dialogue model.
[0107] S430. Based on the loss between the predicted text and the text labels of the current round of dialogue included in the sample data, adjust the network parameters of the dialogue model to obtain an intermediate model.
[0108] In the implementation method described in this specification, model training is divided into two stages: a speech understanding stage and a speech understanding-generation joint training stage. In the speech understanding stage, only the text prediction loss is calculated, enabling the model to learn to understand speech input and establish a relational representation between audio and text. In the speech understanding-generation joint training stage, the text prediction loss and the audio prediction loss are calculated together, enabling the model to generate speech. S420-S430 represent the training process of the speech understanding stage.
[0109] In processes S420 to S430, taking any one sample or batch of sample data as an example and referring to Figure 3 above, the sample data, which includes the text data and speech data of the preset rounds of dialogue, is first encoded by the audio encoding module and the text encoding module, and then concatenated by the sequence concatenation module before being input into the dialogue model to be trained.
[0110] The dialogue model first predicts the text for the current round of dialogue. This predicted text is the reply content that the model predicts for the current round, while the text labels in the sample data represent the reply content in the real human-to-human dialogue. The goal of model training is to minimize the loss between the model's predicted text and the text labels. Therefore, in this stage, the loss between the model's predicted text and the text labels can be calculated based on a preset loss function. This loss value is the text prediction loss. Then, the network parameters of the dialogue model are adjusted based on this loss value, thus completing one round of iterative training. The loss function can be, for example, the cross-entropy loss function; this specification does not impose any restrictions.
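Taking the cross-entropy example mentioned above, the text prediction loss can be sketched as the mean negative log-probability that the model assigns to each label token. The function below is a generic, dependency-free illustration; the name `cross_entropy` and the representation of model outputs as per-position lists of logits are assumptions, not details from this specification.

```python
import math
from typing import List


def cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy between per-position logits and label token ids."""
    if len(logits) != len(labels) or not labels:
        raise ValueError("need one non-empty logit row per label")
    total = 0.0
    for row, label in zip(logits, labels):
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[label]  # equals -log softmax(row)[label]
    return total / len(labels)
```

A model that is undecided between two tokens incurs a loss of ln 2 at that position, while a model that puts nearly all probability on the label incurs a loss close to zero.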
[0111] The above describes one round of iterative training for the speech understanding stage. Repeat the above iterative training process until the dialogue model meets the convergence condition, and the training of the speech understanding stage is completed, resulting in an intermediate model.
[0112] S440. Input the text data and speech data of the preset rounds of dialogue into the intermediate model to obtain the predicted text and predicted audio of the current round of dialogue output by the intermediate model.
[0113] S450. Based on the first loss between the predicted text and the text labels, and the second loss between the predicted audio and the speech data of the preset rounds of dialogue, adjust the network parameters of the dialogue model until the model converges, to obtain the trained dialogue model.
[0114] After completing the model training for the speech understanding phase (S420-S430), the speech understanding-generation joint training phase (S440-S450) can begin.
[0115] In processes S440 to S450, again taking any one sample or batch of sample data as an example and referring to Figure 3 above, the sample data, which includes the text data and speech data of the preset rounds of dialogue, is first encoded by the audio encoding module and the text encoding module, and then concatenated by the sequence concatenation module before being input into the intermediate model obtained from the first training stage.
[0116] The intermediate model first predicts the text for the current round of dialogue, then calculates the loss value based on the predicted text and the text labels to obtain the first loss. This process is described in steps S420-S430 above and will not be repeated here. At the same time, the intermediate model predicts the audio for the current round of dialogue. This predicted audio is the audio data that the model predicts for the reply in the current round. The speech features of this predicted audio should match the user's speech data from the previous rounds of dialogue as closely as possible. Therefore, in this stage, the speech data of the preset rounds of dialogue can be used as audio labels. Based on a preset loss function, the loss value between the model's predicted audio and the audio labels is calculated to obtain the second loss.
[0117] In this training phase, the loss function includes loss terms for both the first and second losses, and corresponding weight values can be set for each. For example, in some implementations, the weight value of the first loss can be set higher than that of the second loss to prioritize the accuracy of semantic understanding. Then, the network parameters of the dialogue model are tuned based on the first and second losses, thus completing one round of iterative training.
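One simple way to combine the two loss terms is a weighted sum. The sketch below assumes that form; the function name `joint_loss` and the default weights 1.0 and 0.5 are hypothetical values chosen only so that the first (text) loss outweighs the second (audio) loss, as in the example above.

```python
def joint_loss(text_loss: float, audio_loss: float,
               text_weight: float = 1.0, audio_weight: float = 0.5) -> float:
    """Weighted sum of the first (text) loss and the second (audio) loss."""
    if text_weight < 0 or audio_weight < 0:
        raise ValueError("loss weights must be non-negative")
    return text_weight * text_loss + audio_weight * audio_loss
```

Setting `audio_weight` to zero reduces this to the text prediction loss alone, which corresponds to the objective of the speech understanding stage.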
[0118] The above describes one round of iterative training in the joint training phase of speech understanding and generation. Repeat the above iterative training process until the dialogue model meets the convergence condition, and the training of this phase is completed, resulting in the trained dialogue model.
[0119] As described above, in this embodiment, a two-stage progressive model training process is employed. First, based on text prediction loss, the model fully understands the speech input and establishes a relational representation between audio and text, improving the model's semantic understanding ability and ensuring logical consistency in multi-turn dialogues. Then, the text and audio prediction losses are jointly calculated, allowing the model to learn speech generation capabilities while retaining strong semantic understanding. This enables the model to accurately capture the semantic relationship between speech and text, generating context-appropriate response speech, thereby improving the effectiveness of voice dialogue.
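The two-stage schedule can be outlined as two convergence loops run back to back. In this sketch each stage is represented by a callable that performs one training iteration and returns its loss. The names `run_stage` and `train_two_stage`, and the convergence condition used here (the loss changing by less than a tolerance between iterations), are assumptions made for illustration, since this specification does not define the convergence condition.

```python
from typing import Callable, Tuple


def run_stage(step: Callable[[], float], max_iters: int = 1000,
              tol: float = 1e-4) -> Tuple[int, float]:
    """Repeat `step` until the loss changes by less than `tol`.

    Returns (iterations used, last loss). Stops at `max_iters` regardless.
    """
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    previous = float("inf")
    loss = previous
    for iteration in range(1, max_iters + 1):
        loss = step()
        if abs(previous - loss) < tol:
            return iteration, loss
        previous = loss
    return max_iters, loss


def train_two_stage(understanding_step: Callable[[], float],
                    joint_step: Callable[[], float],
                    max_iters: int = 1000, tol: float = 1e-4):
    """Stage 1 (S420-S430) yields the intermediate model; stage 2 (S440-S450)
    continues from it with the joint text and audio objective."""
    stage1 = run_stage(understanding_step, max_iters, tol)
    stage2 = run_stage(joint_step, max_iters, tol)
    return stage1, stage2
```

In practice, `understanding_step` would update the model using only the text prediction loss, and `joint_step` would update the same model using both losses.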
[0120] In some embodiments, this specification provides a dialogue generation apparatus. As shown in Figure 5, the apparatus includes:
a text generation module 10, configured to generate, in the current round of a multi-round voice dialogue, the first text data to be replied to in the current round;
a data acquisition module 20, configured to acquire user input data of the preset rounds of dialogue, where the user input data includes original audio data input by the user and input text data obtained by converting the original audio data into text, and the preset rounds of dialogue include the current round and several rounds of dialogue before it;
a dialogue model module 30, configured to extract target speech features from the user input data of the preset rounds of dialogue, and to generate target speech data based on the target speech features and the first text data; and
a voice output module 40, configured to play the target speech data.
[0121] In some embodiments, the text generation module 10 is configured to:
in the current round of a multi-round voice dialogue, generate the first text data to be replied to in the current round based on the input text data of the current round and the previous several rounds of dialogue; or
in the current round of a multi-round voice dialogue, generate the first text data to be replied to in the current round based on the user input data of the preset rounds of dialogue.
[0122] In some implementations, the target speech features include the user's emotional features, prosodic features, and acoustic environment features.
[0123] In some implementations, the dialogue model module 30 is configured to: perform feature encoding on the user input data of the preset rounds of dialogue and the first text data to obtain feature sequences, and concatenate the encoded feature sequences to obtain the input feature sequence; and input the input feature sequence into a pre-trained dialogue model, which extracts features from the input feature sequence to obtain the target speech features, and predicts and generates the target speech data based on the target speech features.
[0124] In some implementations, the dialogue model module 30 is configured to: encode, through the audio encoding module, the original audio data included in the user input data to obtain an audio feature sequence, and encode, through the text encoding module, the input text data included in the user input data and the first text data to obtain a text feature sequence; and concatenate the audio feature sequence and the text feature sequence through the sequence concatenation module to obtain the input feature sequence.
[0125] In some implementations, the dialogue model module 30 is configured to: input the input feature sequence into a pre-trained dialogue model, which extracts features from the input feature sequence to obtain the target speech features and the second text data; and predict and generate the target speech data based on the second text data and the target speech features.
[0126] In some embodiments, the apparatus further includes a training module configured to: obtain a sample dataset, which includes multiple sample data, each of which includes text data and speech data of a multi-turn dialogue; for each sample data, input the text data and speech data of the preset rounds of dialogue into the dialogue model to be trained to obtain the predicted text of the current round of dialogue output by the dialogue model; based on the loss between the predicted text and the text labels of the current round of dialogue included in the sample data, adjust the network parameters of the dialogue model to obtain an intermediate model; input the text data and speech data of the preset rounds of dialogue into the intermediate model to obtain the predicted text and predicted audio of the current round of dialogue output by the intermediate model; and based on the first loss between the predicted text and the text labels, and the second loss between the predicted audio and the speech data of the preset rounds of dialogue, adjust the network parameters of the dialogue model until the model converges, thus obtaining the trained dialogue model.
[0127] In some embodiments, the voice output module 40 is configured to: decode the target speech data to obtain the target audio, perform waveform conversion on the target audio to obtain an audio waveform signal, and play the speech according to the audio waveform signal.
[0128] In some embodiments, this specification provides an electronic device comprising: a processor; and a memory storing computer instructions that cause the processor to perform the method described in any of the above embodiments.
[0129] In some embodiments, this specification provides a storage medium storing computer instructions for implementing the methods described in any of the above embodiments.
[0130] In some embodiments, this specification provides a computer program product for implementing the methods described in any of the above embodiments.
[0131] Figure 6 shows a structural block diagram of an electronic device according to some embodiments of the present disclosure. Some embodiments of the electronic device are described below with reference to Figure 6.
[0132] Referring to Figure 6, the electronic device 1800 may include one or more of the following components: processing component 1802, memory 1804, power supply component 1806, multimedia component 1808, audio component 1810, input / output (I / O) interface 1812, sensor component 1816, and communication component 1818.
[0133] Processing component 1802 typically controls the overall operation of electronic device 1800, such as operations associated with display, telephone calls, data communication, camera operation, and recording operations. Processing component 1802 may include one or more processors 1820 to execute instructions. Furthermore, processing component 1802 may include one or more modules to facilitate interaction between processing component 1802 and other components. For example, processing component 1802 may include a multimedia module to facilitate interaction between multimedia component 1808 and processing component 1802. As another example, processing component 1802 may read executable instructions from memory to implement relevant functions of the electronic device.
[0134] Memory 1804 is configured to store various types of data to support the operation of electronic device 1800. Examples of this data include instructions for any application or method operating on electronic device 1800, contact data, phonebook data, messages, pictures, videos, etc. Memory 1804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.
[0135] Power supply component 1806 provides power to various components of electronic device 1800. Power supply component 1806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 1800.
[0136] The multimedia component 1808 includes a display screen that provides an output interface between the electronic device 1800 and the user. In some embodiments, the multimedia component 1808 includes a front-facing camera and / or a rear-facing camera. When the electronic device 1800 is in an operating mode, such as a shooting mode or a video mode, the front-facing camera and / or the rear-facing camera can receive external multimedia data. Each front-facing camera and rear-facing camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
[0137] Audio component 1810 is configured to output and / or input audio signals. For example, audio component 1810 includes a microphone (MIC) configured to receive external audio signals when electronic device 1800 is in an operating mode, such as call mode, recording mode, and voice recognition mode. The received audio signals may be further stored in memory 1804 or transmitted via communication component 1818. In some embodiments, audio component 1810 also includes a speaker for outputting audio signals.
[0138] I / O interface 1812 provides an interface between processing component 1802 and peripheral interface modules, such as keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to, home buttons, volume buttons, power buttons, and lock buttons.
[0139] Sensor assembly 1816 includes one or more sensors for providing state assessments of various aspects of electronic device 1800. For example, sensor assembly 1816 may detect the on / off state of electronic device 1800, the relative positioning of components such as the display and keypad of electronic device 1800, changes in position of electronic device 1800 or a component of electronic device 1800, the presence or absence of user contact with electronic device 1800, the orientation or acceleration / deceleration of electronic device 1800, and temperature changes of electronic device 1800. Sensor assembly 1816 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor assembly 1816 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor assembly 1816 may also include an accelerometer, gyroscope, magnetometer, pressure sensor, or temperature sensor.
[0140] Communication component 1818 is configured to facilitate wired or wireless communication between electronic device 1800 and other devices. Electronic device 1800 can access wireless networks based on communication standards, such as Wi-Fi, 2G, 3G, 4G, 5G, or 6G, or combinations thereof. In one exemplary embodiment, communication component 1818 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 1818 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
[0141] In an exemplary embodiment, the electronic device 1800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components.
[0142] The above embodiments are merely examples given for clarity of illustration and are not intended to limit the possible implementations. Those skilled in the art will recognize that other variations or modifications can be made based on the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or modifications derived from the above description remain within the scope of protection of this specification.
Claims
1. A dialogue generation method, characterized by comprising: in the current round of a multi-turn voice dialogue, generating first text data to be replied; acquiring user input data of a preset round dialogue, wherein the user input data includes original audio data input by the user and input text data obtained by converting the original audio data into text, and the preset round dialogue includes the current round and several rounds of dialogue preceding it; extracting target speech features from the user input data of the preset round dialogue, and generating target speech data based on the target speech features and the first text data; and playing the target speech data.
2. The method according to claim 1, characterized in that generating, in the current round of the multi-turn voice dialogue, the first text data to be replied comprises: generating the first text data to be replied in the current round based on the input text data of the preset round dialogue; or generating the first text data to be replied in the current round based on the user input data of the preset round dialogue.
3. The method according to claim 1 or 2, characterized in that extracting the target speech features from the user input data of the preset round dialogue, and generating the target speech data based on the target speech features and the first text data, comprises: feature-encoding the user input data of the preset round dialogue and the first text data to obtain feature sequences, and concatenating the encoded feature sequences to obtain an input feature sequence; and inputting the input feature sequence into a pre-trained dialogue model, wherein the dialogue model extracts features from the input feature sequence to obtain the target speech features, and predicts and generates the target speech data based on the target speech features.
4. The method according to claim 3, characterized in that inputting the input feature sequence into the pre-trained dialogue model, wherein the dialogue model extracts features from the input feature sequence to obtain the target speech features, and predicts and generates the target speech data based on the target speech features, comprises: inputting the input feature sequence into the pre-trained dialogue model, wherein the dialogue model performs feature extraction on the input feature sequence to obtain the target speech features and second text data; and predicting and generating the target speech data based on the second text data and the target speech features.
5. The method according to claim 3, characterized in that the training process of the dialogue model comprises: obtaining a sample dataset that includes multiple pieces of sample data, each of which includes text data and speech data of a multi-turn dialogue; for each piece of sample data, inputting the text data and speech data of the preset round dialogue into the dialogue model to be trained to obtain predicted text of the current round of dialogue output by the dialogue model; adjusting the network parameters of the dialogue model based on the loss between the predicted text and the text label of the current round of dialogue included in the sample data, to obtain an intermediate model; inputting the text data and speech data of the preset round dialogue into the intermediate model to obtain predicted text and predicted audio of the current round of dialogue output by the intermediate model; and adjusting the network parameters of the dialogue model, based on a first loss between the predicted text and the text label and a second loss between the predicted audio and the speech data of the preset round dialogue, until the model converges, to obtain the trained dialogue model.
6. The method according to claim 1, characterized in that playing the target speech data comprises: decoding the target speech data to obtain target audio; performing waveform conversion on the target audio to obtain an audio waveform signal; and playing speech according to the audio waveform signal.
7. A dialogue generation apparatus, characterized by comprising: a text generation module configured to generate, in the current round of a multi-turn voice dialogue, first text data to be replied; a data acquisition module configured to acquire user input data of a preset round dialogue, wherein the user input data includes original audio data input by the user and input text data obtained by converting the original audio data into text, and the preset round dialogue includes the current round and several rounds of dialogue preceding it; a dialogue model module configured to extract target speech features from the user input data of the preset round dialogue, and to generate target speech data based on the target speech features and the first text data; and a voice output module configured to play the target speech data.
8. The apparatus according to claim 7, characterized in that the text generation module is configured to: in the current round of the multi-turn voice dialogue, generate the first text data to be replied in the current round based on the input text data of the current round and the several rounds of dialogue preceding it; or, in the current round of the multi-turn voice dialogue, generate the first text data to be replied in the current round based on the user input data of the preset round dialogue.
9. The apparatus according to claim 7, characterized in that the dialogue model module is configured to: feature-encode the user input data of the preset round dialogue and the first text data to obtain feature sequences, and concatenate the encoded feature sequences to obtain an input feature sequence; and input the input feature sequence into a pre-trained dialogue model, wherein the dialogue model extracts features from the input feature sequence to obtain the target speech features, and predicts and generates the target speech data based on the target speech features.
10. An electronic device, characterized by comprising: a processor; and a memory storing computer instructions for causing the processor to perform the method according to any one of claims 1 to 6.
11. A storage medium, characterized in that the storage medium stores computer instructions for implementing the method according to any one of claims 1 to 6.
12. A computer program product, characterized in that the computer program product is used to implement the method according to any one of claims 1 to 6.
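As an illustration only, the data flow recited in claims 1 and 3 (selecting the preset round dialogue, feature-encoding the user input data and the first text data, and concatenating the results into an input feature sequence) might be sketched in Python as follows. The names Turn, select_preset_rounds, encode_text, encode_audio, and build_input_sequence are hypothetical and do not appear in this specification, and the two encoders are toy placeholders standing in for learned encoders; the pre-trained dialogue model itself is not modeled.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Turn:
    """User input data for one round: raw audio samples and their transcript."""
    audio: List[float]  # original audio samples, assumed to lie in [-1.0, 1.0]
    text: str           # input text obtained by converting the audio into text


def select_preset_rounds(history: Sequence[Turn], current: Turn,
                         n_prev: int) -> List[Turn]:
    """Return up to n_prev earlier rounds followed by the current round."""
    prev = list(history[-n_prev:]) if n_prev > 0 else []
    return prev + [current]


def encode_text(text: str) -> List[float]:
    """Toy text encoder (assumption): one scaled character code per character."""
    return [ord(ch) / 255.0 for ch in text]


def encode_audio(samples: Sequence[float], frame: int = 4) -> List[float]:
    """Toy audio encoder (assumption): mean absolute amplitude per frame."""
    features = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        features.append(sum(abs(s) for s in chunk) / len(chunk))
    return features


def build_input_sequence(rounds: Sequence[Turn], first_text: str) -> List[float]:
    """Encode every round's audio and text, then the reply text, and concatenate."""
    sequence: List[float] = []
    for turn in rounds:
        sequence.extend(encode_audio(turn.audio))
        sequence.extend(encode_text(turn.text))
    sequence.extend(encode_text(first_text))
    return sequence


# Example: three earlier rounds, a window of the current round plus two before it.
history = [Turn([0.1] * 4, "a"), Turn([0.2] * 4, "b"), Turn([0.3] * 4, "c")]
current = Turn([0.5] * 8, "hi")
rounds = select_preset_rounds(history, current, n_prev=2)
input_sequence = build_input_sequence(rounds, first_text="ok")
```

In this sketch the resulting input_sequence is what would be passed to the dialogue model. The number of preceding rounds (n_prev) and the frame length are free parameters chosen here for illustration; the specification does not fix them.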