Dialogue data construction, voice interaction method and device, equipment and medium
By constructing dynamic dialogue data, target dialogue voices including speaker interruptions and multi-turn dialogue scenarios are generated, and a voice interaction model is trained. This solves the problem of insufficient ability of existing models to handle real-world interactive scenarios and improves the user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ALIBABA (CHINA) CO LTD
- Filing Date
- 2024-12-18
- Publication Date
- 2026-06-19
AI Technical Summary
Existing voice interaction models lack the ability to handle real-world interactive scenarios, resulting in a poor user experience.
By constructing dynamic dialogue data, including interruptions between speakers and multi-turn dialogue scenarios, target dialogue speech is generated and a voice interaction model is trained.
It improves the adaptability of the voice interaction model to real-world interaction scenarios and enhances the user experience.
Figure CN122245287A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a dialogue data construction method, a voice interaction method, an apparatus, a device, and a medium.
Background Technology
[0002] With the development of computer technology, voice interaction systems capable of engaging in voice dialogue with users are becoming increasingly widespread. These systems can interact with users based on pre-trained voice interaction models.
[0003] Currently, the sample datasets used to train such voice interaction models mainly consist of voice dialogue data and its transcribed text. However, a voice interaction model trained on existing voice dialogue data performs poorly: it lacks the ability to handle real-world interaction scenarios, which leads to a poor user experience.
Summary of the Invention
[0004] This application provides a dialogue data construction method, voice interaction method, apparatus, device, and medium that can improve the user's voice interaction experience.
[0005] In a first aspect, this application provides a method for constructing dialogue data, the method comprising:
[0006] Acquire initial dialogue data to be processed; the initial dialogue data includes initial dialogue text; the initial dialogue text does not contain dialogue content in the target dialogue scenario; the target dialogue scenario is an interaction scenario between a first speaker and a second speaker;
[0007] Based on the initial dialogue text, construct target dialogue speech that includes the dialogue content in the target dialogue scenario;
[0008] Obtain target dialogue data based on the target dialogue speech; the target dialogue data is used to train a preset model to obtain a voice interaction model, and the voice interaction model is used to conduct voice interaction with the user in the target dialogue scenario.
[0009] Optionally, the interaction scenario between the first speaker and the second speaker includes: the first speaker interrupting the second speaker, and the step of constructing target dialogue speech including dialogue content under the target dialogue scenario based on the initial dialogue text includes:
[0010] An interrupt marker is inserted at the target position of the initial dialogue text to obtain dialogue text including the interrupt marker;
[0011] Based on the dialogue text including the interruption marker, generate a target dialogue text at the target location where the first speaker interrupts the second speaker.
[0012] The target dialogue text is converted into speech to obtain the target dialogue speech.
[0013] Optionally, inserting an interruption marker at the target position of the initial dialogue text to obtain dialogue text including the interruption marker includes:
[0014] Obtain the first prompt content; the first prompt content includes: the initial dialogue text, the first instruction information, and at least one of the following: including an instance of the dialogue text with an interruption marker, and the condition that the target position must meet; the first instruction information is used to instruct the insertion of the interruption marker into the initial dialogue text;
[0015] The first prompt content is input into the interruption marker processing model to obtain the dialogue text including the interruption marker.
[0016] Optionally, generating target dialogue text at the target location where the first speaker interrupts the second speaker, based on the dialogue text including the interruption marker, includes:
[0017] Obtain the second prompt content, which includes: the dialogue text including the interruption marker, the second instruction information, and at least one of the following: a dialogue text instance in the target dialogue scenario, and the conditions that the target dialogue text must meet; the second instruction information is used to instruct the generation, based on the dialogue text including the interruption marker, of a target dialogue text in which the first speaker interrupts the second speaker at the target position.
[0018] The second prompt content is input into the text generation model to obtain the target dialogue text.
[0019] Optionally, the interaction between the first speaker and the second speaker includes: the first speaker interrupting the second speaker; the initial dialogue data further includes: initial dialogue speech; and before obtaining the target dialogue data based on the target dialogue speech, the method further includes:
[0020] Obtain, from the initial dialogue speech, overlapping dialogue speech in which the speakers' voices overlap;
[0021] Based on the overlapping dialogue speech, obtain the target dialogue speech.
[0022] Optionally, obtaining the target dialogue speech based on the overlapping dialogue speech includes:
[0023] Obtain the timbre similarity of the different speakers in the overlapping dialogue speech;
[0024] Overlapping dialogue speech whose timbre similarity is greater than or equal to a preset similarity threshold is filtered out to obtain the filtered overlapping dialogue speech.
[0025] Based on the filtered overlapping dialogue speech, the target dialogue speech is obtained.
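The timbre-based filtering step can be illustrated with a minimal sketch. It assumes each overlapping segment already carries one timbre embedding per speaker and that similarity is measured as cosine similarity; the embedding source, the function names, and the 0.8 threshold are illustrative assumptions and are not specified in this application.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical record: (segment id, timbre embedding of speaker 1, timbre embedding of speaker 2).
Segment = Tuple[str, Sequence[float], Sequence[float]]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def filter_by_timbre(segments: List[Segment], threshold: float = 0.8) -> List[str]:
    """Drop segments whose speakers sound alike (similarity >= threshold); keep the rest."""
    return [
        seg_id
        for seg_id, emb_1, emb_2 in segments
        if cosine_similarity(emb_1, emb_2) < threshold
    ]


segments = [
    ("seg-1", [1.0, 0.0], [0.0, 1.0]),  # clearly different timbres
    ("seg-2", [1.0, 0.0], [1.0, 0.0]),  # identical timbres, filtered out
]
kept = filter_by_timbre(segments, threshold=0.8)
```

In this sketch, segments with highly similar timbres are removed because the two speakers would be hard to tell apart in the overlapping region.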
[0026] Optionally, obtaining the target dialogue speech based on the filtered overlapping dialogue speech includes:
[0027] Based on the filtered overlapping dialogue speech, a speech segment in the initial dialogue speech that meets a preset condition is determined as the target dialogue speech; the preset condition includes at least one of the following: the speech segment includes the filtered overlapping dialogue speech, and the duration of the filtered overlapping dialogue speech within the speech segment is greater than or equal to a preset duration.
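One possible reading of the preset condition is sketched below: a clip is cut from the initial dialogue speech only if the overlap lasts at least a minimum duration, and the clip is widened by some surrounding context so that it contains the overlap. The context padding, the parameter names, and the default values are assumptions made for illustration only.

```python
from typing import Optional, Tuple


def select_clip(
    overlap: Tuple[float, float],
    total_duration: float,
    context: float = 2.0,      # assumed padding (seconds) kept around the overlap
    min_overlap: float = 0.5,  # assumed preset duration (seconds)
) -> Optional[Tuple[float, float]]:
    """Return (start, end) of a clip containing the overlap, or None if the overlap is too short."""
    start, end = overlap
    if end - start < min_overlap:
        return None
    # Clamp the padded clip to the bounds of the initial dialogue speech.
    return (max(0.0, start - context), min(total_duration, end + context))


clip = select_clip((3.0, 4.0), total_duration=10.0)
too_short = select_clip((3.0, 3.2), total_duration=10.0)
```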
[0028] Optionally, obtaining overlapping speech in the initial dialogue speech where speaker voices overlap includes:
[0029] Speaker segmentation is performed on the initial dialogue speech to obtain speaker segmentation results; the speaker segmentation results are used to characterize the speech segments corresponding to each speaker in the initial dialogue speech.
[0030] Based on the speaker segmentation results, the overlapping dialogue speech in which the speakers' voices overlap is identified.
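As a minimal sketch of this step, assume the speaker segmentation results are available as a list of time-stamped turns, one per speaker segment. The `Turn` type and the `find_overlaps` helper are hypothetical names; the application does not prescribe a data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def find_overlaps(turns: List[Turn]) -> List[Tuple[float, float, str, str]]:
    """Return (start, end, earlier_speaker, later_speaker) for each overlap between different speakers."""
    overlaps = []
    ordered = sorted(turns, key=lambda t: t.start)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # turns are sorted by start, so no later turn can overlap a
            if a.speaker != b.speaker:
                overlaps.append((b.start, min(a.end, b.end), a.speaker, b.speaker))
    return overlaps


turns = [Turn("A", 0.0, 4.0), Turn("B", 3.2, 6.0), Turn("A", 6.5, 8.0)]
overlaps = find_overlaps(turns)
```

Here speaker B starts at 3.2 s while speaker A is still talking until 4.0 s, so that interval is reported as overlapping dialogue speech.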
[0031] Optionally, the interaction scenario between the first speaker and the second speaker may further include: the first speaker and the second speaker engaging in multi-turn dialogue, wherein the initial dialogue speech is multi-turn dialogue speech, and/or the initial dialogue text is multi-turn dialogue text.
[0032] Optionally, obtaining target dialogue data based on the target dialogue speech includes:
[0033] Based on the timestamps corresponding to the speech content of each speaker in the target dialogue speech, the duration of the target dialogue speech and the time interval between different speakers are adjusted to obtain the adjusted target dialogue speech.
[0034] The target dialogue data is obtained based on the adjusted target dialogue speech.
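The timestamp-based adjustment could take many forms. The sketch below shows one hypothetical policy: silent gaps between consecutive turns that exceed a maximum are shortened, which also shortens the overall duration, while overlapping turns keep their relative timing. The `clamp_gaps` name, the millisecond tuples, and the 800 ms limit are assumptions, not values from this application.

```python
from typing import List, Tuple

TurnMs = Tuple[str, int, int]  # (speaker, start_ms, end_ms)


def clamp_gaps(turns: List[TurnMs], max_gap_ms: int = 800) -> List[TurnMs]:
    """Shorten silent gaps longer than max_gap_ms; turns must be sorted by start time."""
    adjusted = []
    shift = 0        # total time removed so far
    prev_end = None  # latest end time seen on the original timeline
    for speaker, start, end in turns:
        if prev_end is not None:
            gap = start - prev_end  # negative when the turns overlap
            if gap > max_gap_ms:
                shift += gap - max_gap_ms
        adjusted.append((speaker, start - shift, end - shift))
        prev_end = end if prev_end is None else max(prev_end, end)
    return adjusted


turns = [("A", 0, 2000), ("B", 5000, 7000), ("A", 6500, 9000)]
adjusted = clamp_gaps(turns, max_gap_ms=800)
```

In this example the 3-second silence before B's turn is cut to 800 ms, and A's interruption still begins 500 ms before B finishes.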
[0035] In a second aspect, this application provides a voice interaction method, the method comprising:
[0036] Receive the user's voice;
[0037] The user's voice is input into a pre-trained voice interaction model to obtain feedback voice in the target dialogue scenario; the voice interaction model is obtained by training a preset model based on the target dialogue data described in any implementation of the first aspect; the target dialogue scenario is an interaction scenario between a first speaker and a second speaker.
[0038] Output the feedback voice.
[0039] Optionally, the method is applied to a voice interaction system, and the interaction scenarios between the first speaker and the second speaker include: the first speaker interrupting the second speaker; when the first speaker is a user, the second speaker is the voice interaction system; when the first speaker is the voice interaction system, the second speaker is the user.
[0040] In a third aspect, this application provides a dialogue data construction apparatus, the apparatus comprising:
[0041] An acquisition module, used to acquire initial dialogue data to be processed; the initial dialogue data includes initial dialogue text; the initial dialogue text does not contain dialogue content in the target dialogue scenario; the target dialogue scenario is an interaction scenario between a first speaker and a second speaker;
[0042] A construction module, used to construct, based on the initial dialogue text, target dialogue speech that includes dialogue content in the target dialogue scenario, and to obtain target dialogue data based on the target dialogue speech; the target dialogue data is used to train a preset model to obtain a voice interaction model, and the voice interaction model is used to conduct voice interaction with the user in the target dialogue scenario.
[0043] In a fourth aspect, this application provides an electronic device, including a processor and a memory, the processor being communicatively connected to the memory;
[0044] The memory stores computer-executable instructions;
[0045] The processor executes the computer-executable instructions stored in the memory to implement the method described in any implementation of the first aspect and/or the second aspect.
[0046] In a fifth aspect, this application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described in any implementation of the first aspect and/or the second aspect.
[0047] In a sixth aspect, this application provides a computer program product, including a computer program that, when executed by a processor, implements the method described in either the first aspect or the second aspect.
[0048] In the dialogue data construction method, voice interaction method, apparatus, device, and medium provided in this application, the initial dialogue text need not contain dialogue content in which the first speaker and the second speaker interact in the target dialogue scenario. Target dialogue speech that includes the dialogue content of the target dialogue scenario can be constructed from this initial dialogue text, and target dialogue data can then be obtained based on this target dialogue speech, which ensures that the target dialogue data contains the dialogue content of the target dialogue scenario. Because the target dialogue data includes dialogue content in which the first and second speakers interact in the target dialogue scenario, a voice interaction model obtained by training a preset model on this data is able to handle that interaction scenario. This improves the realism of voice interaction with users and enhances the user experience.
Attached Figure Description
[0049] To more clearly illustrate the technical solutions in this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0050] Figure 1 is a flowchart of a dialogue data construction method provided in this application;
[0051] Figure 2 is a flowchart of a method, provided in this application, for constructing target dialogue speech based on initial dialogue text;
[0052] Figure 3 is a flowchart of a method, provided in this application, for obtaining target dialogue speech based on initial dialogue speech;
[0053] Figure 4 is a flowchart of a voice interaction method provided in this application;
[0054] Figure 5 is a flowchart of another dialogue data construction method provided in this application;
[0055] Figure 6 is a schematic diagram of a dialogue data construction device provided in this application;
[0056] Figure 7 is a schematic diagram of the structure of a voice interaction device provided in this application;
[0057] Figure 8 is a schematic diagram of the hardware structure of an electronic device provided in this application.
[0058] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0059] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0060] The following is a brief explanation of some of the terms and concepts used in this application:
[0061] Large models: Large models are deep learning models with a very large number of parameters, typically hundreds of millions, tens of billions, or even trillions. They are also known as foundation models (FMs). Large models are pre-trained on large-scale unlabeled corpora; the resulting pre-trained models, with hundreds of millions of parameters or more, can adapt to a wide range of downstream tasks and generalize well. Examples include large language models (LLMs) and multi-modal pre-training models.
[0062] In practical applications, large models only require a small number of samples to fine-tune the pre-trained model before they can be applied to different tasks. Large models can be widely used in fields such as Natural Language Processing (NLP) and Computer Vision. Specifically, they can be applied to computer vision tasks such as Visual Question Answering (VQA), Image Captioning (IC), and Image Generation, as well as NLP tasks such as text-based sentiment classification, text summarization, and machine translation. The main application scenarios for large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design.
[0063] Voice assistants and intelligent robots can engage in voice conversations with users using pre-trained voice interaction models. Training these models requires a sample dataset. Currently, the sample datasets used for training voice interaction models primarily include voice dialogue data and the corresponding transcribed text.
[0064] However, using existing voice dialogue data as a sample dataset for model training results in a poorly trained voice interaction model that lacks the ability to handle real-world interactive scenarios, leading to a poor user experience.
[0065] The inventors discovered through research that the reason why existing voice interaction models have the above problems is that the voice dialogue data used to train the model is usually just static dialogue data. As a result, the voice interaction model obtained by training the model based on this voice dialogue data can only meet static dialogue scenarios.
[0066] The aforementioned static dialogue data can refer to dialogue data in which different speakers do not interrupt each other and there is no interaction. For example, in existing voice dialogue data, after speaker 1 asks a question, speaker 2 answers. Before speaker 2 finishes answering, speaker 1 remains silent and does not interrupt speaker 2.
[0067] In view of this, this application proposes a method for constructing dynamic dialogue data. By using the constructed dynamic dialogue data for model training, a voice interaction model that can meet the needs of real-world interactive scenarios can be obtained, thereby improving the user experience. The dynamic dialogue data can be, for example, dialogue data from interactive scenarios including situations where a speaker is interrupted, multi-turn dialogues between different speakers, etc.
[0068] Optionally, the execution entity of the dialogue data construction method provided in this application can be any electronic device with processing capabilities, such as a terminal or a server. Alternatively, in some embodiments, the execution entity of the dialogue data construction method can also be a cloud platform or a cloud server.
[0069] The technical solutions of this application will be described in detail below with reference to specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.
[0070] Figure 1 is a flowchart illustrating a dialogue data construction method provided in this application. As shown in Figure 1, the method may include the following steps:
[0071] S101. Obtain the initial dialogue data to be processed.
[0072] The initial dialogue data mentioned above may include: initial dialogue text. This initial dialogue text does not contain dialogue content from the target dialogue scenario.
[0073] The aforementioned target dialogue scenario can be an interaction between a first speaker and a second speaker. For example, this interaction scenario may include: the first speaker interrupting the second speaker, the first speaker and the second speaker engaging in multiple rounds of dialogue, the second speaker responding when the first speaker pauses to think, and other such interaction scenarios.
[0074] It should be understood that the aforementioned initial dialogue text may, for example, include the dialogue content of at least two speakers. The first speaker may be any one of the at least two speakers, and the second speaker may be any one of the at least two speakers other than the first speaker. For example, taking an initial dialogue text that includes the dialogue content of two speakers (A and B), assuming that the interaction scenario includes: the first speaker interrupting the second speaker, A interrupting B, and B interrupting A, all of these can be considered as target dialogue scenarios.
[0075] Optionally, the initial dialogue text may include real dialogue content between at least two speakers. Alternatively, the initial dialogue text may include dialogue content, outside the target dialogue scenario, generated by a dialogue text generation model or a similar method. The initial dialogue text may also include both.
[0076] The initial dialogue text may include multiple dialogue segments, each of which may include, for example, one round (a question and answer can constitute one round of dialogue) or multiple rounds of dialogue. Taking a multi-round dialogue text as an example, the initial dialogue text could be as follows:
[0077] A: What time does today's meeting start?
[0078] B: The notice mentioned that it would start at 3 p.m.
[0079] A: Have you prepared the documents needed for the meeting?
[0080] B: Not yet, what about you?
[0081] ...
[0082] In some embodiments, the initial dialogue data may be pre-stored in the electronic device. That is, the electronic device can read the initial dialogue data to be processed from its own stored data. Alternatively, the electronic device may also receive initial dialogue data to be processed input by the user. Optionally, the electronic device may obtain the aforementioned initial dialogue data to be processed through, for example, an Application Programming Interface (API) or a Graphical User Interface (GUI).
[0083] S102. Based on the initial dialogue text, construct the target dialogue speech, which includes the dialogue content in the target dialogue scenario.
[0084] Optionally, the dialogue content in the aforementioned target dialogue scenario can be, for example, the dialogue generated when the first speaker interrupts the second speaker, or the dialogue generated during a multi-turn dialogue between the first and second speakers. For example, the dialogue content in this target dialogue scenario could be as follows:
[0085] A: What time does today's meeting start?
[0086] B: The notice mentioned that in the afternoon [interrupted]...
[0087] A: Have you prepared the documents needed for the meeting?
[0088] B: I'm not ready yet. The meeting starts at 3 PM. Don't worry.
[0089] ...
[0090] In some embodiments, the electronic device may, for example, generate text including dialogue content in the target dialogue scenario based on the initial dialogue text described above. Then, the electronic device may convert the text including dialogue content in the target dialogue scenario into speech and use the speech as the target dialogue speech.
[0091] S103. Obtain target dialogue data based on the target dialogue speech.
[0092] In some embodiments, the target dialogue data may include, for example, audio data, text data, etc. The audio data may be, for example, the target dialogue speech. The text data may be, for example, the text corresponding to the dialogue content of the target dialogue speech. Alternatively, in some embodiments, the target dialogue data may only include audio data.
[0093] In some embodiments, the electronic device may further adjust the target dialogue speech, for example by adjusting its duration, and use the adjusted speech as the target dialogue data. Alternatively, the electronic device may use the target dialogue speech directly as the target dialogue data.
[0094] Optionally, the aforementioned target dialogue data can be used, for example, to train a preset model to obtain a voice interaction model. That is, the preset model can be trained based on the target dialogue data to obtain a trained model, which serves as the voice interaction model. Because the aforementioned target dialogue speech includes the dialogue content within the target dialogue scenario, the voice interaction model trained based on the target dialogue data can be used to perform voice interactions with the user within the aforementioned target dialogue scenario.
[0095] It should be understood that this application does not limit how the target dialogue data is used to train the preset model to obtain a voice interaction model. For example, an electronic device (which may be the same electronic device that executes the dialogue data construction method provided in this application, or a different one) can use only the target dialogue data as the sample dataset and train the preset model on it to obtain a voice interaction model. The electronic device can also acquire other dialogue data (e.g., speech whose dialogue content is outside the target dialogue scenario) and mix it with the target dialogue data; that is, the other dialogue data and the target dialogue data together form the sample dataset used to train the preset model and obtain a voice interaction model.
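A minimal sketch of the mixed-dataset option follows. The mixing ratio, the seed, and the `mix_samples` helper are hypothetical; the application leaves the training procedure open.

```python
import random
from typing import List, Sequence


def mix_samples(
    target: Sequence[str],
    other: Sequence[str],
    other_ratio: float = 0.5,  # assumed: other samples per target sample
    seed: int = 0,
) -> List[str]:
    """Combine all target samples with a random subset of other samples, then shuffle."""
    rng = random.Random(seed)
    n_other = min(len(other), int(len(target) * other_ratio))
    mixed = list(target) + rng.sample(list(other), n_other)
    rng.shuffle(mixed)
    return mixed


target_samples = ["t1", "t2", "t3", "t4"]  # target dialogue data (e.g., file ids)
other_samples = ["o1", "o2", "o3"]         # dialogue data outside the target scenario
mixed = mix_samples(target_samples, other_samples, other_ratio=0.5)
```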
[0096] In this embodiment, the initial dialogue text may not contain the dialogue content of the target dialogue scenario where the first speaker and the second speaker interact. Using this initial dialogue text, a target dialogue speech including the dialogue content of the target dialogue scenario can be constructed. Then, target dialogue data can be obtained based on this target dialogue speech, thus ensuring that the target dialogue data includes the dialogue content of the target dialogue scenario. Because the target dialogue data includes the dialogue content of the first speaker and the second speaker interacting within the target dialogue scenario, the subsequent voice interaction model trained on the preset model based on this target dialogue data has the processing capability to adapt to the aforementioned interactive dialogue scenario. Therefore, the realism of the voice interaction with the user is improved, and the user experience is enhanced.
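Steps S101 to S103 can be summarized as a small pipeline skeleton. This is only a sketch under stated assumptions: the text rewriting step and the speech synthesis step are passed in as callables because the application does not name specific models, and `TargetDialogueData`, `construct_dialogue_data`, and the stub functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TargetDialogueData:
    audio: bytes                # target dialogue speech
    text: Optional[str] = None  # corresponding text; may be omitted


def construct_dialogue_data(
    initial_text: str,                  # S101: initial dialogue text already acquired
    rewrite: Callable[[str], str],      # assumed: text model adding target-scenario content
    synthesize: Callable[[str], bytes], # assumed: text-to-speech engine
    keep_text: bool = True,
) -> TargetDialogueData:
    target_text = rewrite(initial_text)      # S102: build text for the target scenario
    target_speech = synthesize(target_text)  # S102: convert that text to speech
    return TargetDialogueData(               # S103: package the target dialogue data
        audio=target_speech,
        text=target_text if keep_text else None,
    )


# Stand-in callables so the sketch runs without any model or TTS engine.
def fake_rewrite(text: str) -> str:
    return text + " [interruption]"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


data = construct_dialogue_data("B: The notice mentioned that", fake_rewrite, fake_tts)
```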
[0097] The following takes the interaction scenario in which the first speaker interrupts the second speaker as an example to explain in detail how the electronic device constructs, based on the initial dialogue text, target dialogue speech that includes the dialogue content of the target dialogue scenario:
[0098] Figure 2 is a flowchart illustrating a method, provided in this application, for constructing target dialogue speech based on initial dialogue text. As shown in Figure 2, as one possible implementation, step S102 above may include, for example, the following steps:
[0099] S201. Insert an interruption marker at the target position of the initial dialogue text to obtain dialogue text including the interruption marker.
[0100] Optionally, the aforementioned target position is the position at which the first speaker interrupts the second speaker. It should be understood that this application does not limit where this target position lies within the initial dialogue text. That is, the first speaker can interrupt the second speaker at any point.
[0101] The interruption marker mentioned above can be used to indicate that the first speaker interrupts the second speaker at the target position.
[0102] In some embodiments, for any round of dialogue between any two speakers in the initial dialogue text, the electronic device can, for example, insert an interruption marker at the target position of that round of dialogue to obtain the dialogue text including the interruption marker. That is, the dialogue text including the interruption marker can be the result of adding the interruption marker to that round of dialogue. Alternatively, the electronic device can delete the dialogue content that follows the target position in the second speaker's speech in that round of dialogue and insert an interruption marker at that target position to obtain the dialogue text including the interruption marker. Continuing the above example, the dialogue text including the interruption marker could be: "The notice mentioned that it starts at 3 PM [interruption marker]".
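The two variants described above (keeping or deleting the content after the target position) can be sketched with plain string handling. The marker string, the word-level target position, and the `insert_marker` helper are illustrative assumptions; the application itself leaves the choice of position open.

```python
MARKER = "[interruption]"  # assumed marker string


def insert_marker(utterance: str, word_index: int, truncate: bool = False) -> str:
    """Insert MARKER before the word at word_index; optionally drop everything after it."""
    words = utterance.split()
    if not 0 < word_index < len(words):
        # Assumed policy: never interrupt before the first word or after the last one.
        raise ValueError("target position must fall inside the utterance")
    head = words[:word_index]
    tail = [] if truncate else words[word_index:]
    return " ".join(head + [MARKER] + tail)


utterance = "The notice mentioned that it would start at 3 p.m."
kept = insert_marker(utterance, 4)                 # marker added, content retained
cut = insert_marker(utterance, 4, truncate=True)   # content after the marker deleted
```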
[0103] In some embodiments, the electronic device may, for example, use an interruption marker processing model to insert an interruption marker at the target position of the initial dialogue text.
[0104] For example, the electronic device may first acquire the first prompt content, and then input the first prompt content into the interruption marker processing model to obtain dialogue text including the interruption marker. This interruption marker processing model can be, for example, any existing large language model.
[0105] The aforementioned first prompt content may include, for example, the initial dialogue text, the first instruction information, and at least one of the following: a dialogue text instance including an interruption marker, and the conditions that the target position must meet.
[0106] The first instruction information can be used to instruct the insertion of an interruption marker into the initial dialogue text. It lets the interruption marker processing model determine that its task is to insert an interruption marker into the initial dialogue text, which improves the accuracy with which the model inserts the marker at the target position and produces the dialogue text including the interruption marker.
[0107] The dialogue text instance including interruption markers enables the interruption marker processing model to generate dialogue text that follows the pattern shown in the instance, which improves the accuracy of the dialogue text including interruption markers obtained from the model.
[0108] The conditions that the target position must meet constrain how the target position is chosen. For example, the conditions may require diverse interruption positions, forbid using a position where the speaker is about to finish speaking as the target position, and limit how often interruption markers appear. With these conditions, the interruption marker processing model can determine an accurate target position and add the interruption marker there, which further improves the accuracy of marker insertion.
[0109] In some embodiments, the first prompt content may further include, for example, a format requirement for the dialogue text including interruption markers. This format requirement allows the electronic device to obtain dialogue text with interruption markers in a uniform format from the interruption marker processing model, which improves the efficiency of subsequent dialogue data construction based on that text.
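Because model output does not always follow a requested format, a simple check can be run before the text is used further. The sketch below assumes speaker labels "A" and "B", the marker string "[interruption]", and a cap on the number of markers; all three are illustrative assumptions, as is the `check_output` helper.

```python
import re
from typing import List

MARKER = "[interruption]"                # assumed marker string
LINE_RE = re.compile(r"^(A|B): \S.*$")   # assumed "A: ..." / "B: ..." line format


def check_output(text: str, max_interruptions: int = 2) -> List[str]:
    """Return a list of problems found in the model output; an empty list means it passed."""
    problems = []
    lines = [line for line in text.splitlines() if line.strip()]
    for number, line in enumerate(lines, 1):
        if not LINE_RE.match(line):
            problems.append(f"line {number}: not in 'A: ...' / 'B: ...' format")
    count = text.count(MARKER)
    if count == 0:
        problems.append("no interruption marker found")
    elif count > max_interruptions:
        problems.append(f"{count} interruption markers exceed the limit of {max_interruptions}")
    return problems


good = "A: We would like to [interruption]...\nB: Sorry, is the budget sufficient?"
bad = "Here is the dialogue:\nA: hello"
good_problems = check_output(good)
bad_problems = check_output(bad)
```

Outputs that fail the check could, for instance, be discarded or regenerated; that policy is outside the scope of this sketch.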
[0110] For example, when the first prompt content includes the initial dialogue text, the first instruction information, a dialogue text instance including an interruption marker, the conditions that the target position must meet, and the format requirement for the dialogue text including the interruption marker, the first prompt content can be as follows:
[0111] Please adapt the dialogue (i.e., the initial dialogue text mentioned above, which should be understood not to be shown in this first prompt content example) into a version that includes the interruption scenario (i.e., the first instruction information) according to the following requirements and dialogue content.
[0112] Strictly adhere to the following requirements (i.e., the conditions that the target location must meet):
[0113] 1. Mark interruption points: When the other person hasn't finished speaking or hasn't expressed key information, choose an appropriate time to interrupt, ensuring a variety of interruption points. Simply add an [interruption] marker after the word being interrupted. Avoid interrupting when the other person is about to finish.
[0114] 2. Interruptions should be highly relevant to the context and topic of the conversation and should help facilitate deeper dialogue or clarify key issues. Ensure that the transition of the interruption is natural.
[0115] 3. After an interruption, the interruptor is unaware of what the other person initially didn't finish saying. The unfinished content should be unrelated to the interruption and should be mentioned again depending on the context to ensure logical consistency and realism.
[0116] 4. Interruptions should not exceed two times to ensure a natural and smooth conversation.
[0117] 5. Ensure that each interruption is meaningful and valuable, and that the content is not repetitive. Maintain consistency between the content and tone of voice and the character's identity.
[0118] Example (i.e., a dialogue text instance including interruption markers):
[0119] A: At yesterday's meeting, we discussed the new project plan and decided to begin implementation next quarter. The plan mainly involves resource allocation and team collaboration. We would like to [interruption]…
[0120] B: Excuse me for bothering you, but I had a question on my mind. Is our current budget sufficient to support the implementation of this plan?
[0121] A: This plan does require more funding, but we've also considered some additional funding sources. However, I haven't yet mentioned our use of automation tools [interruption]...
[0122] B: Automation tools are certainly important, but I'm more concerned about whether our team's current skills are sufficient to meet the requirements of these tools.
[0123] Output requirements: Only output the modified dialogue content that includes the interruptions; no other content should be output. Strictly adhere to the following format (i.e., the format requirements of the dialogue text including the interruption marker):
[0124] A: [Dialogue content]
[0125] B: [Dialogue content]
[0126] Optionally, the electronic device may receive the first prompt content input by the user, for example, through a GUI or API. Alternatively, if the initial dialogue text is pre-stored in the electronic device, the electronic device may retrieve the initial dialogue text from its own stored data and receive at least one of the following from the user: the first instruction information, a dialogue text instance including an interruption marker, the conditions that the target location must meet, and the format requirements of the dialogue text including the interruption marker. The electronic device then concatenates the initial dialogue text with the received content to obtain the first prompt content.
[0127] Optionally, the electronic device can, for example, invoke the interruption marker processing model through its calling interface and input the aforementioned first prompt content into the interruption marker processing model to obtain dialogue text including the interruption marker.
[0128] Using the above method, dialogue text including interruption markers can be obtained through the first prompt content and the interruption marker processing model. Because the first prompt content can be changed flexibly, and the resulting dialogue text satisfies whatever conditions the first prompt content describes, this improves the flexibility of obtaining the dialogue text including interruption markers.
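The concatenation described above can be sketched as follows. This is a minimal illustration only: the function name `build_first_prompt`, its parameter names, and the section headings it emits are assumptions made for this sketch and are not specified by the application.

```python
from typing import Optional, Sequence


def build_first_prompt(
    initial_dialogue: str,
    instruction: Optional[str] = None,
    requirements: Sequence[str] = (),
    example: Optional[str] = None,
    output_format: Optional[str] = None,
) -> str:
    """Concatenate the initial dialogue text with whichever optional parts were supplied."""
    parts = []
    if instruction:
        parts.append(instruction)
    if requirements:
        # Number the conditions that the target location must meet.
        numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(requirements, 1))
        parts.append("Strictly adhere to the following requirements:\n" + numbered)
    if example:
        parts.append("Example:\n" + example)
    if output_format:
        parts.append("Output requirements:\n" + output_format)
    # The initial dialogue text is always present (hypothetical placement: last).
    parts.append("Dialogue:\n" + initial_dialogue)
    return "\n\n".join(parts)
```

Parts that the user does not supply are simply omitted, which mirrors the "at least one of the following" wording above.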
[0129] In some embodiments, the electronic device may first randomly select a preset number of text segments (each text segment may include at least one round of dialogue) from the initial dialogue text. Then, for each of the preset number of text segments, the electronic device may, for example, randomly select a position within that text segment as a target position and insert an interruption marker at that target position to obtain dialogue text including the interruption marker. This method improves the randomness of the interruption marker's appearance in the initial dialogue text, thereby increasing the randomness of the dialogue text including the interruption marker, and thus enhancing the richness of the target dialogue data obtained based on the dialogue text including the interruption marker.
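A minimal sketch of this random insertion is given below. It assumes, for simplicity, that each text segment is a single turn represented as a `(speaker, text)` pair and that a position is a word boundary; the marker string and the function name `insert_random_markers` are illustrative choices, not part of the application.

```python
import random

MARKER = "[interruption]"  # illustrative marker string


def insert_random_markers(turns, num_segments, rng=None):
    """Insert MARKER at a random word boundary in `num_segments` randomly chosen turns.

    `turns` is a list of (speaker, text) pairs. Turns with fewer than two
    words are skipped so the marker never lands at the very end of a turn.
    """
    rng = rng or random.Random()
    candidates = [i for i, (_, text) in enumerate(turns) if len(text.split()) >= 2]
    chosen = rng.sample(candidates, min(num_segments, len(candidates)))
    result = list(turns)
    for i in chosen:
        speaker, text = result[i]
        words = text.split()
        # Marker goes after word number `cut`; at least one word follows it.
        cut = rng.randint(1, len(words) - 1)
        result[i] = (speaker, " ".join(words[:cut] + [MARKER] + words[cut:]))
    return result
```

Passing a seeded `random.Random` instance makes the selection reproducible, which can help when comparing constructed datasets.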
[0130] S202. Based on the dialogue text including the interruption marker, generate target dialogue text in which the first speaker interrupts the second speaker at the target location.
[0131] It should be understood that the first speaker can be either of the two speakers in the dialogue, while the second speaker is the other speaker besides the first speaker. For example, assuming the target dialogue text is the content of a conversation between speaker A and speaker B, then the target dialogue text may include the content of speaker A interrupting speaker B's conversation. Alternatively, the target dialogue text may include the content of speaker B interrupting speaker A's conversation. Furthermore, the target dialogue text may also include the content of speaker A interrupting speaker B's conversation, and the content of speaker B interrupting speaker A's conversation.
[0132] In some embodiments, the electronic device can, for example, use a text generation model to generate, based on the dialogue text including interruption markers, target dialogue text in which the first speaker interrupts the second speaker at the target location. This text generation model can be, for example, any existing large language model. In some embodiments, the text generation model and the aforementioned interruption marker processing model can be the same large language model, or they can be different large language models.
[0133] Optionally, the electronic device may first acquire the second prompt content, and then input the second prompt content into the text generation model to obtain the target dialogue text in which the first speaker interrupts the second speaker at the target location.
[0134] The second prompt content may include, for example, the dialogue text including the interruption marker, the second instruction information, and at least one of the following: a dialogue text instance in the target dialogue scenario, and the conditions that the target dialogue text must meet.
[0135] The aforementioned second instruction information can be used to instruct the generation, based on the dialogue text including the interruption marker, of target dialogue text in which the first speaker interrupts the second speaker at the target location. This second instruction information tells the text generation model which task it is to perform, thus improving the accuracy of the target dialogue text that the text generation model produces.
[0136] The dialogue text instance in the above target dialogue scenario enables the text generation model to generate target dialogue text that conforms to the content shown in the dialogue text instance, thus improving the accuracy of the target dialogue text obtained by the text generation model.
[0137] The conditions that the target dialogue text must meet can, for example, be used to constrain the fluency and naturalness of the dialogue content included in the target dialogue text. Given these conditions, the text generation model can generate target dialogue text that satisfies them, thereby improving the accuracy of the target dialogue text.
[0138] In some embodiments, the second prompt may also include, for example, the format requirements of the target dialogue text. By specifying these format requirements, the electronic device can obtain target dialogue text of the same format through a text generation model, thereby improving the efficiency of subsequent dialogue data construction based on target dialogue text of the same format.
[0139] For example, taking the second prompt content as including: the dialogue text with the interruption marker, the second instruction information, an instance of the dialogue text in the target dialogue scenario, and the conditions that the target dialogue text must meet, the second prompt content can be as follows:
[0140] Please complete the sentences that include the interruption marker based on the dialogue content to ensure the naturalness and logical coherence of the dialogue (i.e., the second instruction information above).
[0141] Requirements (i.e., the conditions that the target dialogue text must meet):
[0142] 1. Fill in the blanks after the interruption markers appropriately to make the dialogue flow naturally. The continuation must be meaningful.
[0143] 2. Ensure that the continued dialogue is natural and fluent, and conforms to the actual scenario and interpersonal communication norms.
[0144] 3. While continuing, retain the [interruption] marker and treat subsequent content as unspoken. The interruptor is unaware of this content.
[0145] Example (i.e., a dialogue text instance in the target dialogue scenario):
[0146] A: At yesterday's meeting, we discussed the new project plan and decided to begin implementation next quarter. We'd like to [interruption]...
[0147] B: Excuse me for bothering you, but I had a question on my mind. Is our current budget sufficient to support the implementation of this plan?
[0148] Continued:
[0149] A: We hope to [interruption] this process to improve resource utilization and manage our resources more effectively.
[0150] B: Excuse me for bothering you, but I had a question on my mind. Is our current budget sufficient to support the implementation of this plan?
[0151] Output requirements (i.e., the format requirements of the target dialogue text): Only output the content for continuing the dialogue, strictly adhering to the following format:
[0152] A: [Dialogue content]
[0153] B: [Dialogue content]
[0154] Optionally, the electronic device may receive the second prompt content input by the user, for example, through a GUI or API. Alternatively, if the electronic device stored the dialogue text including the interruption marker after obtaining it, the electronic device may retrieve that dialogue text from its own stored data and receive at least one of the following from the user: the second instruction information, a dialogue text instance in the target dialogue scenario, and the conditions that the target dialogue text must meet. The electronic device then concatenates the dialogue text including the interruption marker with the received content to obtain the second prompt content.
[0155] Using the above method, the target dialogue text in which the first speaker interrupts the second speaker at the target location can be obtained based on the second prompt content and the text generation model. Because changing the second prompt content lets the text generation model produce target dialogue text that meets different conditions, this improves the flexibility of obtaining the target dialogue text.
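Because the target dialogue text keeps the interruption marker and treats the content after it as unspoken, a later processing step may need to separate the two parts. The sketch below shows one hypothetical way to parse text in the "A: [Dialogue content]" format; the `Turn` type, the marker string, and the function name are assumptions for illustration only.

```python
import re
from typing import List, NamedTuple

MARKER = "[interruption]"  # illustrative marker string
LINE_PATTERN = re.compile(r"^\s*(\w+):\s*(.+)$")


class Turn(NamedTuple):
    speaker: str
    spoken: str      # content before the marker
    unspoken: str    # content after the marker (never heard by the interrupter)
    interrupted: bool


def parse_target_dialogue(text: str) -> List[Turn]:
    """Split each 'Speaker: content' line at the interruption marker."""
    turns = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not follow the expected format
        speaker, content = match.groups()
        spoken, sep, unspoken = content.partition(MARKER)
        turns.append(Turn(speaker, spoken.strip(), unspoken.strip(), bool(sep)))
    return turns
```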
[0156] In some embodiments, the electronic device may further input the dialogue text including the interruption marker into a pre-trained text rewriting model to obtain target dialogue text in which the first speaker interrupts the second speaker at a target location. The text rewriting model may, for example, be trained based on sample dialogue text and sample labels. The preset location of the sample dialogue text may include an interruption marker. The sample label may be the dialogue content in which the first speaker interrupts the second speaker at that preset location.
[0157] S203. Perform speech conversion on the target dialogue text to obtain the target dialogue speech.
[0158] For example, the electronic device can use any existing Text-to-Speech (TTS) model (a technique that uses a model and specified rules to convert text into speech) to convert the target dialogue text into speech, thereby obtaining the target dialogue speech. For example, the electronic device can use the TTS model to ensure that the speaker's voice features in terms of pitch, speech rate, and timbre in the target dialogue speech are distinct, and that the voice features of different speakers (or roles) are different, thereby improving the realism of the target dialogue speech.
[0159] In this embodiment, interruption markers are inserted into initial dialogue text whose dialogue content lacks the target dialogue scenario, yielding dialogue text including the interruption markers. Based on this dialogue text, target dialogue text is generated in which the first speaker interrupts the second speaker at the target location, so that dialogue content without the target dialogue scenario is converted into dialogue content with it. Performing speech conversion on this target dialogue text then yields target dialogue speech with the target dialogue scenario, laying the foundation for constructing target dialogue data that can be used for model training and enabling the model to adapt to target dialogue scenarios with interruptions.
[0160] As another possible implementation, the electronic device can also acquire third prompt content, then input the third prompt content into a large language model to obtain the target dialogue text with the target dialogue scenario, and then perform speech conversion on the target dialogue text to obtain the target dialogue speech.
[0161] The third prompt content may include, for example, the initial dialogue text and third instruction information. This third instruction information may be used to instruct the large language model to generate, based on the semantics of the initial dialogue text, dialogue content that contains interruption interactions. This dialogue content can then serve as the target dialogue text.
[0162] In some embodiments, the initial dialogue data may further include audio data, such as initial dialogue voice. In this implementation, the electronic device may also acquire target dialogue data based on the initial dialogue voice to improve the authenticity and richness of the target dialogue data.
[0163] Optionally, the initial dialogue voice can be, for example, a multi-turn dialogue voice. Alternatively, the aforementioned initial dialogue text can be a multi-turn dialogue text. Alternatively, both the initial dialogue voice and the initial dialogue text may be multi-turn. In this implementation, the interaction scenario between the first speaker and the second speaker may further include, for example, an interaction scenario where the first speaker and the second speaker engage in a multi-turn dialogue.
[0164] When the initial dialogue voice is a multi-turn dialogue voice and / or the initial dialogue text is a multi-turn dialogue text, the target dialogue voice determined based on the multi-turn dialogue voice and / or multi-turn dialogue text can also have multi-turn dialogue content. Therefore, the voice interaction model trained based on the target dialogue data can have the ability to process multi-turn interactions, which can further improve the realism of the interaction between the voice interaction model and the user, thereby further improving the user experience.
[0165] Alternatively, in some embodiments, the initial dialogue audio may also contain single-turn dialogue segments. The initial dialogue text may also contain, for example, single-turn dialogue content.
[0166] Take as an example the case where the initial dialogue data also includes the initial dialogue voice, and the interaction scenario between the first speaker and the second speaker includes the first speaker interrupting the second speaker. Figure 3 is a flowchart of a method for obtaining target dialogue speech based on initial dialogue speech provided in this application. As shown in Figure 3, as one possible implementation, before obtaining the target dialogue data based on the target dialogue voice, the electronic device may, for example, perform the following steps:
[0167] S301. Obtain overlapping speech in the initial dialogue speech where there is overlap between the speakers.
[0168] Speaker voice overlap occurs when a second speaker is speaking at the same time as the first speaker. For example, if the second speaker interrupts the first speaker, it will cause voice overlap between the first and second speakers. Alternatively, if the second speaker makes a sound (such as "uh-huh" or "oh") while the first speaker is speaking, it will also cause voice overlap between the first and second speakers.
[0169] In some embodiments, the aforementioned dialogue overlap speech may also be referred to as dialogue overlap speech segments.
[0170] In some embodiments, the electronic device may first perform speaker separation on the initial dialogue speech to obtain speaker separation results. These speaker separation results can be used to characterize the speech segments corresponding to each speaker in the initial dialogue speech.
[0171] For example, an electronic device can input the initial dialogue speech described above into a speaker separation model to obtain speaker separation results. Optionally, the speaker separation model can be any existing speaker separation model, and this application does not limit how the speaker separation model performs speaker separation.
[0172] Then, based on the speaker separation result, the electronic device can determine the overlapping speech in the dialogue where there is speaker speech overlap.
[0173] For example, taking the above-mentioned speaker separation result as a dual-track audio file, in the dual-track audio file, each speaker's voice occupies one track, and the voices are arranged on the track in chronological order. Then, the electronic device can, for example, regard the audio segments of different speakers in the dual-track audio file that overlap in time as dialogue overlapping audio with overlapping speaker voices.
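The time-overlap check on the two tracks can be sketched as follows. This is a minimal illustration that assumes each track has already been reduced to a chronologically sorted list of non-overlapping `(start, end)` speech intervals in seconds; how those intervals are obtained from the dual-track audio file is outside the sketch, and the function name is hypothetical.

```python
def find_overlaps(track_a, track_b):
    """Return the time intervals during which both speakers are speaking.

    Each track is a sorted list of non-overlapping (start, end) tuples.
    A two-pointer sweep visits every pair of intervals that can intersect.
    """
    overlaps = []
    i = j = 0
    while i < len(track_a) and j < len(track_b):
        start = max(track_a[i][0], track_b[j][0])
        end = min(track_a[i][1], track_b[j][1])
        if start < end:  # intervals that merely touch do not count as overlap
            overlaps.append((start, end))
        # Advance whichever interval finishes first.
        if track_a[i][1] <= track_b[j][1]:
            i += 1
        else:
            j += 1
    return overlaps
```

For example, with speaker A active during 0.0 to 3.0 s and 5.0 to 8.0 s and speaker B active during 2.5 to 4.0 s and 7.0 to 9.0 s, the function returns `[(2.5, 3.0), (7.0, 8.0)]`.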
[0174] Alternatively, the speaker separation model described above can also be a model with overlapping speech detection capabilities, which allows electronic devices to determine the overlapping speech in the dialogue based on the speaker separation results.
[0175] Using the above method, the speakers in the initial dialogue can be separated, thus separating the speech content of different speakers in the initial dialogue. This allows for the identification of overlapping dialogue speech, laying the foundation for obtaining the target dialogue speech in the interrupted target dialogue scenario based on the overlapping dialogue speech.
[0176] In some embodiments, the electronic device may further input the aforementioned initial dialogue speech into a pre-trained overlapping speech segment recognition model to determine whether there are overlapping speech segments in the initial dialogue speech. For example, the pre-trained overlapping speech segment recognition model may be trained on a sample dataset that includes sample dialogue speech and sample labels. The sample labels may indicate the speech segments in the sample dialogue speech that contain overlapping speech. A model trained on such sample dialogue speech and sample labels can then determine the overlapping speech segments in the input initial dialogue speech.
[0177] S302. Obtain the target dialogue speech based on overlapping dialogue speech.
[0178] As mentioned above, overlapping speech can occur when the first speaker interrupts the second speaker, or when the second speaker makes brief interjections such as "uh-huh" or "oh" while the first speaker is speaking. Therefore, the electronic device can optionally filter out overlapping speech from non-interruption scenarios.
[0179] The inventors discovered through research that, in overlapping dialogue speech, consistent timbre between different speakers typically appears when one speaker utters interjections such as the aforementioned "uh-huh" and "oh," whereas inconsistent timbre usually indicates an interruption between different speakers. Therefore, in some embodiments, the electronic device can, for example, filter out overlapping speech from non-interruption scenarios based on the timbre similarity between different speakers.
[0180] Optionally, the electronic device may first obtain the timbre similarity of different speakers in the overlapping speech of the dialogue.
[0181] Optionally, the electronic device can obtain the above-mentioned timbre similarity by referring to any existing method for obtaining the timbre similarity between different speakers, such as using a voice consistency model to detect the cosine similarity between speakers as the timbre similarity between different speakers.
[0182] Then, the electronic device can, for example, filter out speech segments with a timbre similarity greater than or equal to a preset similarity threshold to obtain filtered overlapping speech.
[0183] When the similarity of the timbre of different speakers is greater than or equal to the preset similarity threshold, it means that the speech segment is a dialogue overlap produced by one speaker echoing another speaker through the sound of interjections, etc., without interrupting the other speaker. Therefore, the speech segment can be filtered out.
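The threshold-based filtering can be sketched as follows. The sketch assumes that each overlapping segment already carries one timbre embedding per speaker (the dictionary keys `emb_a` and `emb_b` are hypothetical, as is the default threshold of 0.8); the model that produces the embeddings is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_backchannels(segments, threshold=0.8):
    """Keep only segments whose timbre similarity is below the preset threshold.

    Segments at or above the threshold are treated as interjection-style
    overlap (no interruption) and are filtered out.
    """
    return [
        seg for seg in segments
        if cosine_similarity(seg["emb_a"], seg["emb_b"]) < threshold
    ]
```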
[0184] Optionally, the aforementioned preset similarity threshold may be, for example, pre-stored in the electronic device.
[0185] Alternatively, the electronic device can also acquire preset interjection text. The electronic device can, for example, convert the overlapping speech into text to obtain overlapping speech text. Then, the electronic device can, for example, delete the overlapping speech text containing the aforementioned interjection text to obtain filtered overlapping speech text. Then, the electronic device can, for example, use a text-to-speech conversion model to convert the filtered overlapping speech text into speech to obtain the aforementioned filtered overlapping speech.
[0186] Electronic devices can, for example, obtain the target dialogue speech based on the filtered overlapping dialogue speech.
[0187] Optionally, the electronic device may, for example, use the filtered overlapping dialogue speech as the target dialogue speech.
[0188] Alternatively, the electronic device may also determine, based on the filtered overlapping dialogue speech, the speech segments in the initial dialogue speech that meet preset conditions, and use them as the target dialogue speech.
[0189] The aforementioned preset conditions may include, for example, at least one of the following: the speech segment includes the filtered overlapping dialogue speech, or the duration of the filtered overlapping dialogue speech within the speech segment is greater than or equal to a preset duration.
[0190] Taking as an example the preset condition that the speech segment includes the filtered overlapping dialogue speech, the electronic device can, for instance, use the filtered overlapping dialogue speech as a starting point and take, from the initial dialogue speech, the speech within a first duration before the starting point together with the speech within a second duration after the starting point as the target dialogue speech, so that the target dialogue speech covers the filtered overlapping dialogue speech. The first and second durations may be equal or unequal; this application does not impose any limitation on this.
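One way to compute such a window is sketched below. It assumes the overlap is given as a `(start, end)` pair in seconds, that the first duration is measured back from the overlap's start and the second duration forward from its end, and that the window is clamped to the bounds of the initial dialogue speech. The function name and default durations are illustrative.

```python
def context_window(overlap, total_duration, before=2.0, after=2.0):
    """Return (clip_start, clip_end) covering the overlap plus surrounding context.

    `before` and `after` play the role of the first and second durations
    and may differ from each other.
    """
    start, end = overlap
    clip_start = max(0.0, start - before)
    clip_end = min(total_duration, end + after)
    return clip_start, clip_end
```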
[0191] Alternatively, the electronic device may use the filtered overlapping dialogue speech as a starting point to determine, from the initial dialogue speech, the first silent segment closest to the starting point before the starting point and the second silent segment closest to the starting point after the starting point, and use the speech segment between the first silent segment and the second silent segment as the target dialogue speech.
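The silence-bounded variant can be sketched in the same style. The sketch assumes the silent segments of the initial dialogue speech are already available as `(start, end)` pairs in seconds (for example, from a voice activity detector, which is not shown) and falls back to the beginning or end of the recording when no silent segment exists on one side; that fallback is an assumption of this sketch.

```python
def silence_bounded_clip(overlap, silences, total_duration):
    """Return the span between the nearest silent segments around the overlap.

    The clip starts where the closest preceding silence ends and stops
    where the closest following silence begins.
    """
    start, end = overlap
    preceding = [s for s in silences if s[1] <= start]
    following = [s for s in silences if s[0] >= end]
    clip_start = max(s[1] for s in preceding) if preceding else 0.0
    clip_end = min(s[0] for s in following) if following else total_duration
    return clip_start, clip_end
```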
[0192] Alternatively, the electronic device may use the filtered overlapping dialogue speech as a starting point and determine, from the initial dialogue speech, a first position before the starting point and a second position after the starting point. It may then find the first silence point closest to the first position and the second silence point closest to the second position, and use the speech segment between the first silence point and the second silence point as the target dialogue speech.
[0193] Taking as an example the preset condition that the duration of the filtered overlapping dialogue speech in the speech segment is greater than or equal to a preset duration, the preset duration could be, for instance, 1 second. Requiring the filtered overlapping dialogue speech in the target dialogue speech to last at least the preset duration further filters out overlapping speech from non-interruption scenarios, further improving the accuracy with which the target dialogue speech captures dialogue content of the target dialogue scenario.
[0194] In this embodiment, the target dialogue voice can be obtained based on the initial dialogue voice, thus increasing the richness of the content included in the target dialogue voice, and consequently improving the accuracy of the voice interaction model obtained by training the model based on the target dialogue voice. By first identifying dialogue overlap voices from the initial dialogue voice, preliminary screening of dialogue content including interruption scenarios is achieved, allowing subsequent acquisition of the target dialogue voice based on the dialogue overlap voices without needing to process the entire initial dialogue voice, thereby improving the efficiency of target dialogue voice acquisition.
[0195] The following section provides a detailed explanation of how electronic devices obtain target dialogue data based on the target dialogue voice:
[0196] As one possible implementation, the electronic device can, for example, first adjust the duration of the target dialogue speech and the time interval between different speakers based on the timestamps corresponding to the speech content of each speaker in the target dialogue speech, to obtain the adjusted target dialogue speech.
[0197] Optionally, the timestamps mentioned above can be used to indicate the start and end times of a speaker's speech. For example, an electronic device can obtain the timestamps corresponding to the speech content of each speaker in the target dialogue speech through the aforementioned TTS model.
[0198] For example, electronic devices can, based on the timestamp and random functions, randomly adjust the duration of the target dialogue voice (e.g., by cropping voice segments) and the time interval between different speakers, in order to improve the randomness and realism of the adjusted target dialogue voice.
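A minimal sketch of such a timestamp-based random adjustment follows. It assumes each turn is a `(speaker, start, end)` triple in seconds, crops a random amount from each turn, and rebuilds the timeline turn by turn with a random gap so that the original order is preserved. The parameter names and default ranges are illustrative; this sketch keeps gaps positive, whereas a variant that allows negative gaps would produce overlapping speech.

```python
import random


def jitter_timeline(turns, rng, max_crop=0.3, gap_range=(0.1, 0.6)):
    """Randomly shorten each turn and randomize the gap between consecutive turns.

    Returns new (speaker, start, end) triples in the same order as the input.
    """
    adjusted = []
    cursor = 0.0
    for speaker, start, end in turns:
        duration = end - start
        # Never crop more than half of a turn.
        crop = rng.uniform(0.0, min(max_crop, duration / 2))
        new_duration = duration - crop
        if adjusted:
            cursor += rng.uniform(*gap_range)  # random pause before this turn
        adjusted.append((speaker, cursor, cursor + new_duration))
        cursor += new_duration
    return adjusted
```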
[0199] Then, the electronic device can, for example, obtain the aforementioned target dialogue data based on the adjusted target dialogue voice.
[0200] Optionally, the electronic device can directly use the adjusted target dialogue voice as target dialogue data.
[0201] Alternatively, the electronic device may also include the text corresponding to the adjusted target dialogue speech, as well as the adjusted target dialogue speech, as the aforementioned target dialogue data. Furthermore, the target dialogue data may also include, for example, timestamps corresponding to the speech content of each speaker in the adjusted target dialogue speech.
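One possible shape for such a target dialogue data record is sketched below. The class names, field names, and the choice of a JSON line as the serialized form are all assumptions for illustration; the application does not prescribe a storage format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SpeakerSpan:
    speaker: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class TargetDialogueSample:
    audio_path: str                      # adjusted target dialogue speech
    text: str = ""                       # optional corresponding text
    spans: List[SpeakerSpan] = field(default_factory=list)  # optional timestamps


def to_json_line(sample: TargetDialogueSample) -> str:
    """Serialize one sample as a single JSON line."""
    return json.dumps(asdict(sample), ensure_ascii=False)
```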
[0202] In this embodiment, the target dialogue speech is adjusted using the timestamps corresponding to the speech content of each speaker in the target dialogue speech. This ensures that the temporal sequence is not disordered during the adjustment and improves the accuracy of the adjusted target dialogue speech. Obtaining the target dialogue data based on the adjusted target dialogue speech also improves the standardization of the target dialogue data, thus improving the efficiency of subsequent model training based on the target dialogue data.
[0203] The voice interaction model trained based on the aforementioned target dialogue data can be used for voice interaction. For example, Figure 4 is a flowchart of a voice interaction method provided in this application.
[0204] Optionally, the entity executing the voice interaction method provided in this application can be any electronic device with processing capabilities, such as a terminal or a server. This electronic device can be the same as or different from the aforementioned electronic device used to execute the dialogue data construction method. For example, this electronic device can be a mobile phone, a computer, or an intelligent interactive robot.
[0205] In some embodiments, the entity executing the voice interaction method may also be a voice interaction system. This voice interaction system may include, for example, a front-end application and a back-end server. Alternatively, the entity executing the voice interaction method may also be a target application with voice interaction capabilities. This target application may be deployed on user terminals such as mobile phones or computers. For example, the target application may be a smart voice assistant or other downloadable and installable applications.
[0206] The following takes a voice interaction system as the executing entity to illustrate the voice interaction method. As shown in Figure 4, the method may include the following steps:
[0207] S401. Receive user voice.
[0208] For example, the voice interaction system can receive the user's voice input via a front-end application. Taking the front-end application deployed on a mobile phone as an example, the voice interaction system can receive the user's voice via the user's mobile phone's microphone.
[0209] Alternatively, the implementation of receiving user voice in voice interaction can refer to any existing voice interaction method, which will not be elaborated here.
[0210] S402. Input the user's voice into the pre-trained voice interaction model to obtain the feedback voice in the target dialogue scenario.
[0211] The target dialogue scenario is the interaction between the first speaker and the second speaker. The voice interaction model is obtained by training a preset model based on the target dialogue data described in any of the foregoing embodiments.
[0212] It should be understood that the sample dataset used to train the voice interaction model may include the target dialogue data described in any of the foregoing embodiments, so that the voice interaction model has the ability to process target dialogue scenarios. Alternatively, in some embodiments, the sample dataset may also include, for example, dialogue data containing dialogue content in non-target dialogue scenarios, so that the voice interaction model trained based on the sample dataset can have the ability to process dialogues in non-target dialogue scenarios, thereby improving the realism of voice interaction and enhancing user experience.
[0213] Optionally, taking the interaction scenario between the first speaker and the second speaker as an example, where the first speaker interrupts the second speaker, if the first speaker is a user, the second speaker can be the voice interaction system. If the first speaker is the voice interaction system, the second speaker can be the user. The feedback voice in the above target dialogue scenario can be determined by the voice interaction system when the user interrupts its voice output. Alternatively, the feedback voice in the target dialogue scenario can also be a feedback voice determined by the voice interaction system based on the user's voice to interrupt the user. In other words, the above target dialogue scenario can be either the voice interaction system interrupting the user or the user interrupting the voice interaction system.
[0214] S403. Output feedback voice.
[0215] For example, the voice interaction system can output the aforementioned feedback voice through a user terminal so that the user can hear the voice interaction feedback. Optionally, the implementation of the voice interaction system outputting feedback voice can refer to any existing voice output method, which will not be elaborated here.
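Steps S401 to S403 can be summarized in a short sketch. The trained voice interaction model and the audio output are represented here by plain callables supplied by the caller; these stand-ins, the function name `handle_user_voice`, and the use of raw bytes for audio are assumptions of this sketch, not an interface defined by the application.

```python
from typing import Callable


def handle_user_voice(
    user_audio: bytes,
    model: Callable[[bytes], bytes],
    play: Callable[[bytes], None],
) -> bytes:
    """Run one interaction: receive user voice, get feedback voice, output it."""
    if not user_audio:
        raise ValueError("no user voice received")  # S401: nothing to process
    feedback = model(user_audio)                     # S402: model produces feedback voice
    play(feedback)                                   # S403: output feedback voice
    return feedback
```

Keeping the model and the output device behind callables lets the same flow run on a terminal, a server, or a test harness with stub components.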
[0216] In this embodiment, the voice interaction system interacts with the user through a voice interaction model trained on target dialogue data that includes dialogue content in the target dialogue scenario. This allows the voice interaction system to output feedback voice when the user interrupts the system, or when the system needs to interrupt the user. This improves the realism of the voice interaction between the user and the system and enhances the user experience.
[0217] The following example illustrates the dialogue data construction method provided in this application, using the target dialogue speech obtained from the initial dialogue text as synthetic data and the target dialogue speech obtained from the initial dialogue speech as real data:
[0218] Figure 5 is a flowchart illustrating another dialogue data construction method provided in this application. As shown in Figure 5, the process for acquiring synthetic data can be illustrated as follows:
[0219] 1. Obtain the original multi-turn dialogue text, i.e. the initial dialogue text mentioned above.
[0220] When acquiring synthetic data, the system first receives raw, uninterrupted multi-turn dialogue text as input. This raw multi-turn dialogue text can be designed with a pre-defined dialogue scenario, containing multiple dialogue rounds (i.e., multi-turn dialogue), but without any interruptions or interactive information. After being input into the system (i.e., the dialogue data construction system), the text can be preprocessed to ensure that each character's dialogue segments are clearly distinguishable. Preprocessing stages may include, for example, text formatting standardization.
[0221] 2. Generate target dialogue text, including interruption scenarios, using a large language model.
[0222] In the first stage, interruption markers (breakpoints) are inserted into the original multi-turn dialogue text using a large language model. For example, the first prompt content can constrain where a marker is placed, ensuring that it is inserted at a reasonable and natural position based on the context.
[0223] In the second stage, the large language model is used to generate new dialogue content at the inserted interruption markers. For example, the second prompt content can be used to ensure that the content after an interruption is consistent with the context and follows the rhythm of the dialogue, making the interruption less abrupt and increasing interactivity.
[0224] 3. Use a TTS model to perform modal conversion and generate the target dialogue speech.
[0225] After the target dialogue text containing the interruption scenario is generated, a pre-trained TTS model can be invoked to convert the text into speech. In some embodiments, the TTS model not only generates the speech for each character but also ensures accuracy in terms of intonation, speech rate, and timbre. For multi-character dialogues, the TTS model can generate unique speech features for each character to ensure the realism of the dialogue. Furthermore, the TTS model can generate timestamps for each audio segment, laying the foundation for the subsequent splicing and alignment of speech segments.
[0226] 4. Post-processing, namely the splicing and processing of speech segments.
[0227] After the target dialogue speech is generated, multiple audio segments within it can be post-processed. Post-processing may include, for example, splicing speech segments from different characters based on timestamps, ensuring the continuity and fluency of the dialogue. In some embodiments, the duration and intervals of the speech segments can also be adjusted to ensure natural transitions in interruption scenarios. The resulting target dialogue data can be a dual-track audio file containing the dialogue content (i.e., character voices) of all characters, with the dialogue content, timestamps, and dialogue text precisely aligned.
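The splicing step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Segment` type, the `interrupts` flag, and the default gap and overlap lengths are hypothetical and are not specified in this application. Each speaker is written to its own track, and an interrupting segment is placed so that it starts slightly before the previous segment ends.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    speaker: int              # 0 or 1: the track this segment is written to
    samples: List[float]      # mono samples for this segment, e.g. from a TTS model
    text: str                 # dialogue text aligned with the segment
    interrupts: bool = False  # hypothetical flag: segment cuts into the previous one


def splice_dual_track(
    segments: List[Segment],
    sample_rate: int = 16000,
    gap_s: float = 0.2,       # assumed pause between ordinary turns
    overlap_s: float = 0.3,   # assumed overlap when one speaker interrupts
) -> Tuple[List[float], List[float], List[Tuple[int, float, float, str]]]:
    """Place segments on two tracks and return (track0, track1, timeline)."""
    tracks: List[List[float]] = [[], []]
    timeline: List[Tuple[int, float, float, str]] = []
    cursor = 0  # sample index at which the previous segment ended
    for seg in segments:
        if not timeline:
            start = 0
        elif seg.interrupts:
            start = max(0, cursor - round(overlap_s * sample_rate))
        else:
            start = cursor + round(gap_s * sample_rate)
        track = tracks[seg.speaker]
        start = max(start, len(track))              # a speaker never overlaps itself
        track.extend([0.0] * (start - len(track)))  # pad with silence up to start
        track.extend(seg.samples)
        end = start + len(seg.samples)
        timeline.append((seg.speaker, start / sample_rate, end / sample_rate, seg.text))
        cursor = end
    total = max(len(t) for t in tracks)
    for t in tracks:
        t.extend([0.0] * (total - len(t)))          # make both tracks equally long
    return tracks[0], tracks[1], timeline
```

The returned timeline keeps the speaker, start time, end time, and text of each segment together, which is one simple way to keep the audio, timestamps, and dialogue text aligned.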
[0228] As shown in Figure 5, the process for obtaining real data can be illustrated as follows:
[0229] 1. Obtain the audio signals of the multi-turn dialogue, namely the initial dialogue voice mentioned above.
[0230] The initial dialogue audio mentioned above can be, for example, real-world multi-turn dialogue audio data, in formats such as WAV or MP3. The initial dialogue audio can include the statements of multiple speakers. In some embodiments, preprocessing operations such as audio format normalization and background noise removal may also be performed.
[0231] 2. Speaker separation and overlap detection
[0232] For example, a speaker separation model can be used to separate the speech segments of multiple speakers in a dialogue and detect the overlapping parts of the speakers' speech in the dialogue.
[0233] 3. Interruption scenario selection
[0234] Overlapping segments that do not belong to interruption scenarios are filtered out, and those that do are selected. For specific examples, refer to the methods described in the preceding embodiments, which will not be repeated here.
[0235] 4. Post-processing, namely the splicing and processing of speech segments.
[0236] The post-processing method can be referred to the method described in the foregoing embodiments, and will not be repeated here.
[0237] The target dialogue data obtained through the dialogue data construction method provided in this application includes both synthetic and real data, improving the quality of the dialogue data and thus enhancing the naturalness of the speech interaction model trained based on this target dialogue data in interactive scenarios. In experiments, the Mean Opinion Score (MOS, a metric for subjectively evaluating speech quality) of the target dialogue data provided in this application reached 4.35 (95% confidence interval), demonstrating the high accuracy of the target dialogue data in generating natural speech.
[0238] In this embodiment, a multi-stage prompting mechanism (i.e., the first and second prompt contents mentioned above) is used: the large language model is prompted in stages to generate the interruption and interaction content, ensuring the naturalness and coherence of the generated dialogue. Interruption scenarios are selected through speaker overlap detection and timbre similarity. In real data processing, overlapping speech areas are accurately identified, and real interaction scenarios are determined by detecting differences in speaker timbre, filtering out overlapping speech that does not belong to interruption scenarios, thus improving the accuracy of data filtering and the quality of the target dialogue speech. By obtaining target dialogue data in the form of dual-track speech annotated with precise timestamps, the alignment of speech segments in the target dialogue data with text content and role information is ensured, thereby improving the flexibility and accuracy of multi-turn dialogue generation.
[0239] Figure 6 is a schematic diagram of a dialogue data construction device provided in this application. As shown in Figure 6, the dialogue data construction device may include an acquisition module 61 and a construction module 62, wherein:
[0240] The acquisition module 61 is used to acquire initial dialogue data to be processed. The initial dialogue data includes: initial dialogue text, which does not contain dialogue content under the target dialogue scenario, wherein the target dialogue scenario is: the interaction scenario between the first speaker and the second speaker.
[0241] The construction module 62 is used to construct target dialogue speech, including dialogue content under the target dialogue scenario, based on the initial dialogue text; and to obtain target dialogue data based on the target dialogue speech. The target dialogue data is used to train a preset model to obtain a voice interaction model, which is used to perform voice interaction with the user under the target dialogue scenario.
[0242] Taking as an example the case where the interaction scenario between the first speaker and the second speaker is the first speaker interrupting the second speaker, optionally, the construction module 62 is specifically used to insert an interruption marker at the target position of the initial dialogue text to obtain dialogue text including the interruption marker; based on the dialogue text including the interruption marker, generate target dialogue text in which the first speaker interrupts the second speaker at the target position; and perform speech conversion on the target dialogue text to obtain the target dialogue speech.
[0243] Optionally, the construction module 62 is specifically used to obtain first prompt content, and input the first prompt content into the interruption marker processing model to obtain the dialogue text including the interruption marker. The first prompt content includes: the initial dialogue text, first indication information, and at least one of the following: a dialogue text instance including the interruption marker, and the condition that the target position must satisfy; the first indication information is used to indicate that the interruption marker is to be inserted into the initial dialogue text.
[0244] Optionally, the construction module 62 is specifically used to obtain the second prompt content, input the second prompt content into the text generation model, and obtain the target dialogue text. The second prompt content includes: the dialogue text including the interruption marker, second instruction information, and at least one of the following: a dialogue text instance in the target dialogue scenario, and the conditions that the target dialogue text must satisfy; the second instruction information is used to instruct the generation of target dialogue text at the target location, based on the dialogue text including the interruption marker, whereby the first speaker interrupts the second speaker.
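As an illustration only, the two prompt contents described above could be assembled as in the following Python sketch. The marker token `<interrupt>`, the instruction wording, and the `llm` callable are all assumptions made for this example; this application does not prescribe a marker format or a particular model interface. The sketch enforces the stated requirement that each prompt carries an example, a condition, or both.

```python
from typing import Callable, Optional

# Hypothetical marker token; the source does not specify its form.
INTERRUPT_MARK = "<interrupt>"


def _assemble(instruction: str, dialogue: str,
              example: Optional[str], condition: Optional[str]) -> str:
    # Each prompt must carry at least one of: an example, a condition.
    if example is None and condition is None:
        raise ValueError("provide an example, a condition, or both")
    parts = [instruction]
    if condition is not None:
        parts.append("Condition: " + condition)
    if example is not None:
        parts.append("Example:\n" + example)
    parts.append("Dialogue:\n" + dialogue)
    return "\n\n".join(parts)


def build_first_prompt(initial_text: str, example: Optional[str] = None,
                       position_condition: Optional[str] = None) -> str:
    """First prompt content: ask for interruption markers to be inserted."""
    instruction = (f"Insert {INTERRUPT_MARK} into the dialogue where one speaker "
                   "could naturally interrupt. Leave all other text unchanged.")
    return _assemble(instruction, initial_text, example, position_condition)


def build_second_prompt(marked_text: str, example: Optional[str] = None,
                        text_condition: Optional[str] = None) -> str:
    """Second prompt content: ask for the interruption dialogue at each marker."""
    instruction = (f"At each {INTERRUPT_MARK}, rewrite the dialogue so that the "
                   "other speaker interrupts there. Remove the marker.")
    return _assemble(instruction, marked_text, example, text_condition)


def construct_target_text(initial_text: str, llm: Callable[[str], str],
                          position_condition: str, text_condition: str) -> str:
    """Run both stages; `llm` is any text-in, text-out callable (assumed)."""
    marked = llm(build_first_prompt(initial_text,
                                    position_condition=position_condition))
    if INTERRUPT_MARK not in marked:
        raise ValueError("stage 1 produced no interruption marker")
    return llm(build_second_prompt(marked, text_condition=text_condition))
```

Checking for the marker between the two stages gives a cheap way to reject a first-stage output that ignored the instruction before the second prompt is built from it.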
[0245] Taking as an example the case where the interaction scenario between the first speaker and the second speaker is the first speaker interrupting the second speaker, optionally, the initial dialogue data also includes initial dialogue voice. The construction module 62 can also be used to obtain, before obtaining the target dialogue data based on the target dialogue voice, the dialogue overlapping voice in the initial dialogue voice, and to obtain the target dialogue voice based on the dialogue overlapping voice.
[0246] Optionally, the construction module 62 is specifically used to obtain the timbre similarity of different speakers in the overlapping dialogue speech; filter out speech segments whose timbre similarity is greater than or equal to a preset similarity threshold to obtain filtered overlapping dialogue speech; and obtain the target dialogue speech based on the filtered overlapping dialogue speech.
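A minimal sketch of this filtering step is given below. It assumes that each overlapping segment comes with one embedding vector per speaker (produced by some speaker encoder, not shown) and that timbre similarity is measured as cosine similarity; both are assumptions for illustration, as is the threshold value of 0.8. Segments whose similarity is greater than or equal to the threshold are removed.

```python
import math
from typing import List, Sequence, Tuple

# (segment id, embedding of speaker A, embedding of speaker B); hypothetical layout
OverlapItem = Tuple[str, Sequence[float], Sequence[float]]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_by_timbre(overlaps: List[OverlapItem], threshold: float = 0.8) -> List[str]:
    """Keep only the segments whose two voices are sufficiently different."""
    return [seg_id for seg_id, emb_a, emb_b in overlaps
            if cosine_similarity(emb_a, emb_b) < threshold]
```

For example, `filter_by_timbre([("s1", [1, 0], [1, 0]), ("s2", [1, 0], [0, 1])])` keeps only `"s2"`, because the two voices in `"s1"` are indistinguishable under this measure.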
[0247] Optionally, the construction module 62 is specifically used to determine, based on the filtered overlapping dialogue speech, a speech segment in the initial dialogue speech that meets preset conditions, as the target dialogue speech. The preset conditions include at least one of the following: the speech segment includes the filtered overlapping dialogue speech; the speech segment causes the duration of the filtered overlapping dialogue speech to be greater than or equal to a preset duration.
[0248] Optionally, the construction module 62 is specifically used to perform speaker separation on the initial dialogue speech to obtain speaker separation results; the speaker separation results are used to characterize the speech segments corresponding to each speaker in the initial dialogue speech; based on the speaker separation results, overlapping speech segments in the dialogue with overlapping speaker speech are determined.
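Assuming the speaker separation result is available as a list of turns with a speaker label and start and end times (a hypothetical representation, since the output format is not specified here), overlapping speech between different speakers can be located with a simple interval scan, sketched below in Python.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds


class Overlap(NamedTuple):
    first: str    # speaker who was already talking
    second: str   # speaker who started during the other's turn
    start: float
    end: float


def find_overlaps(turns: List[Turn]) -> List[Overlap]:
    """Return every time span in which two different speakers talk at once."""
    ordered = sorted(turns, key=lambda t: t.start)
    overlaps: List[Overlap] = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # turns are sorted by start, so no later turn overlaps a
            if b.speaker != a.speaker:
                overlaps.append(Overlap(a.speaker, b.speaker,
                                        b.start, min(a.end, b.end)))
    return overlaps
```

Each result records which speaker was already talking and which one started later, which is the information needed to decide afterwards whether the overlap is an interruption.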
[0249] Optionally, the interaction scenario between the first speaker and the second speaker may further include: the first speaker and the second speaker engaging in multi-turn dialogue, wherein the initial dialogue voice is multi-turn dialogue voice, and / or, the initial dialogue text is multi-turn dialogue text.
[0250] Optionally, the construction module 62 is specifically used to adjust the duration of the target dialogue voice and the time interval between different speakers based on the timestamps corresponding to the speech content of each speaker in the target dialogue voice, so as to obtain the adjusted target dialogue voice; and to obtain the target dialogue data based on the adjusted target dialogue voice.
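One possible form of this timestamp-based adjustment is shown below as a hedged Python sketch: pauses between consecutive turns that exceed an assumed maximum are shortened, while segment lengths and existing overlaps are preserved. The `max_gap` parameter and the tuple layout `(speaker, start, end)` are illustrative assumptions, not details given in this application.

```python
from typing import List, Optional, Tuple

TurnSpan = Tuple[str, float, float]  # (speaker, start seconds, end seconds)


def clamp_gaps(turns: List[TurnSpan], max_gap: float = 0.5) -> List[TurnSpan]:
    """Shorten silences longer than max_gap; `turns` must be sorted by start."""
    adjusted: List[TurnSpan] = []
    shift = 0.0                        # total time removed so far
    prev_end: Optional[float] = None   # latest end among adjusted turns
    for speaker, start, end in turns:
        new_start = start - shift
        if prev_end is not None and new_start - prev_end > max_gap:
            extra = new_start - prev_end - max_gap
            shift += extra
            new_start -= extra
        new_end = new_start + (end - start)  # segment duration is unchanged
        adjusted.append((speaker, new_start, new_end))
        prev_end = new_end if prev_end is None else max(prev_end, new_end)
    return adjusted
```

Because all later turns are shifted by the same accumulated amount, a turn that overlapped its predecessor before the adjustment still overlaps it by the same duration afterwards.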
[0251] The dialogue data construction apparatus provided in this application is used to execute the aforementioned dialogue data construction method embodiment. Its implementation principle and technical effect are similar, and will not be described in detail here.
[0252] Figure 7 is a structural schematic diagram of a voice interaction device provided in this application. As shown in Figure 7, the voice interaction device may include a receiving module 71, a processing module 72, and an output module 73, wherein:
[0253] The receiving module 71 is used to receive user voice.
[0254] Processing module 72 is used to input the user's voice into a pre-trained voice interaction model to obtain feedback voice in the target dialogue scenario. The voice interaction model is obtained by training a preset model based on the target dialogue data described in any of the foregoing embodiments; the target dialogue scenario is an interaction scenario between a first speaker and a second speaker.
[0255] Output module 73 is used to output the feedback voice.
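The division of the device into three modules can be pictured with the following Python sketch. It is a structural illustration only: the class name, the use of `bytes` for audio, and the `model` and `sink` callables are assumptions, and the trained voice interaction model itself is represented by an arbitrary callable.

```python
from typing import Callable


class VoiceInteractionDevice:
    """Receiving, processing, and output steps wired around an injected model."""

    def __init__(self, model: Callable[[bytes], bytes],
                 sink: Callable[[bytes], None]) -> None:
        self._model = model  # stands in for the pre-trained voice interaction model
        self._sink = sink    # stands in for the audio output path, e.g. a speaker

    def receive(self, user_voice: bytes) -> bytes:
        # Role of receiving module 71: accept the user's voice.
        if not user_voice:
            raise ValueError("empty user voice")
        return user_voice

    def process(self, user_voice: bytes) -> bytes:
        # Role of processing module 72: obtain feedback voice from the model.
        return self._model(user_voice)

    def output(self, feedback_voice: bytes) -> None:
        # Role of output module 73: emit the feedback voice.
        self._sink(feedback_voice)

    def interact(self, user_voice: bytes) -> bytes:
        feedback = self.process(self.receive(user_voice))
        self.output(feedback)
        return feedback
```

Injecting the model and the output path as callables keeps the sketch independent of any particular model or audio library.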
[0256] Optionally, taking the application of the voice interaction device in a voice interaction system as an example, the interaction scenarios between the first speaker and the second speaker include: the first speaker interrupting the second speaker; when the first speaker is a user, the second speaker is the voice interaction system; when the first speaker is the voice interaction system, the second speaker is the user.
[0257] The voice interaction device provided in this application is used to execute the aforementioned voice interaction method embodiments. Its implementation principle and technical effect are similar, and will not be described in detail here.
[0258] Figure 8 is a schematic diagram of the hardware structure of an electronic device provided in this application. As shown in Figure 8, the electronic device 80 includes a memory 81, a processor 82, and a communication interface 83, which are communicatively connected to each other. For example, the memory 81, processor 82, and communication interface 83 can be connected via a network. Alternatively, the electronic device 80 may also include a bus 84, in which case the memory 81, processor 82, and communication interface 83 are communicatively connected to each other via the bus 84. Figure 8 takes as an example an electronic device 80 in which the memory 81, processor 82, and communication interface 83 are connected to each other via the bus 84.
[0259] The memory 81 can be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 81 can store programs, and when the program stored in the memory 81 is executed by the processor 82, the processor 82 and the communication interface 83 are used to execute the dialogue data construction method and / or voice interaction method described in any of the foregoing embodiments. The memory can also store data required by the dialogue data construction method and / or voice interaction method.
[0260] The processor 82 can be a general-purpose CPU, microprocessor, application-specific integrated circuit (ASIC), graphics processing unit (GPU), or one or more integrated circuits.
[0261] The processor 82 can also be an integrated circuit chip with signal processing capabilities. In implementation, the dialogue data construction method and / or voice interaction method of this application can be completed through the integrated logic circuits in the hardware of the processor 82 or through software instructions. The aforementioned processor 82 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, capable of implementing or executing the methods, steps, and logic block diagrams disclosed in the following embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the following embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. The storage medium is located in memory 81. Processor 82 reads the information in memory 81 and combines its hardware to complete the dialogue data construction method and / or voice interaction method of this application.
[0262] The communication interface 83 uses transceiver modules, such as, but not limited to, transceivers, to enable communication between the electronic device 80 and other devices or communication networks. For example, a dataset can be acquired through the communication interface 83.
[0263] When the aforementioned electronic device 80 includes a bus 84, the bus 84 may include a path for transmitting information between various components of the electronic device 80 (e.g., memory 81, processor 82, communication interface 83).
[0264] This application also provides a computer-readable storage medium, which may include various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk. Specifically, the computer-readable storage medium stores program instructions, which are used in the methods described in the above embodiments.
[0265] This application also provides a program product including execution instructions stored in a readable storage medium. At least one processor of an electronic device can read the execution instructions from the readable storage medium, and the execution of the execution instructions by the at least one processor causes the electronic device to implement the dialogue data construction method and / or voice interaction method provided in the various embodiments described above.
[0266] The term "multiple" in this document refers to two or more. The term "and / or" is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. Furthermore, the character " / " in this document generally indicates an "or" relationship between the preceding and following related objects; in formulas, " / " indicates a "division" relationship. Additionally, it should be understood that in the description of this application, words such as "first" and "second" are used only for descriptive purposes and should not be construed as indicating or implying relative importance or order.
[0267] It is understood that the various numerical designations used in the embodiments of this application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of this application.
[0268] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
Claims
1. A method for constructing dialogue data, characterized in that, The method includes: Acquire initial dialogue data to be processed; the initial dialogue data includes: initial dialogue text, the initial dialogue text does not contain dialogue content under the target dialogue scenario, the target dialogue scenario is: the interaction scenario between the first speaker and the second speaker; Based on the initial dialogue text, construct a target dialogue speech that includes the dialogue content in the target dialogue scenario; Based on the target dialogue voice, target dialogue data is obtained; the target dialogue data is used to train a preset model to obtain a voice interaction model, and the voice interaction model is used to conduct voice interaction with the user in the target dialogue scenario.
2. The method according to claim 1, characterized in that, The interaction scenario between the first speaker and the second speaker includes: the first speaker interrupting the second speaker; the construction of target dialogue speech including dialogue content under the target dialogue scenario based on the initial dialogue text includes: Insert an interruption marker at the target position of the initial dialogue text to obtain dialogue text including the interruption marker; Based on the dialogue text including the interruption marker, generate a target dialogue text at the target location where the first speaker interrupts the second speaker. The target dialogue text is converted into speech to obtain the target dialogue speech.
3. The method according to claim 2, characterized in that, The step of inserting an interruption marker at the target position of the initial dialogue text to obtain dialogue text including the interruption marker includes: Obtain the first prompt content; the first prompt content includes: the initial dialogue text, the first instruction information, and at least one of the following: a dialogue text instance including the interruption marker, and the condition that the target position must meet; the first instruction information is used to indicate that the interruption marker is inserted into the initial dialogue text; The first prompt content is input into the interruption marker processing model to obtain the dialogue text including the interruption marker.
4. The method according to claim 2 or 3, characterized in that, The step of generating target dialogue text at the target location, based on the dialogue text including the interruption marker, whereby the first speaker interrupts the second speaker, includes: Obtain the second prompt content, which includes: the dialogue text including the interruption mark, the second instruction information, and at least one of the following: the dialogue text instance in the target dialogue scenario, and the conditions that the target dialogue text must meet; the second instruction information is used to instruct the generation of a target dialogue text at the target location, based on the dialogue text including the interruption mark, in which the first speaker interrupts the second speaker. The second prompt is input into the text generation model to obtain the target dialogue text.
5. The method according to any one of claims 1-3, characterized in that, The interaction scenario between the first speaker and the second speaker includes: the first speaker interrupting the second speaker; the initial dialogue data further includes: initial dialogue speech; and before obtaining the target dialogue data based on the target dialogue speech, the method further includes: Obtain overlapping speech from the initial dialogue; Based on the overlapping speech in the dialogue, the target dialogue speech is obtained.
6. The method according to claim 5, characterized in that, The step of obtaining the target dialogue speech based on the overlapping dialogue speech includes: Obtain the timbre similarity of different speakers in the overlapping dialogue speech; Speech segments with timbre similarity greater than or equal to a preset similarity threshold are filtered out to obtain filtered overlapping dialogue speech; Based on the filtered overlapping dialogue speech, the target dialogue speech is obtained.
7. The method according to claim 6, characterized in that, The step of obtaining the target dialogue speech based on the filtered overlapping dialogue speech includes: Based on the filtered overlapping dialogue speech, a speech segment in the initial dialogue speech that meets a preset condition is determined as the target dialogue speech; the preset condition includes at least one of the following: the speech segment includes the filtered overlapping dialogue speech, and the speech segment makes the duration of the filtered overlapping dialogue speech greater than or equal to a preset duration.
8. The method according to claim 5, characterized in that, The step of obtaining overlapping dialogue speech in the initial dialogue speech, where there is speaker overlap, includes: Speaker separation is performed on the initial dialogue speech to obtain speaker separation results; the speaker separation results are used to characterize the speech segments corresponding to each speaker in the initial dialogue speech; Based on the speaker separation results, the overlapping dialogue speech in which speaker speech overlaps is determined.
9. The method according to claim 5, characterized in that, The interaction scenario between the first speaker and the second speaker further includes: the first speaker and the second speaker engaging in multi-turn dialogue, wherein the initial dialogue voice is multi-turn dialogue voice, and / or, the initial dialogue text is multi-turn dialogue text.
10. The method according to any one of claims 1-3, characterized in that, The process of obtaining target dialogue data based on the target dialogue speech includes: Based on the timestamps corresponding to the speech content of each speaker in the target dialogue voice, the duration of the target dialogue voice and the time interval between different speakers are adjusted to obtain the adjusted target dialogue voice. The target dialogue data is obtained based on the adjusted target dialogue voice.
11. A voice interaction method, characterized in that, The method includes: Receive user voice; The user voice is input into a pre-trained voice interaction model to obtain feedback voice in the target dialogue scenario; the voice interaction model is obtained by training a preset model based on the target dialogue data as described in any one of claims 1-10; the target dialogue scenario is: an interaction scenario between a first speaker and a second speaker; Output the feedback voice.
12. The method according to claim 11, characterized in that, The method is applied to a voice interaction system, and the interaction scenarios between the first speaker and the second speaker include: the first speaker interrupting the second speaker; when the first speaker is the user, the second speaker is the voice interaction system; when the first speaker is the voice interaction system, the second speaker is the user.
13. A dialogue data construction apparatus, characterized in that, The device includes: The acquisition module is used to acquire initial dialogue data to be processed; the initial dialogue data includes: initial dialogue text, the initial dialogue text does not contain dialogue content under the target dialogue scenario, the target dialogue scenario is: the interaction scenario between the first speaker and the second speaker; A construction module is used to construct target dialogue speech including dialogue content under the target dialogue scenario based on the initial dialogue text; obtain target dialogue data based on the target dialogue speech; use the target dialogue data to train a preset model to obtain a voice interaction model; and use the voice interaction model to perform voice interaction with the user under the target dialogue scenario.
14. An electronic device, characterized in that, it includes: a processor and a memory; the processor and the memory are communicatively connected; The memory stores computer-executable instructions; The processor executes the computer-executable instructions stored in the memory to implement the method as described in any one of claims 1-11.
15. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer-executable instructions, which, when executed by a processor, are used to implement the method as described in any one of claims 1-11.
16. A computer program product, characterized in that, it includes a computer program that, when executed by a processor, implements the method described in any one of claims 1-11.