Method and apparatus for using voice to interrupt agent, medium, device and product

By extracting audio and text features from speech frames and combining them with a large language model (LLM) and a prediction module, the disclosure addresses the limited accuracy and immediacy of speech interruption in full-duplex dialogue systems, thereby improving the system's stability and response speed.

WO2026138536A1 · PCT designated stage · Publication Date: 2026-07-02 · BEIJING ZITIAO NETWORK TECH CO LTD +1

Patent Information

Authority / Receiving Office: WO
Patent Type: Applications
Current Assignee / Owner: BEIJING ZITIAO NETWORK TECH CO LTD
Filing Date: 2025-12-12
Publication Date: 2026-07-02

Figure CN2025142028_02072026_PF_FP_ABST

Abstract

A method and apparatus for using a voice to interrupt an agent, a medium, a device, and a product, relating to the technical field of voice processing. The method comprises: in the process of an agent outputting information, acquiring original information, the original information comprising at least a voice frame; extracting a feature from the original information; on the basis of the feature, obtaining a text and an interrupt prediction result, the text at least being used for the agent to determine information to be output by the agent; and, in response to the interrupt prediction result being used to represent interrupting the agent outputting information, interrupting the agent outputting information. On the basis of the feature extracted from the original information, the present disclosure obtains the text and the interrupt prediction result, that is, while performing a voice recognition task to obtain the text, synchronously determines the interrupt prediction result. Thus, in full-duplex real-time voice interaction scenarios, while generating texts in real time, the present disclosure can quickly respond to interrupt demands of users so as to improve the instantaneity of interrupts.

Description

Methods, devices, media, equipment and products for interrupting intelligent agents using voice

[0001] Cross-reference to related applications

[0002] This application claims priority to Chinese Patent Application No. 202411934410.3, filed on December 25, 2024, the disclosure of which is incorporated herein by reference in its entirety.

Technical Field

[0003] This disclosure relates to a method, apparatus, medium, device, and product for interrupting an intelligent agent using voice.

Background Technology

[0004] In the development of voice interaction systems, full-duplex dialogue systems provide an advanced human-computer interaction mode: the system can respond to the user's voice input and output information while it is still capturing and understanding that input.

[0005] Voice interruption is an important technology that makes full-duplex dialogue systems more human-like. It allows users to interrupt the current output of the full-duplex dialogue system, and the full-duplex dialogue system can execute other responses based on the user's current voice. However, the relevant technology has certain limitations in terms of the accuracy and immediacy of interruption.

Summary of the Invention

[0006] In a first aspect, this disclosure provides a method for interrupting an intelligent agent using voice, including:

[0007] During the process of the intelligent agent outputting information, acquiring original information, where the original information includes at least voice frames;

[0008] Extracting features from the original information;

[0009] Obtaining text and an interruption prediction result based on the features, where the text is used at least by the agent to determine the information it is to output; and

[0010] In response to the interruption prediction result indicating that the agent's output of information is to be interrupted, interrupting the agent's output of information.

[0011] In a second aspect, this disclosure provides a device for interrupting an intelligent agent using voice, comprising:

[0012] An acquisition module, used to acquire original information during the process of the intelligent agent outputting information, wherein the original information includes at least voice frames;

[0013] A feature extraction module, used to extract features from the original information;

[0014] A processing module, used to obtain text and an interruption prediction result based on the features, wherein the text is used at least by the agent to determine the information it is to output; and

[0015] An interruption module, used to interrupt the agent's output of information in response to the interruption prediction result indicating that the agent's output of information is to be interrupted.

[0016] In a third aspect, this disclosure provides a computer-readable medium having a computer program stored thereon, which, when executed by a processing device, implements the steps of the method described in the first aspect.

[0017] In a fourth aspect, this disclosure provides an electronic device, comprising:

[0018] A storage device on which computer programs are stored;

[0019] A processing device for executing the computer program in the storage device to implement the steps of the method described in the first aspect.

[0020] In a fifth aspect, this disclosure provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the method described in the first aspect.

Brief Description of the Drawings

[0021] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and that components and elements are not necessarily drawn to scale. In the drawings:

[0022] Figure 1 is a flowchart illustrating a method for interrupting an intelligent agent using voice according to an embodiment of the present disclosure.

[0023] Figure 2 is an architecture diagram of a model for determining text and interruption prediction results according to an embodiment of the present disclosure.

[0024] Figure 3 is a block diagram illustrating a device for interrupting an intelligent agent using voice according to an embodiment of the present disclosure.

[0025] Figure 4 is a schematic diagram of the structure of an electronic device according to an embodiment of the present disclosure.

Detailed Implementation

[0026] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0027] It should be understood that the steps described in the method embodiments of this disclosure may be performed in different orders and / or in parallel. Furthermore, the method embodiments may include additional steps and / or omit the steps shown. The scope of this disclosure is not limited in this respect.

[0028] The term "comprising" and its variations as used herein are open-ended inclusions, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below.

[0029] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.

[0030] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".

[0031] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

[0032] It is understood that before using the technical solutions disclosed in the various embodiments of this disclosure, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in this disclosure in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.

[0033] For example, upon receiving a user's active request, a prompt message is sent to the user to explicitly inform them that the requested operation will require the acquisition and use of the user's personal information. This allows the user to independently choose whether to provide personal information to the software or hardware, such as the electronic device, application, server, or storage medium performing the operations of this disclosed technical solution, based on the prompt message.

[0034] As an optional but non-limiting implementation, in response to a user's active request, sending a prompt message to the user can be done via a pop-up window, where the prompt message can be presented in text format. Furthermore, the pop-up window can also include a selection control allowing the user to choose "agree" or "disagree" to provide personal information to the electronic device.

[0035] It is understood that the above notification and user authorization process are merely illustrative and do not constitute a limitation on the implementation of this disclosure. Other methods that comply with relevant laws and regulations may also be applied to the implementation of this disclosure.

[0036] Meanwhile, it is understood that the data involved in this technical solution (including but not limited to the data itself, the acquisition or use of the data) shall comply with the requirements of relevant laws, regulations and related provisions.

[0037] In the development of voice interaction systems, full-duplex dialogue systems provide an advanced human-computer interaction mode: the system can respond to the user's voice input and output information while it is still capturing and understanding that input.

[0038] Voice interruption functionality is an important speech recognition technology that makes full-duplex dialogue systems more human-like. It allows users to interrupt the content being read by the full-duplex dialogue system at any time. The full-duplex dialogue system needs to quickly and accurately identify the interruption intention and adjust its content accordingly. However, related technologies have certain limitations in terms of the accuracy and immediacy of interruption.

[0039] For example, some related technologies use voice detection to interrupt, which assumes that the user intends to interrupt the full-duplex dialogue system if human voice is detected in the speech. However, this method is easily affected by background noise, which can lead to false interruptions.

[0040] For example, some related technologies use speech recognition to decide on interruption, determining whether an interruption is necessary by recognizing words in the speech. However, this approach is prone to accidental interruption in complex scenarios, such as conversations with other people, background speech, or device voice broadcasts, which reduces the stability and reliability of the dialogue system.

[0041] For example, some related technologies use a non-streaming approach to interruption, relying on the complete semantics of a sentence to determine whether to interrupt. However, this approach has a slow response time and cannot meet the immediacy requirements of real-time voice scenarios.

[0042] In summary, the relevant technologies have certain limitations in terms of the accuracy and immediacy of interruption.

[0043] In view of this, embodiments of the present disclosure provide a method, apparatus, medium, device, and product for interrupting an intelligent agent using voice.

[0044] First, it should be noted that the intelligent agent described below is similar to the full-duplex dialogue system mentioned above. The intelligent agent captures and understands the user's voice input while outputting information. Furthermore, the intelligent agent can be a voice assistant.

[0045] The embodiments of this disclosure will be explained and described below with reference to the accompanying drawings.

[0046] Figure 1 is a flowchart illustrating a method for interrupting an intelligent agent using voice according to an embodiment of the present disclosure. This method for interrupting an intelligent agent using voice can be applied to an electronic device, and can be executed by a device for interrupting an intelligent agent using voice. The device for interrupting an intelligent agent using voice can be implemented in software and / or hardware and can be configured in an electronic device. Referring to Figure 1, the method for interrupting an intelligent agent using voice may include the following steps:

[0047] Step 110: During the process of the intelligent agent outputting information, acquire the original information, which includes at least the voice frames;

[0048] Step 120: Extract features from the original information;

[0049] Step 130: Based on the features, obtain the text and the interruption prediction result. The text is used at least to help the agent determine the information it wants to output.

[0050] Step 140: In response to the interruption prediction result indicating that the agent's output of information is to be interrupted, interrupt the agent's output of information.

[0051] By using the above method, based on the features extracted from the original information, the text and interruption prediction results are obtained. That is, while performing the speech recognition task to obtain the text, the interruption prediction results are determined simultaneously. Thus, in the full-duplex real-time voice interaction scenario, the interruption needs of users can be quickly responded to while the text is generated in real time, thereby improving the immediacy of interruption.
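Steps 110 to 140 can be read as a per-frame loop. The following is a minimal sketch under stated assumptions, not the disclosed implementation: the function name `run_while_agent_outputs` and the two callables `extract_features` and `predict` are hypothetical placeholders for the feature extraction and the joint text and interruption prediction, which the disclosure does not name.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical signatures; the disclosure does not name these functions.
FeatureFn = Callable[[bytes], List[float]]
PredictFn = Callable[[List[float]], Tuple[str, bool]]


def run_while_agent_outputs(
    frames: Iterable[bytes],
    extract_features: FeatureFn,
    predict: PredictFn,
) -> Tuple[str, bool]:
    """Process voice frames one by one while the agent is outputting.

    Returns the text recognized so far and whether the agent was interrupted.
    """
    recognized: List[str] = []
    for frame in frames:                      # step 110: acquire a voice frame
        features = extract_features(frame)    # step 120: extract features
        text, interrupt = predict(features)   # step 130: text + interruption result
        recognized.append(text)
        if interrupt:                         # step 140: interrupt the agent's output
            return "".join(recognized), True
    return "".join(recognized), False
```

Because the text and the interruption decision come from the same call on the same features, the loop can stop the agent's output on the very frame that triggers the interruption, without waiting for a complete sentence.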

[0052] It is worth noting that the text obtained based on the features mentioned above can be the text recognition result corresponding to the speech recognition task that the agent needs to perform. This text recognition result can be used by the agent to determine the information it wants to output; that is, based on the text recognition result, the agent can determine the information it needs to output. The processing of text obtained based on features will be referred to as speech recognition processing below.

[0053] In this disclosure, there may be one or more voice frames. It is understood that the fewer the number of voice frames, the higher the immediacy of determining when to interrupt the agent's output information.

[0054] In this disclosure, the voice frame can be acquired by the electronic device in real time. Furthermore, the voice frame can be detected in real time by an intelligent agent running on the electronic device; the voice frame may include the user's voice, background noise, or silence.

[0055] In this disclosure, as an example, the intelligent agent can converse with the user. For instance, the intelligent agent can output corresponding information in real time based on the information input by the user. The user's input can be generated through voice input or text editing; this embodiment does not limit the specific input.

[0056] In some possible implementations, the original information may also include historical text and historical dialogue text output by the agent, where the historical text is obtained from previously performed speech recognition processing. Based on this, the features may include audio features and text features. Further, step 120 above can be implemented as follows: feature extraction is performed on the speech frame using an audio encoding module to obtain audio features corresponding to the speech frame; feature extraction is performed on the target text using a text encoding module to obtain text features corresponding to the target text, where the target text includes at least one of the historical text and the historical dialogue text.

[0057] It is worth noting that historical text is derived from features extracted from the original information acquired historically. Historical text can represent semantic information related to the detected speech frames. As an example, in the speech recognition processing of a series of speech frames, in the (N+1)th speech recognition process, the historical text obtained from the previous M speech recognition processes can be used in the (N+1)th speech recognition process, where M is less than N; for example, M can be 1.
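One way to keep the historical text from only the previous M recognition passes is a bounded buffer. This is an illustrative sketch: the class name `HistoryBuffer` and the choice to join the retained texts into one string are assumptions, not details given in the disclosure.

```python
from collections import deque


class HistoryBuffer:
    """Keeps the text produced by the most recent M recognition passes."""

    def __init__(self, m: int = 1) -> None:
        if m < 1:
            raise ValueError("m must be at least 1")
        # A deque with maxlen drops the oldest entry automatically.
        self._texts: deque = deque(maxlen=m)

    def add(self, text: str) -> None:
        """Record the text obtained from the latest recognition pass."""
        self._texts.append(text)

    def as_context(self) -> str:
        """Return the retained historical text, oldest first."""
        return "".join(self._texts)
```

With `m=1`, as in the example in the paragraph above, only the text from the immediately preceding pass is carried into the next one.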

[0058] The historical dialogue text can represent the dialogue scene information and context information in the agent's current dialogue.

[0059] The text encoding module can be a trained LLM (Large Language Model). Leveraging the semantic understanding capabilities of the LLM, it extracts features from the target text to obtain the semantic information representing the target text. The semantic understanding capability of the LLM is achieved through pre-training on large-scale text data, enabling it to learn the inherent structure and semantic information of the language.

[0060] The target text may include either historical text or historical dialogue text, or the target text may include both historical text and historical dialogue text.

[0061] As can be seen from the above, the text and interruption prediction results are obtained based on features. In addition to the audio features corresponding to the speech frames, features representing dialogue scene information and context information can be extracted from the historical dialogue text, and features representing semantics related to the speech frames can be extracted from the historical text. This increases the expressive power of the features, improves the accuracy of the interruption prediction results, and thus improves the accuracy of speech interruption based on those results.

[0062] In a possible manner, the steps described above for obtaining text and interruption prediction results based on features may include: processing the features through a first prediction module to obtain the probability that the phoneme corresponding to the speech frame belongs to each basic unit in a preset dictionary; processing the features through a second prediction module to obtain the probability that the speech frame belongs to a blank frame; and obtaining text based on the probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary and the probability that the speech frame belongs to a blank frame, wherein the text includes the basic unit to which the phoneme corresponding to the speech frame belongs or the text is empty.

[0063] In this embodiment, the features are extracted from the original information. When the original information includes speech frames, historical text, and historical dialogue text, the features include audio features corresponding to the speech frames and text features corresponding to the target text. It is understood that the target text in this embodiment may include at least one of historical text and historical dialogue text.

[0064] In the field of speech recognition technology, a phoneme is the smallest unit of speech defined based on the natural attributes of speech. In this embodiment, the basic unit can be a character or a letter.

[0065] It is worth noting that during the real-time detection of speech by the intelligent agent, a speech segment may contain multiple speech frames, and there may be blank frames among these multiple speech frames. Therefore, in speech recognition processing, it is not only necessary to identify the character or letter to which the phoneme corresponding to the speech frame belongs, but also to identify whether the speech frame is a blank frame. If it is a blank frame, the resulting text is empty.

[0066] By using the above method, different prediction modules are used to determine the basic unit to which the phoneme corresponding to the speech frame belongs or whether the speech frame belongs to a blank frame.
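The disclosure does not state how the two probabilities are combined into text. The sketch below assumes one simple rule for illustration: if the blank probability reaches a threshold, the text is empty; otherwise the most probable basic unit is emitted. The function name `decode_frame` and the `blank_threshold` parameter are hypothetical.

```python
from typing import Dict


def decode_frame(
    unit_probs: Dict[str, float],
    blank_prob: float,
    blank_threshold: float = 0.5,  # assumed value, not from the disclosure
) -> str:
    """Return the basic unit for a speech frame, or "" for a blank frame."""
    if not unit_probs:
        raise ValueError("unit_probs must contain at least one basic unit")
    if blank_prob >= blank_threshold:
        return ""  # blank frame: the resulting text is empty
    # Otherwise pick the basic unit (character or letter) with the highest probability.
    return max(unit_probs, key=lambda unit: unit_probs[unit])
```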

[0067] In a possible manner, the aforementioned first prediction module includes a first prediction submodule, a second prediction submodule, and a fusion submodule. The step of processing features through the first prediction module to obtain the probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary includes: processing audio features through the first prediction submodule to obtain a first probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary; processing text features through the second prediction submodule to obtain a second probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary; and fusing the first and second probabilities through the fusion submodule to obtain the probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary.

[0068] The first prediction submodule and the second prediction submodule can each include a fully connected layer and a softmax layer. The fully connected layer is responsible for further extracting features from the features (audio features or text features) extracted from the original information and performing linear transformations. The softmax layer is responsible for converting the features after linear transformation into a probability distribution.
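A fully connected layer followed by a softmax layer can be sketched in plain Python as follows. The weights, bias, and function names are illustrative assumptions; a real submodule would use trained parameters and a tensor library.

```python
import math
from typing import List, Sequence


def linear(x: Sequence[float], weights: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_submodule(features: Sequence[float],
                         weights: Sequence[Sequence[float]],
                         bias: Sequence[float]) -> List[float]:
    """Linear transformation followed by softmax over the preset dictionary."""
    return softmax(linear(features, weights, bias))
```

The output has one probability per basic unit in the preset dictionary, so the number of weight rows equals the dictionary size.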

[0069] The fusion submodule can add the first probability and the second probability in the logarithmic domain to achieve the fusion of the first and second probabilities. The result of adding the first and second probabilities is the probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary. It is worth noting that adding the first probability and the second probability means adding the logarithm corresponding to the first probability and the logarithm corresponding to the second probability. The result of the addition is equal to the logarithm of the product of the first probability and the second probability.
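The log-domain addition described above can be sketched as follows. The function name `fuse_log_domain` and the small `eps` floor (used only to avoid taking the logarithm of zero) are assumptions added for illustration.

```python
import math
from typing import List, Sequence


def fuse_log_domain(first: Sequence[float], second: Sequence[float],
                    eps: float = 1e-12) -> List[float]:
    """Add two probability vectors in the logarithmic domain.

    Each output equals log(p * q) for the matching basic unit.
    """
    if len(first) != len(second):
        raise ValueError("both vectors must cover the same preset dictionary")
    return [math.log(max(p, eps)) + math.log(max(q, eps))
            for p, q in zip(first, second)]
```

For example, with first probabilities `[0.5, 0.5]` and second probabilities `[0.2, 0.8]`, the fused score of the second unit is log(0.5) + log(0.8) = log(0.4), and that unit has the highest fused score.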

[0070] By using the above methods, different prediction modules are used to process the feature information of different modalities, avoiding mutual interference between information of different modalities, thereby improving the feature expression capability and thus improving the stability of feature-based speech recognition processing.

[0071] In a possible manner, the steps described above for obtaining the text and interruption prediction results based on features may further include: processing the features through an interruption prediction module to obtain an interruption prediction result, where the interruption prediction result characterizes the probability that an interruption is determined. In this case, step 140 above can be implemented as follows: determining whether a preset interruption condition is met based on the probability characterized by at least one interruption prediction result; if the preset interruption condition is met, determining to interrupt the agent's output information; if the preset interruption condition is not met, determining not to interrupt the agent's output information.

[0072] In this embodiment, the feature information is extracted from the original information. When the original information includes speech frames, historical text, and historical dialogue text, the features include audio features corresponding to the speech frames and text features corresponding to the target text. It is understood that the target text in this embodiment may include at least one of historical text and historical dialogue text.

[0073] After obtaining the interruption prediction result representing the probability of interruption, it is possible to determine whether the preset interruption condition is met based on the probability represented by at least one interruption prediction result, and then determine whether to interrupt the agent's output response information.

[0074] The preset interruption condition can be related to the number of interruption prediction results. For example, in determining whether to interrupt the agent's output information based on two consecutive interruption prediction results, the preset interruption condition could be that the probabilities represented by the two consecutive interruption prediction results are greater than or equal to a preset interruption probability. In this case, if the probability represented by the first interruption prediction result is greater than or equal to the preset interruption probability, and the probability represented by the second interruption prediction result is also greater than or equal to the preset interruption probability, the preset interruption condition is determined to be met. If the probability represented by either the first or second interruption prediction result is less than the preset interruption probability, the preset interruption condition is determined not to be met. The preset interruption probability can be set according to actual conditions, and this embodiment does not impose any limitations on it.
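The two-consecutive-results condition can be sketched as a check over the most recent probabilities. The function name `all_recent_above` and the `window` parameter are hypothetical; the threshold value is whatever preset interruption probability is chosen.

```python
from typing import Sequence


def all_recent_above(probs: Sequence[float], threshold: float,
                     window: int = 2) -> bool:
    """True if the last `window` interruption probabilities all reach the threshold."""
    if len(probs) < window:
        return False  # not enough prediction results yet
    return all(p >= threshold for p in probs[-window:])
```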

[0075] For example, when determining whether to interrupt an agent's output based on three consecutive interruption prediction results, the preset interruption condition could be that at least two of the probabilities represented by the three consecutive interruption prediction results are greater than or equal to the preset interruption probability. For instance, if the probability of the first interruption prediction result is greater than or equal to the preset interruption probability, the probability of the second interruption prediction result is less than the preset interruption probability, and the probability of the third interruption prediction result is greater than or equal to the preset interruption probability, then the preset interruption condition is satisfied. Conversely, if the probabilities of both the first and second interruption prediction results are less than the preset interruption probability, then the preset interruption condition is directly determined not to be satisfied. In other words, when determining whether to interrupt an agent's output based on three consecutive interruption prediction results, if the probabilities of two consecutive interruption prediction results are both less than the preset interruption probability, then the preset interruption condition is directly determined not to be satisfied.
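The at-least-two-of-three condition, including the early negative decision once two results fall below the preset probability, can be sketched as follows. The function name `k_of_n_interrupt` and its generalization to arbitrary `n` and `k` are assumptions; the paragraph above only describes the case of three results with at least two above the threshold.

```python
from itertools import islice
from typing import Iterable


def k_of_n_interrupt(prob_stream: Iterable[float], threshold: float,
                     n: int = 3, k: int = 2) -> bool:
    """Decide as early as possible whether at least k of n results reach the threshold."""
    hits = 0
    misses = 0
    for p in islice(prob_stream, n):
        if p >= threshold:
            hits += 1
        else:
            misses += 1
        if hits >= k:
            return True    # condition already satisfied
        if misses > n - k:
            return False   # condition can no longer be satisfied
    return False
```

With the defaults, two results below the threshold end the check immediately, so the third result is never consulted.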

[0076] It is worth noting that the number of interruption prediction results that determine whether the preset interruption conditions are met is negatively correlated with immediacy; that is, the more interruption prediction results are relied upon, the lower the immediacy.

[0077] By inferring the probability of interruption and relying on the preset interruption conditions, it is possible to determine whether to interrupt the agent's output information or not.

[0078] Understandably, if it is determined not to interrupt the agent's output of information, the agent maintains its current action; that is, if the agent is currently outputting information, it continues to output information.

[0079] In some possible implementations, the interruption prediction result can directly represent whether or not to interrupt. Continuing the examples above, when determining whether to interrupt the agent's output information based on two consecutive interruption prediction results, the preset interruption condition could be that both consecutive interruption prediction results represent interruption. Likewise, when determining whether to interrupt the agent's output information based on three consecutive interruption prediction results, the preset interruption condition could be that at least two of the three consecutive interruption prediction results represent interruption.

[0080] The following uses original information including voice frames, historical text, and historical dialogue text as examples to further explain and illustrate the embodiments of this disclosure.

[0081] Figure 2 is an architecture diagram of a model for determining text and interruption prediction results according to an embodiment of the present disclosure. Referring to Figure 2, the model may include a speech recognition model consisting of an audio encoding module, a text encoding module, a first prediction module, a second prediction module, a first output module, and a second output module, as well as an interruption prediction module independent of the speech recognition model. Hereinafter, this model is referred to as the target model. Further, the first prediction module includes a first prediction submodule and a second prediction submodule.

[0082] The output space of the speech recognition model can be predefined, encompassing all possible outputs that the model can recognize and generate. This could include, for example, all phonemes, characters, or letters in a language.

[0083] Referring again to Figure 2, the audio encoding module accepts the input speech frame and processes the speech frame to obtain audio features; further, the audio features are input to the first prediction submodule, the second prediction module and the interruption prediction module respectively.

[0084] The text encoding module accepts the concatenated result as input, which is a concatenation of historical text and historical dialogue text. The text encoding module processes the concatenated result to obtain text features; further, the text features are input into the second prediction submodule, the second prediction module, and the interruption prediction module, respectively.

[0085] The first prediction submodule processes its input to obtain a first probability; the second prediction submodule processes its input to obtain a second probability; further, the fusion submodule fuses the first and second probabilities to obtain the probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary. This probability is passed through the first output module to obtain the basic unit to which the phoneme corresponding to the speech frame output by the first output module belongs.

[0086] The second prediction module processes its input to obtain the probability that a speech frame belongs to a blank frame. This probability is then passed to the second output module to determine whether the speech frame output by the second output module belongs to a blank frame.

[0087] As an example, if the basic unit to which the phoneme corresponding to the speech frame belongs can be identified, the speech recognition model outputs the identified basic unit, i.e., the text in the above embodiment includes the basic unit to which the phoneme corresponding to the speech frame belongs. If the basic unit cannot be identified and the current speech frame is identified as a blank frame, the speech recognition model outputs a null value, i.e., the text in the above embodiment is empty. The interruption prediction module processes its input to obtain the interruption prediction result.
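
The per-frame decision described above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: "can be identified" is modeled as the best unit reaching a hypothetical confidence `min_unit_prob`, a blank frame is modeled as `blank_prob` reaching a hypothetical `blank_threshold`, and the case the text leaves open (unit not identified, frame not blank) falls back to the most probable unit.

```python
from typing import Sequence


def decode_frame(unit_probs: Sequence[float],
                 blank_prob: float,
                 dictionary: Sequence[str],
                 min_unit_prob: float = 0.5,
                 blank_threshold: float = 0.5) -> str:
    """Return the text for one speech frame: a basic unit, or "" for a blank."""
    if len(unit_probs) != len(dictionary) or not dictionary:
        raise ValueError("need one probability per basic unit")
    best = max(range(len(unit_probs)), key=lambda i: unit_probs[i])
    # Case 1: the basic unit can be identified, so it is output.
    if unit_probs[best] >= min_unit_prob:
        return dictionary[best]
    # Case 2: no unit identified and the frame is blank, so the text is empty.
    if blank_prob >= blank_threshold:
        return ""
    # Case 3 (assumption, not covered by the disclosure): fall back to the best unit.
    return dictionary[best]


units = ["a", "b", "c"]  # toy preset dictionary
```

For example, `decode_frame([0.4, 0.3, 0.3], 0.9, units)` yields an empty text, whereas `decode_frame([0.8, 0.1, 0.1], 0.9, units)` yields the unit `"a"` because an identified unit takes precedence in this sketch.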

[0088] By using the above method, an interruption prediction module is added alongside the speech recognition model, so that real-time interruption judgment is performed at the granularity of basic units while the speech recognition model generates text in real time.

[0089] Referring again to Figure 2, under the model architecture shown in Figure 2, the training method for the speech recognition model and the interruption prediction module can be as follows: Obtain a sample set, which includes multiple samples and corresponding annotation information for each sample. The samples include sample speech frames, sample historical text, and sample historical dialogue text. The annotation information includes text annotation results and interruption annotation results. Determine the first loss of the sample set under the speech recognition model and the second loss of the sample set under the interruption prediction module. The first loss characterizes the difference between the text output by the speech recognition model for the sample and the text annotation results of the sample, and the second loss characterizes the difference between the interruption prediction results output by the interruption prediction module for the sample and the interruption annotation results of the sample. Construct a loss function based on the first and second losses. Jointly train the speech recognition model and the interruption prediction module based on the loss function.
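
The disclosure does not give the form of the two losses or of the combined loss function. The following is a minimal sketch, assuming a per-frame negative log-likelihood for the first loss (a simplification of a sequence-level recognition loss), a binary cross-entropy for the second loss, and a weighted sum as the loss function; the names `text_loss`, `interruption_loss`, `joint_loss`, and `interrupt_weight` are hypothetical.

```python
import math
from typing import Sequence


def text_loss(unit_probs: Sequence[Sequence[float]],
              targets: Sequence[int],
              eps: float = 1e-12) -> float:
    """First loss: mean negative log-likelihood of the annotated unit per frame."""
    if len(unit_probs) != len(targets) or not targets:
        raise ValueError("need one annotated unit per frame")
    return -sum(math.log(frame[t] + eps) for frame, t in zip(unit_probs, targets)) / len(targets)


def interruption_loss(predicted: Sequence[float],
                      labels: Sequence[int],
                      eps: float = 1e-12) -> float:
    """Second loss: mean binary cross-entropy against the interruption annotation."""
    if len(predicted) != len(labels) or not labels:
        raise ValueError("need one interruption label per prediction")
    total = 0.0
    for p, y in zip(predicted, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(labels)


def joint_loss(first_loss: float, second_loss: float, interrupt_weight: float = 1.0) -> float:
    """Loss function built from both losses (assumed here to be a weighted sum)."""
    return first_loss + interrupt_weight * second_loss
```

Because both terms enter one scalar, a single backward pass would update the shared encoders from both tasks, which is what makes the training joint.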

[0090] The sample speech frames, sample historical texts, and sample historical dialogue texts in the samples correspond one-to-one with the speech frames, historical texts, and historical dialogue texts in the above embodiments. For explanations of the sample speech frames, sample historical texts, and sample historical dialogue texts, refer to the above descriptions of speech frames, historical texts, and historical dialogue texts; they are not repeated here.

[0091] The text annotation results and interruption annotation results can be obtained by manual annotation.

[0092] It should be understood that the interruption prediction module processes the features extracted by the text encoding module and the audio encoding module in the speech recognition model.

[0093] Specifically, the updating of the speech recognition model and the interruption prediction module can be stopped when the loss function constructed from the first loss and the second loss meets a preset stopping condition, or when the number of updates to the speech recognition model and the interruption prediction module reaches a preset number. Speech recognition is then performed using the speech recognition model with the first parameter as it stands when updating stops, and the interruption prediction result is determined by the interruption prediction module with the second parameter as it stands when updating stops. It should be understood that the first parameter is the parameter of the speech recognition model that is updated during training, and the second parameter is the parameter of the interruption prediction module that is updated during training.
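
The two stopping criteria can be sketched as a simple loop. This is an illustration only: the preset stopping condition is assumed here to be the loss falling to or below a threshold, and `update_step`, `loss_threshold`, and `max_updates` are hypothetical names.

```python
from typing import Callable, Tuple


def train_until_stopped(update_step: Callable[[], float],
                        loss_threshold: float,
                        max_updates: int) -> Tuple[int, float]:
    """Run joint updates until the loss condition or the update budget is reached.

    update_step performs one joint parameter update and returns the current loss.
    """
    if max_updates < 1:
        raise ValueError("max_updates must be at least 1")
    updates = 0
    loss = float("inf")
    while updates < max_updates:
        loss = update_step()
        updates += 1
        # Assumed stopping condition: loss at or below a preset threshold.
        if loss <= loss_threshold:
            break
    return updates, loss


# Stand-in for real training: replay a fixed sequence of loss values.
recorded = iter([0.9, 0.5, 0.2, 0.1])
result = train_until_stopped(lambda: next(recorded), loss_threshold=0.25, max_updates=10)
```

In this run the loop stops after the third update because the loss condition is met first; with a smaller `max_updates` the update budget would end training instead.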

[0094] By employing the above method and utilizing joint training, the parameters of the speech recognition model and the interruption prediction module are updated collaboratively, thereby improving the stability of the speech recognition model and the interruption prediction module in a full-duplex dialogue system with interruption functionality.

[0095] Figure 3 is a block diagram illustrating a device for interrupting an intelligent agent using voice according to an embodiment of the present disclosure. Referring to Figure 3, the device 300 for interrupting an intelligent agent using voice may include:

[0096] The acquisition module 301 is used to acquire original information while the intelligent agent is outputting information, wherein the original information includes at least speech frames;

[0097] The feature extraction module 302 is used to extract features from the original information;

[0098] The processing module 303 is used to obtain text and an interruption prediction result based on the features, wherein the text is used at least by the agent to determine the information to be output by the agent;

[0099] The interruption module 304 is used to interrupt the agent's output of information in response to the interruption prediction result indicating that the agent's output of information is to be interrupted.
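
One way the four modules might be wired together is sketched below. The class `VoiceInterruptDevice`, its field names, and the stub callables are hypothetical, and the interruption prediction result is reduced to a boolean for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class VoiceInterruptDevice:
    acquire: Callable[[], Any]                   # acquisition module 301
    extract: Callable[[Any], Any]                # feature extraction module 302
    process: Callable[[Any], Tuple[str, bool]]   # processing module 303
    interrupt: Callable[[], None]                # interruption module 304

    def step(self) -> str:
        """Handle one unit of original information while the agent is outputting."""
        original = self.acquire()
        features = self.extract(original)
        text, should_interrupt = self.process(features)
        if should_interrupt:
            self.interrupt()
        # The text is returned so the agent can decide what to output next.
        return text


# Stub modules standing in for real encoders and predictors.
interruptions: List[str] = []
device = VoiceInterruptDevice(
    acquire=lambda: {"speech_frame": [0.1, 0.2]},
    extract=lambda original: original["speech_frame"],
    process=lambda features: ("stop", max(features) > 0.15),
    interrupt=lambda: interruptions.append("interrupted"),
)
text = device.step()
```

The point of the sketch is that text and the interruption decision come out of the same `process` call, so recognition and interruption judgment proceed in one pass over the features.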

[0100] Optionally, the feature extraction module 302 includes:

[0101] The first feature extraction submodule is used to extract features from the speech frame through the audio encoding module to obtain audio features corresponding to the speech frame.

[0102] The second feature extraction submodule is used to extract features from the target text through the text encoding module to obtain the text features corresponding to the target text. The target text includes at least one of the historical text and the historical dialogue text.

[0103] Optionally, the processing module 303 includes:

[0104] The first processing submodule is used to process the features through the first prediction module to obtain the probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary.

[0105] The second processing submodule is used to process the features through the second prediction module to obtain the probability that the speech frame belongs to a blank frame;

[0106] The third processing submodule is used to obtain text based on the probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary and the probability that the speech frame belongs to a blank frame. The text includes the basic unit to which the phoneme corresponding to the speech frame belongs or the text is empty.

[0107] Optionally, the first prediction module includes a first prediction submodule, a second prediction submodule, and a fusion submodule, wherein the first processing submodule is further configured to:

[0108] The audio features are processed by the first prediction submodule to obtain the first probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary.

[0109] The text features are processed by the second prediction submodule to obtain the second probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary;

[0110] By fusing the first probability and the second probability through the fusion submodule, the probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary is obtained.

[0111] Optionally, the processing module 303 further includes:

[0112] The features are processed by the interruption prediction module to obtain the interruption prediction result, which is used to characterize the probability of determining an interruption.

[0113] The interruption module 304 is further used for:

[0114] Based on the probability represented by at least one of the interruption prediction results, determine whether the preset interruption condition is met;

[0115] If the preset interruption condition is met, the output information of the intelligent agent is interrupted.

[0116] If the preset interruption condition is not met, the agent's output information will not be interrupted.
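
The preset interruption condition is left open by the disclosure. A minimal sketch follows, assuming the condition is that at least `min_hits` of the most recent `window` interruption probabilities reach `threshold`; the class name `InterruptionGate` and all three parameters are hypothetical.

```python
from collections import deque
from typing import Deque


class InterruptionGate:
    """Decide on interruption from one or more recent interruption probabilities."""

    def __init__(self, threshold: float = 0.6, window: int = 5, min_hits: int = 3) -> None:
        if not 1 <= min_hits <= window:
            raise ValueError("min_hits must lie between 1 and window")
        self.threshold = threshold
        self.min_hits = min_hits
        # Only the most recent `window` probabilities are kept.
        self._recent: Deque[float] = deque(maxlen=window)

    def update(self, probability: float) -> bool:
        """Record a new prediction and report whether the condition is met."""
        self._recent.append(probability)
        hits = sum(1 for p in self._recent if p >= self.threshold)
        return hits >= self.min_hits


gate = InterruptionGate(threshold=0.6, window=3, min_hits=2)
decisions = [gate.update(p) for p in [0.2, 0.7, 0.3, 0.9, 0.1, 0.2]]
```

Requiring several high-probability results within a short window trades a little latency for robustness against a single spurious prediction; setting `window=1` and `min_hits=1` reduces the gate to a plain threshold on the latest result.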

[0117] Optionally, the audio encoding module, the text encoding module, the first prediction module, and the second prediction module are modules in a speech recognition model. The device 300 for interrupting the intelligent agent using speech further includes a training module, which is used for:

[0118] Obtain a sample set, which includes multiple samples and annotation information corresponding to each sample. The samples include sample speech frames, sample historical text, and sample historical dialogue text. The annotation information includes text annotation results and interruption annotation results.

[0119] The first loss of the sample set under the speech recognition model and the second loss of the sample set under the interruption prediction module are determined. The first loss is used to characterize the difference between the text output by the speech recognition model for the sample and the text annotation result of the sample. The second loss is used to characterize the difference between the interruption prediction result output by the interruption prediction module for the sample and the interruption annotation result of the sample.

[0120] Construct a loss function based on the first loss and the second loss;

[0121] The speech recognition model and the interruption prediction module are jointly trained based on the loss function.

[0122] For the implementation of each module in the device 300 for interrupting an intelligent agent using voice, reference may be made to the above-mentioned related embodiments; details are not repeated here.

[0123] Based on the same inventive concept, this disclosure also provides a computer-readable medium having a computer program stored thereon, which, when executed by a processing device, implements the steps of the method for interrupting an intelligent agent using voice described above.

[0124] Based on the same inventive concept, this disclosure also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the above-described method for interrupting an intelligent agent using voice.

[0125] Based on the same inventive concept, this disclosure also provides an electronic device, including:

[0126] A storage device on which a computer program is stored;

[0127] A processing device configured to execute the computer program in the storage device to implement the steps of the above-described method for interrupting an intelligent agent using voice.

[0128] Referring now to Figure 4, a schematic diagram of the structure of an electronic device 400 suitable for implementing embodiments of the present disclosure is shown. The terminal devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 4 is merely an example and should not impose any limitation on the functionality and scope of use of embodiments of the present disclosure.

[0129] As shown in Figure 4, the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the electronic device 400. The processing device 401, ROM 402, and RAM 403 are interconnected via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.

[0130] Typically, the following devices can be connected to the I/O interface 405: input devices 406 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 407 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 408 including, for example, magnetic tapes, hard disks, etc.; and communication devices 409. The communication device 409 allows the electronic device 400 to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 4 shows the electronic device 400 with various devices, it should be understood that it is not required to implement or possess all of the devices shown. Alternatively, more or fewer devices may be implemented or provided.

[0131] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication device 409, or installed from storage device 408, or installed from ROM 402. When the computer program is executed by processing device 401, it performs the functions defined in the methods of embodiments of this disclosure.

[0132] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.

[0133] In some implementations, electronic devices can communicate using any currently known or future-developed network protocol, such as HTTP (Hypertext Transfer Protocol), and can be interconnected with digital data communication (e.g., a communication network) of any form or medium. Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.

[0134] The aforementioned computer-readable medium may be included in the aforementioned electronic device, or it may exist independently without being assembled into the electronic device.

[0135] The aforementioned computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: acquire original information, including at least speech frames, while the agent is outputting information; extract features from the original information; obtain text and an interruption prediction result based on the features, the text being used at least by the agent to determine the information to be output by the agent; and interrupt the agent's output of information in response to the interruption prediction result indicating that the agent's output of information is to be interrupted.

[0136] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0137] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0138] The modules described in the embodiments of this disclosure can be implemented in software or hardware. In some cases, the name of a module does not limit the module itself.

[0139] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include, without limitation: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.

[0140] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0141] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept, for example, a technical solution formed by interchanging the above features with technical features having similar functions that are disclosed in (but not limited to) this disclosure.

[0142] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.

[0143] Although the subject matter has been described using language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely illustrative forms of implementing the claims. Regarding the apparatus in the above embodiments, the specific manner in which the various modules perform their operations has been described in detail in the embodiments relating to the method, and will not be elaborated upon here.

Claims

1. A method for interrupting an intelligent agent using voice, comprising: acquiring original information while the intelligent agent is outputting information, the original information including at least speech frames; extracting features from the original information; obtaining text and an interruption prediction result based on the features, the text being used at least by the agent to determine the information to be output by the agent; and, in response to the interruption prediction result indicating that the agent's output of information is to be interrupted, interrupting the agent's output of information.

2. The method according to claim 1, wherein the original information further includes historical text and historical dialogue text output by the agent; the features include audio features and text features; and extracting features from the original information includes: extracting features from the speech frame through an audio encoding module to obtain the audio features corresponding to the speech frame; and extracting features from a target text through a text encoding module to obtain the text features corresponding to the target text, the target text including at least one of the historical text and the historical dialogue text.

3. The method according to claim 2, wherein obtaining text and an interruption prediction result based on the features includes: processing the features through a first prediction module to obtain the probability that the phoneme corresponding to the speech frame belongs to each basic unit in a preset dictionary; processing the features through a second prediction module to obtain the probability that the speech frame belongs to a blank frame; and obtaining the text based on the probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary and the probability that the speech frame belongs to a blank frame, wherein the text includes the basic unit to which the phoneme corresponding to the speech frame belongs, or the text is empty.

4. The method according to claim 3, wherein the first prediction module includes a first prediction submodule, a second prediction submodule, and a fusion submodule, and processing the features through the first prediction module to obtain the probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary includes: processing the audio features through the first prediction submodule to obtain a first probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary; processing the text features through the second prediction submodule to obtain a second probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary; and fusing the first probability and the second probability through the fusion submodule to obtain the probability that the phoneme corresponding to the speech frame belongs to each basic unit in the preset dictionary.

5. The method according to any one of claims 1-4, wherein obtaining the text and the interruption prediction result based on the features further includes: processing the features through an interruption prediction module to obtain the interruption prediction result, the interruption prediction result being used to characterize the probability of determining an interruption; and wherein, in response to the interruption prediction result indicating that the agent's output of information is to be interrupted, interrupting the agent's output of information includes: determining, based on the probability represented by at least one interruption prediction result, whether a preset interruption condition is met; if the preset interruption condition is met, interrupting the agent's output of information; and if the preset interruption condition is not met, not interrupting the agent's output of information.

6. The method according to claim 5, wherein the audio encoding module, the text encoding module, the first prediction module, and the second prediction module are modules in a speech recognition model, and the method further includes: obtaining a sample set, which includes multiple samples and annotation information corresponding to each sample, the samples including sample speech frames, sample historical text, and sample historical dialogue text, and the annotation information including text annotation results and interruption annotation results; determining a first loss of the sample set under the speech recognition model and a second loss of the sample set under the interruption prediction module, the first loss characterizing the difference between the text output by the speech recognition model for a sample and the text annotation result of the sample, and the second loss characterizing the difference between the interruption prediction result output by the interruption prediction module for the sample and the interruption annotation result of the sample; constructing a loss function based on the first loss and the second loss; and jointly training the speech recognition model and the interruption prediction module based on the loss function.

7. A device for interrupting an intelligent agent using voice, comprising: an acquisition module configured to acquire original information while the intelligent agent is outputting information, the original information including at least speech frames; a feature extraction module configured to extract features from the original information; a processing module configured to obtain text and an interruption prediction result based on the features, wherein the text is used at least by the agent to determine the information to be output by the agent; and an interruption module configured to interrupt the agent's output of information in response to the interruption prediction result indicating that the agent's output of information is to be interrupted.

8. A computer-readable medium having a computer program stored thereon, wherein, when the computer program is executed by a processing device, it implements the steps of the method according to any one of claims 1-6.

9. An electronic device, comprising: a storage device on which a computer program is stored; and a processing device for executing the computer program in the storage device to implement the steps of the method according to any one of claims 1-6.

10. A computer program product comprising a computer program, wherein, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1-6.