Voice-based action generation method and device, electronic equipment and storage medium

By recognizing the action intent and prosodic features in the target speech, a sequence of body movements that matches the target speech is generated, solving the problem of monotonous body movements in existing technologies and achieving more natural and coordinated body movement generation.

CN115762574B · Active · Publication Date: 2026-06-26 · IFLYTEK CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: IFLYTEK CO LTD
Filing Date: 2022-11-16
Publication Date: 2026-06-26


Abstract

The application provides a speech-based action generation method and device, electronic equipment and a storage medium. The method comprises the following steps: determining an action intention contained in a target speech, and determining a first action sequence matched with the action intention; extracting a speech rhythm feature from the target speech, and predicting a second action sequence matched with the speech rhythm feature based on the speech rhythm feature; and performing fusion processing on the first action sequence and the second action sequence to generate an action sequence matched with the target speech. The above scheme generates an action sequence matched with the target speech through multi-dimensional information, so that the generated action sequence matched with the target speech is more accurate, and is more natural and coordinated.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a method, apparatus, electronic device and storage medium for voice-based action generation.

Background Technology

[0002] A more natural way of human-computer interaction has always been a goal pursued by industry and academia. In recent years, multimodal interaction with virtual humans has been regarded as the next generation of interaction in the 5G+AI era and has received increasing attention. In the daily communication between people, body language plays a very important role in conveying information such as emphasis, attitude, and semantics more effectively. Therefore, how to synthesize more natural body language based on speech in human-computer interaction has also received continuous attention in recent years.

[0003] Conventional speech-based gesture synthesis methods involve pre-building a gesture library and then retrieving matching gestures from the library based on the machine's speech during human-computer interaction to synthesize gestures. However, this approach produces monotonous, stiff, and unnatural gestures.

Summary of the Invention

[0004] In view of the above state of the art, this application proposes a speech-based action generation method, apparatus, electronic device, and storage medium, which can generate more natural and coordinated body movements that match speech.

[0005] To achieve the above-mentioned technical objectives, this application proposes the following technical solution:

[0006] A speech-based action generation method includes:

[0007] Determine the action intent contained in the target speech, and determine a first action sequence that matches the action intent;

[0008] Furthermore, speech prosodic features are extracted from the target speech, and a second action sequence matching the speech prosodic features is predicted based on the speech prosodic features;

[0009] The first action sequence and the second action sequence are fused together to generate an action sequence that matches the target speech.

[0010] Optionally, determining the action intent contained in the target speech includes:

[0011] The target speech is subjected to action intent classification processing based on preset action intent category labels to determine the action intent contained in the target speech.

[0012] Optionally, the step of performing action intent classification processing on the target speech based on preset action intent category labels to determine the action intent contained in the target speech includes:

[0013] For each preset action intent category label, the target speech is subjected to binary classification processing;

[0014] Based on the binary classification results corresponding to each preset action intention category label, the action intention contained in the target speech is determined;

[0015] The binary classification process includes determining whether the target speech contains the action intent corresponding to the action intent category label.

[0016] Optionally, determining the first action sequence that matches the action intention includes:

[0017] From a pre-built set of action codebooks, select action codebooks that match the action intent to generate a first action sequence;

[0018] The action codebooks in the action codebook set are used to combine to obtain arbitrary action sequences.

[0019] Optionally, the action codebook set is obtained through the following processing:

[0020] The acquired action sequence is split to obtain the action sequence unit corresponding to each action sequence. The acquired action sequence includes the sequence of actions in the semantic action library and the continuous action sequence of the target object in the speaking state.

[0021] By encoding the action sequence units corresponding to each action sequence and using the encoding results to recover the action sequence, the action code corresponding to each action sequence unit is determined.

[0022] The action codes corresponding to all action sequence units are deduplicated, and the deduplicated action codes are used as action codebooks to form an action codebook set.

[0023] Optionally, prosodic features are extracted from the target speech, including:

[0024] The target speech is input into a pre-trained prosodic feature extraction model to obtain the speech prosodic features of the target speech;

[0025] The prosodic feature extraction model is trained through a first training method and / or a second training method. The first training method is used to train the prosodic feature extraction model to extract speech prosodic features from the input speech, and the second training method is used to train the prosodic feature extraction model to filter out voiceprint features and text features when extracting speech prosodic features from the input speech.

[0026] Optionally, the training process of the prosodic feature extraction model includes:

[0027] Input the sample spectrogram into the prosodic feature extraction model to obtain the prosodic features extracted by the prosodic feature extraction model from the sample spectrogram;

[0028] The reconstructed spectrogram is obtained by using the prosodic features, as well as the voiceprint and text features of the sample spectrogram.

[0029] Based on the sample spectrogram and the reconstructed spectrogram, the spectrogram reconstruction loss is calculated;

[0030] The prosodic features are used to reconstruct voiceprint features and text features, resulting in reconstructed voiceprint features and reconstructed text features.

[0031] Based on the voiceprint features and the reconstructed voiceprint features, a first adversarial loss is calculated, and based on the text features and the reconstructed text features, a second adversarial loss is calculated.

[0032] The operational parameters of the prosodic feature extraction model are corrected with the goal of the spectrogram reconstruction loss being less than a preset first loss threshold and the first adversarial loss and the second adversarial loss being greater than a preset second loss threshold.
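The training goal stated in [0032] can be pictured as a simple check on the three losses. The following is a minimal sketch only, not the implementation of this application: the function name `training_goal_met`, its parameter names, and any threshold values passed to it are hypothetical, and how the losses themselves are computed is left out.

```python
def training_goal_met(recon_loss, adv_loss_voiceprint, adv_loss_text,
                      first_threshold, second_threshold):
    """Return True when the goal described in [0032] is reached.

    Hypothetical helper: the spectrogram reconstruction loss must be below
    the first threshold, and both adversarial losses must be above the
    second threshold.
    """
    reconstruction_ok = recon_loss < first_threshold
    adversarial_ok = (adv_loss_voiceprint > second_threshold
                      and adv_loss_text > second_threshold)
    return reconstruction_ok and adversarial_ok
```

A training loop could call such a check after each parameter update and continue correcting the model parameters until it returns True. Read this way, a large adversarial loss indicates that voiceprint and text features are hard to recover from the prosodic features, which is one interpretation of the filtering effect described in [0025].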

[0033] Optionally, predicting a second action sequence that matches the speech prosodic features based on the speech prosodic features includes:

[0034] The speech prosody features are input into a pre-trained action sequence prediction model to obtain an action codebook sequence that matches the speech prosody features, and the obtained action codebook sequence is used as a second action sequence.

[0035] The action sequence prediction model is used to predict the action codebook contained in the actions that match the speech prosodic features, and to use the predicted action codebook to form an action codebook sequence. The action codebook is the action code corresponding to the action sequence unit used to form a continuous action sequence.

[0036] Optionally, the method further includes determining the location range of the action intention in the target speech;

[0037] The first action sequence and the second action sequence are fused to generate an action sequence that matches the target speech, including:

[0038] The action sequence in the second action sequence that is located in a first position interval, namely the interval corresponding to the location range of the action intention in the target speech, is replaced with the first action sequence, and the second action sequence after replacement is decoded to obtain an action sequence that matches the target speech.

[0039] A voice-based action generation device, comprising:

[0040] The first action prediction unit is used to determine the action intention contained in the target speech and to determine a first action sequence that matches the action intention;

[0041] The second action prediction unit is used to extract speech prosodic features from the target speech and predict a second action sequence that matches the speech prosodic features based on the speech prosodic features.

[0042] An action synthesis unit is used to fuse the first action sequence and the second action sequence to generate an action sequence that matches the target speech.

[0043] An electronic device, comprising:

[0044] Memory and processor;

[0045] The memory is connected to the processor and is used to store programs;

[0046] The processor is used to implement the above-described voice-based action generation method by running the program in the memory.

[0047] A storage medium storing a computer program, which, when executed by a processor, implements the above-described voice-based action generation method.

[0048] The speech-based action generation method proposed in this application can determine the action intention contained in the target speech, identify a first action sequence matching the action intention, and extract speech prosodic features from the target speech and predict a second action sequence matching the speech prosodic features based on these features. The above processing obtains the first and second action sequences matching the target speech through different dimensions. Based on this, the first and second action sequences are fused to obtain the action sequence matching the target speech. This scheme generates an action sequence matching the target speech through multi-dimensional information, thus making the generated action sequence matching the target speech more accurate.

[0049] On the other hand, the above scheme takes into account the prosodic features of the target speech when generating the action sequence corresponding to the target speech, so that the generated action sequence matches the prosody of the target speech. This makes the actions represented by the generated action sequence more realistic and natural. When the target speech is output, the virtual avatar is driven by the action sequence that matches the target speech, which makes the voice and body movements of the virtual avatar more natural and coordinated.

Description of the Drawings

[0050] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0051] Figure 1 is a flowchart of a speech-based action generation method provided in an embodiment of this application;

[0052] Figure 2 is a schematic diagram of the structure of the multi-action intent recognition model provided in an embodiment of this application;

[0053] Figure 3 is a schematic diagram of the structure of a vector quantization variational autoencoder network provided in an embodiment of this application;

[0054] Figure 4 is a schematic diagram of the training process of the prosodic feature extraction model provided in an embodiment of this application;

[0055] Figure 5 is a schematic diagram of the structure of a voice-based action generation device provided in an embodiment of this application;

[0056] Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0057] The technical solution of this application is applicable to application scenarios where body movements are generated based on speech and matched with the speech. By employing the technical solution of this application, more natural body movements that match the speech can be generated. Specifically, the technical solution of this application can be applied to human-computer interaction scenarios, generating body movements that match the output speech when the virtual avatar outputs speech. These body movements can drive the virtual avatar to perform body movements that match the output speech, thereby reproducing human-to-human communication scenarios with body movements in human-computer interaction or machine interaction scenarios.

[0058] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0059] Exemplary methods

[0060] This application first proposes a speech-based action generation method. As shown in Figure 1, the method includes:

[0061] S101. Determine the action intent contained in the target speech, and determine the first action sequence that matches the action intent.

[0062] Specifically, the target speech mentioned above refers to speech collected from human-computer interaction or machine interaction scenarios, such as speech output or to be output by a device. This target speech can be any language, any content, and any length.

[0063] The aforementioned action intent refers to the intention to perform a certain action as expressed in the target speech. For example, if the target speech contains phrases such as "I'll hit you!" or "I'm walking," then it is clear that the speech contains the action intent of "hitting" or "walking."

[0064] By performing action intent recognition processing on the target speech, the action intent contained within it can be determined. For example, by training an action intent recognition model with a large number of speech samples containing action intent, it can be used to identify the action intent contained in the target speech.

[0065] After determining the action intention contained in the target speech, a first action sequence matching that action intention is identified. This first action sequence is the sequence of limb movements that match the action intention.

[0066] For example, assuming the target speech contains the action intention of "walking", then the first action sequence that matches this action intention is the action sequence of the action of "walking".

[0067] As an example implementation, an action sequence library can be pre-built, containing action sequences corresponding to various body movements. When an action intention is recognized from the target speech, the action sequence library is retrieved to find the action sequence that matches the action intention, thus obtaining the first action sequence that matches the action intention.

[0068] And, S102, extracting speech prosodic features from the target speech, and predicting a second action sequence that matches the speech prosodic features based on the speech prosodic features.

[0069] The aforementioned prosodic features refer to the characteristics contained in the target speech that reflect the rhythm of pronunciation. To a certain extent, prosodic features reflect the emotions and attitudes of the speaker. In human-to-human interaction scenarios, a speaker's body movements change with the rhythm of the spoken words. That is, there is a matching relationship between the prosodic features of the speech and the corresponding body movements.

[0070] For example, suppose the target speech is "Let's cheer each other on!" When a user says this with enthusiasm, they typically raise their voice and emphasize the words "cheer each other on" at the end, accompanied by an arm-raising gesture. Such coupling of vocal delivery and body language is very common in everyday human-to-human interaction, and similar scenarios are countless. These natural scenarios demonstrate an intrinsic connection between speech prosody and the body movements that accompany speech.

[0071] Therefore, in order to make human-computer interaction as natural as human-to-human interaction, embodiments of this application extract speech prosodic features from the target speech and predict a second action sequence that matches the extracted speech prosodic features based on the extracted speech prosodic features.

[0072] For example, a speech prosody feature extraction model can be pre-trained, and the model can be used to extract speech prosody features from the target speech. Additionally, sample data of speech prosody features and corresponding action sequences can be pre-collected, and an action prediction model based on the speech prosody features can be trained. This model can then be used to predict the limb movements corresponding to the extracted speech prosody features, and the action sequence corresponding to the limb movements can be determined, thus obtaining a second action sequence that matches the speech prosody features.

[0073] The aforementioned body movements include, but are not limited to, movements of one or more body parts such as the hands, arms, head, torso, legs, feet, and face.

[0074] The execution order of steps S101 and S102 described above can be flexibly adjusted and is not limited to the order shown in Figure 1, in which S101 is executed first and then S102. Alternatively, S102 can be executed first and then S101, or S101 and S102 can be executed simultaneously.

[0075] S103. The first action sequence and the second action sequence are fused to generate an action sequence that matches the target speech.

[0076] Specifically, the above steps S101 and S102 identify the action intention from the target speech and extract the speech prosody features from it. They also determine the corresponding first action sequence and second action sequence based on the action intention and speech prosody features, respectively. In other words, they determine the action sequence that matches the target speech from different dimensions.

[0077] Based on this, the embodiments of this application perform fusion processing on the first action sequence and the second action sequence to generate an action sequence that matches the target speech.

[0078] That is, by fusing the first action sequence determined by the action intention with the second action sequence determined by the speech prosody features, the prediction of the action sequence matching the target speech from different dimensions can be realized, which can make the predicted action sequence matching the target speech more accurate.

[0079] For example, an action sequence fusion model can be pre-trained to achieve the fusion processing of the first action sequence and the second action sequence. The fused action sequence is the action sequence that matches the target speech.

[0080] As described above, the speech-based action generation method proposed in this application can determine the action intent contained in the target speech, determine a first action sequence matching the action intent, and extract speech prosodic features from the target speech and predict a second action sequence matching the speech prosodic features based on the speech prosodic features. The above processing obtains the first and second action sequences matching the target speech through different dimensions. Based on this, the first and second action sequences are fused to obtain the action sequence matching the target speech. The above scheme generates an action sequence matching the target speech through multi-dimensional information, thereby making the generated action sequence matching the target speech more accurate.

[0081] On the other hand, the above scheme takes into account the prosodic features of the target speech when generating the action sequence corresponding to the target speech, so that the generated action sequence matches the prosody of the target speech. This makes the actions represented by the generated action sequence more realistic and natural. When the target speech is output, the virtual avatar is driven by the action sequence that matches the target speech, which makes the voice and body movements of the virtual avatar more natural and coordinated.

[0082] As an optional implementation, this application embodiment determines the action intent contained in the target speech by performing action intent classification processing based on preset action intent category labels on the target speech.

[0083] The aforementioned preset action intention category labels are determined by using various pre-collected action intentions as category labels. For example, assuming known action intentions include greeting, pointing, and praising, each of these known action intentions can be used as a category label, resulting in multiple different action intention category labels.

[0084] Based on the aforementioned preset action intent category label settings, the target speech can be classified into action intents. For example, based on the feature vector of the target speech, its matching degree with each action intent category label is calculated, and one or more action intent category labels with the highest matching degree are selected to obtain the action intent contained in the target speech.
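The matching-degree selection in [0084] can be sketched as follows. This is an illustration under stated assumptions: the application does not say how the matching degree is computed, so cosine similarity is used here as a stand-in, and the per-label vectors, the function names `cosine` and `top_intents`, and the parameter `k` are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity, used here as an assumed 'matching degree'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_intents(speech_vec, label_vecs, k=1):
    """Return the k label names whose vectors best match the speech vector."""
    ranked = sorted(label_vecs,
                    key=lambda name: cosine(speech_vec, label_vecs[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical 2-dimensional vectors for the labels named in [0083].
label_vecs = {
    "greeting": [1.0, 0.0],
    "pointing": [0.0, 1.0],
    "praising": [0.7, 0.7],
}
best = top_intents([0.1, 0.9], label_vecs, k=2)
```

Choosing `k` greater than 1 corresponds to selecting several action intent category labels with the highest matching degree.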

[0085] As a preferred implementation, the embodiments of this application perform binary classification processing on the target speech for each preset action intention category label; then, based on the binary classification results for each preset action intention category label, the action intention contained in the target speech is determined.

[0086] Specifically, for each preset action intention category label, the target speech is subjected to binary classification, that is, it is determined whether the target speech contains the action intention corresponding to that label. For example, if the classification result for a certain preset action intention category label is 0, the target speech does not contain the action intention corresponding to that label; if the classification result is 1, the target speech contains the action intention corresponding to that label.
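The per-label binary decision can be sketched in a few lines. This is an illustrative sketch: the 0/1 results follow [0086], while the probability inputs, the 0.5 threshold, and the function names `binarize` and `intents_from_binary_results` are assumptions introduced here.

```python
def binarize(probabilities, threshold=0.5):
    """Turn per-label probabilities into 0/1 results (threshold is assumed)."""
    return {label: 1 if p >= threshold else 0
            for label, p in probabilities.items()}

def intents_from_binary_results(binary_results):
    """Collect every label whose binary classification result is 1."""
    return [label for label, flag in binary_results.items() if flag == 1]

# Hypothetical classifier outputs for one target speech.
results = binarize({"greeting": 0.1, "pointing": 0.8, "praising": 0.6})
intents = intents_from_binary_results(results)
```

Because each label is decided independently, one target speech can yield zero, one, or several action intentions.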

[0087] For example, embodiments of this application pre-train a multi-action intent recognition model for recognizing action intent from speech.

[0088] First, sentence-level action intent annotation is performed on the collected scene text corpus in human-to-human or human-computer interaction scenarios to obtain training data for the multi-action intent recognition model.

[0089] Secondly, this application embodiment constructs a multi-action intent recognition model based on an attention mechanism. The backbone of this multi-action intent recognition model is based on a Transformer structure. The input is the words w1, w2, ..., wL obtained after text segmentation. Since the same sentence may contain multiple action intents, the model sets an independent classification head for each action intent. In each head, the features of the words are weighted through an attention mechanism, and a binary classification is then performed to determine whether the corresponding action intent exists. The specific model structure is shown in Figure 2. Finally, the multi-action intent recognition model is trained on the above training data for sentence-level action intent recognition, which completes the training of the recognition model.
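One classification head of the kind described in [0089] can be sketched as attention pooling followed by a logistic output. This is a generic illustration, not the structure in Figure 2: the Transformer backbone is omitted, the word features are taken as given, and the query vector, classifier weights, bias, and the function name `intent_head` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def intent_head(word_feats, query, clf_weights, clf_bias):
    """Attention-pool word features, then apply a binary (sigmoid) classifier.

    Returns (probability that the intent is present, attention weights).
    """
    weights = softmax([dot(query, f) for f in word_feats])
    dim = len(word_feats[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, word_feats))
              for d in range(dim)]
    logit = dot(clf_weights, pooled) + clf_bias
    return 1.0 / (1.0 + math.exp(-logit)), weights

# Hypothetical features for three words; the second word dominates.
feats = [[0.0, 0.0], [4.0, 0.0], [0.0, 0.0]]
prob, attn = intent_head(feats, query=[1.0, 0.0],
                         clf_weights=[1.0, 0.0], clf_bias=-1.0)
```

A model with several action intents would hold one such head per intent, each with its own parameters, all sharing the word features produced by the backbone.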

[0090] During the usage phase, the text corresponding to the target speech is input into the multi-action intent recognition model described above to achieve the recognition of multi-action intents.

[0091] Furthermore, since the attention mechanism precisely reflects the importance of each word in the text to the corresponding action intention, by selecting the word with the highest weight for identifying the corresponding action intention, the position range of the corresponding action intention in the target speech can be determined, thus realizing the localization of the action intention in the target speech.

[0092] For example, suppose the target speech is "I know the place you're talking about, just keep walking in this direction and you'll get there." By performing multi-action intent recognition on the text of this target speech, it can be determined that it contains the action intent of "guiding." Furthermore, the multi-action intent recognition model described above shows that the words "this direction" carry the greatest weight for recognizing the action intent of "guiding." Therefore, the position range of the action intent of "guiding" in the target speech is the position range of the words "this direction" in the sentence.
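Locating the intent by its highest attention weight can be sketched as an argmax over the segmented words. The segmentation and the weight values below are invented for illustration, and the function name `locate_intent` is hypothetical.

```python
def locate_intent(words, attention_weights):
    """Return (index, word) of the word with the highest attention weight."""
    if len(words) != len(attention_weights):
        raise ValueError("one weight is required per word")
    idx = max(range(len(words)), key=lambda i: attention_weights[i])
    return idx, words[idx]

# Hypothetical segmentation and weights for the example sentence.
words = ["I", "know", "the place", "you're talking about",
         "just keep walking", "in", "this direction",
         "and", "you'll get there"]
weights = [0.02, 0.03, 0.08, 0.05, 0.15, 0.04, 0.55, 0.02, 0.06]
position = locate_intent(words, weights)
```

The returned index identifies where the "guiding" intent sits in the text; mapping that position to a time span in the audio is a separate step not covered by this sketch.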

[0093] In implementing the technical solutions of this application, other methods can also be used to determine the location range of the action intent contained in the target speech. For example, the action intent can be classified or identified first to determine the action intent contained therein. Then, the text of the target speech is segmented, and the matching degree between each segment and the action intent contained in the target speech is calculated. The position of the segment with the highest matching degree with the action intent in the target speech is determined as the location range of the action intent contained in the target speech.

[0094] Once the position range of the action intent contained in the target speech has been determined, the embodiment of this application fuses the first action sequence and the second action sequence as follows: the action sequence in the second action sequence that is located in the corresponding position range is replaced with the first action sequence, and the second action sequence after replacement is decoded to obtain an action sequence that matches the target speech.

[0095] Specifically, since the prosodic features of the target speech are present throughout the entire target speech, the second action sequence determined based on these prosodic features matches the length of the target speech. However, the action intention in the target speech is only the action intention expressed by one or a few words within the target speech, and the length of the first action sequence matching this action intention is shorter than the length of the target speech.

[0096] Therefore, after determining a first action sequence matching the action intent contained in the target speech and a second action sequence matching the prosodic features of the target speech, the interval of the second action sequence that corresponds to the position range of the action intent in the target speech is determined; this interval is referred to as the first position interval. Then, the action sequence in this first position interval is replaced with the aforementioned first action sequence, and the second action sequence after replacement is decoded to obtain the action sequence matching the target speech.

[0097] Continuing the above example, suppose the target speech is "I know the place you're talking about, just keep walking in this direction and you'll get there." According to the technical solution of this application, the intention of "guiding" is identified from the target speech, and the first action sequence corresponding to this intention is determined as the action sequence of "guiding." Simultaneously, prosodic features are extracted from the target speech, and the corresponding second action sequence is determined based on the extracted prosodic features.

[0098] Then, it can be determined that the position range of the action intention of "guiding" in the target speech is the position range of "this direction" in the sentence. Based on this position range, the corresponding position interval is determined in the second action sequence, for example, the interval of the 15th to 18th action sequence frames.

[0099] Finally, the 15th to 18th action sequence frames in the second action sequence are replaced with the action sequence of the "guiding" action, and the second action sequence after replacement is decoded to obtain the action sequence that matches the target speech.
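The replacement step in this example can be sketched as a slice substitution. This is a minimal sketch under assumptions: frames are numbered from 1 as in the text, the first action sequence is assumed to already have exactly as many entries as the interval it replaces (the application does not say how a length mismatch is handled), and the decoding step is omitted. The function name `fuse_by_replacement` is hypothetical.

```python
def fuse_by_replacement(second_seq, first_seq, start, end):
    """Replace frames start..end (1-based, inclusive) with first_seq."""
    span = end - start + 1
    if start < 1 or end > len(second_seq) or span < 1:
        raise ValueError("interval is outside the second action sequence")
    if len(first_seq) != span:
        # Assumption of this sketch: lengths must already agree.
        raise ValueError("first_seq must have exactly as many frames as the interval")
    return second_seq[:start - 1] + list(first_seq) + second_seq[end:]

# Hypothetical 20-frame prosody-driven sequence and 4-frame "guiding" sequence.
second = [f"p{i}" for i in range(1, 21)]
first = ["g1", "g2", "g3", "g4"]
fused = fuse_by_replacement(second, first, 15, 18)
```

The fused sequence keeps the prosody-driven frames outside the interval and carries the intent-driven frames inside it, so its total length is unchanged.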

[0100] For example, the length of the prosodic features extracted from the target speech, or the length of the second action sequence, can be set to the same length as the target speech. This ensures that the final generated action sequence matching the target speech has the same length as the target speech. In this case, while playing the target speech, using the action sequence matching the target speech to drive the virtual character to perform actions can synchronize the actions performed by the virtual character with the target speech, making its actions more natural and coordinated.
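When the action sequence has the same duration as the target speech, a time span in the audio maps directly to a frame interval. The sketch below shows one way to do that arithmetic. The frame rate of 10 frames per second, the millisecond timestamps, and the function names `total_frames` and `frame_interval` are assumptions chosen so that the numbers line up with the 15th to 18th frames used in the example above; the application does not specify them.

```python
def total_frames(duration_ms, fps):
    """Number of action frames needed to cover the speech (rounded up)."""
    return -((-duration_ms * fps) // 1000)

def frame_interval(start_ms, end_ms, fps):
    """1-based inclusive frame interval covering the span [start_ms, end_ms)."""
    first = (start_ms * fps) // 1000 + 1
    last = max(first, -((-end_ms * fps) // 1000))
    return first, last

# Hypothetical timing: "this direction" spoken from 1.4 s to 1.8 s of a 2 s clip.
interval = frame_interval(1400, 1800, fps=10)
frames = total_frames(2000, fps=10)
```

Integer millisecond arithmetic is used to avoid floating-point rounding at frame boundaries.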

[0101] For example, after generating the action sequence corresponding to the sentence "I know the place you're talking about, just keep walking in this direction and you'll get there" according to the above processing, the virtual character can be driven by that action sequence while the sentence is played. This allows the virtual character to perform the "guiding" body movement when the voice plays "this direction," making the whole process more natural, coordinated, and closer to a human-to-human interaction scenario.

[0102] As an optional implementation, once the action intent contained in the target speech is determined, the action codebook that matches the action intent is selected from the pre-built action codebook set to generate the first action sequence.

[0103] The action codebooks in the aforementioned action codebook set are used to combine to obtain any action sequence.

[0104] Specifically, the aforementioned action codebooks are the basic action units that constitute an action sequence. Any action sequence can be obtained by combining one or more action codebooks from the action codebook set.
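The idea that an action sequence is a combination of codebook entries can be sketched as a lookup and concatenation. For simplicity, each hypothetical entry below is stored directly as the frames it stands for; in the scheme of this application the entries are action codes that a decoder turns back into frames. The names `compose` and `codebook_set` and the frame labels are invented for illustration.

```python
def compose(codebook_set, indices):
    """Concatenate the action units selected by a list of codebook indices."""
    sequence = []
    for index in indices:
        sequence.extend(codebook_set[index])
    return sequence

# Hypothetical codebook set with three two-frame action units.
codebook_set = {
    0: ["rest", "rest"],
    1: ["raise_arm", "hold_arm"],
    2: ["lower_arm", "rest"],
}
sequence = compose(codebook_set, [0, 1, 2])
```

With this representation, generating an action sequence reduces to choosing a sequence of indices, which is a much smaller search space than predicting raw frames.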

[0105] In order to reduce the data space for generating or synthesizing action sequences and improve the efficiency of action generation, this application embodiment decomposes various limb actions into action units, thereby deriving a set of action units that can be combined to obtain any limb action. These action units serve as action codebooks, and together they constitute the action codebook set.

[0106] Specifically, the above-mentioned action codebook set can be obtained through the following processing:

[0107] First, the acquired action sequence is split to obtain the action sequence unit corresponding to each action sequence.

[0108] The acquired action sequences include sequences of actions from the semantic action library, as well as continuous action sequences of the target object in the speaking state.

[0109] The target object mentioned above can be any natural person or intelligent robot. The continuous action sequence of the target object in a speaking state represents the continuous limb movement sequence of the target object when speaking, collected in a natural scene.

[0110] The obtained action sequence is split into action sequence units, for example, each L-frame action sequence is split into one action sequence unit.
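The splitting step can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the function name `split_into_units`, the flat list-of-floats frame layout, and the policy of dropping a trailing unit shorter than L frames are all assumptions, since the source only states that every L frames form one action sequence unit.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one pose frame; the flat-float layout is an assumption


def split_into_units(frames: Sequence[Frame], unit_len: int,
                     drop_remainder: bool = True) -> List[List[Frame]]:
    """Cut a motion clip into consecutive, non-overlapping units of unit_len frames."""
    if unit_len <= 0:
        raise ValueError("unit_len must be positive")
    units = [list(frames[i:i + unit_len]) for i in range(0, len(frames), unit_len)]
    # A trailing unit shorter than unit_len is dropped by default (assumed policy).
    if drop_remainder and units and len(units[-1]) < unit_len:
        units.pop()
    return units


# Toy clip: 10 frames with 2 values each, split with L = 4.
clip = [[float(t), 0.5 * t] for t in range(10)]
units = split_into_units(clip, unit_len=4)
print(len(units), [len(u) for u in units])
```

Padding the last unit instead of dropping it would be an equally plausible choice; the source does not say which is used.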

[0111] Then, by encoding the action sequence units corresponding to each action sequence and using the encoding results to recover the action sequence, the action code corresponding to each action sequence unit is determined.

[0112] Specifically, the action sequence units obtained through the above processing are used as training samples to train the vector quantization variational autoencoder network for action sequence encoding and decoding.

[0113] As shown in Figure 3, the action sequence, composed of these action sequence units, is input to the vector quantization variational autoencoder network. The network encodes the action sequence unit by unit to obtain the action code corresponding to each action sequence unit, and the input action sequence is then reconstructed from these action codes. In this way, different actions can be generated using the action codes of the action sequence units.

[0114] In the embodiments of this application, the encoder and decoder of the vector quantization variational autoencoder network are both constructed based on the convolutional neural network (CNN) structure. Other network structures such as Transformer and graph convolution can also be used.

[0115] After the above training process, the vector quantization variational autoencoder network can determine the action codes of each action sequence unit obtained from the acquired action sequence. Based on the action codes of these action units, the action sequence can be reconstructed.
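The source does not spell out how an encoded unit is mapped to an action code. In a standard vector quantization variational autoencoder this is a nearest-neighbour lookup in the codebook, which the sketch below illustrates. The toy codebook, the two-dimensional latents, and the names `nearest_code`, `quantize`, and `dequantize` are assumptions made for illustration; the encoder and decoder networks themselves are omitted.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(latent: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to latent (squared L2)."""
    if not codebook:
        raise ValueError("codebook must not be empty")
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(latent, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each encoder output (one per action sequence unit) to an action code."""
    return [nearest_code(z, codebook) for z in latents]


def dequantize(codes: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vectors that a decoder would reconstruct from."""
    return [list(codebook[c]) for c in codes]


# Toy values standing in for trained encoder outputs and a learned codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
encoder_outputs = [[0.1, -0.2], [0.9, 1.2], [3.5, 4.1]]
codes = quantize(encoder_outputs, codebook)
decoder_inputs = dequantize(codes, codebook)
print(codes)
```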

[0116] Finally, the action codes corresponding to all action sequence units are deduplicated, and the deduplicated action codes are used as action codebooks to form an action codebook set.

[0117] Specifically, in the above process, by splitting different action sequences into action sequence units, encoding the action sequence units, and reconstructing the action sequences based on the encoding results of the action sequence units, the action codes of the action sequence units contained in each action sequence are determined.

[0118] Different action sequences may contain the same action sequence units. Therefore, in this embodiment, duplicate action sequence unit action codes are removed from the action codes of the action sequence units contained in each action sequence, and the remaining action codes corresponding to each action sequence unit constitute an action codebook set. That is, the action code corresponding to each action unit is a single action codebook, and all action codebooks constitute an action codebook set.
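The deduplication described above can be sketched as an order-preserving pass over the action codes of all sequences. The function name `build_codebook_set` and the representation of an action code as a tuple are assumptions for illustration.

```python
from typing import Hashable, List, Sequence


def build_codebook_set(codes_per_sequence: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Collect the distinct action codes across all sequences, in first-seen order."""
    seen = set()
    codebook_set: List[Hashable] = []
    for codes in codes_per_sequence:
        for code in codes:
            if code not in seen:
                seen.add(code)
                codebook_set.append(code)
    return codebook_set


# Two toy action sequences that share one unit, (3, 4).
sequence_a = [(1, 2), (3, 4), (1, 2)]
sequence_b = [(3, 4), (5, 6)]
codebook_set = build_codebook_set([sequence_a, sequence_b])
print(codebook_set)
```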

[0119] Furthermore, based on the aforementioned processing using the vector quantization variational autoencoder network, it is possible to determine which action codebooks can be combined to obtain each action sequence, and how to combine the action codebooks, i.e., to determine the correspondence between action sequences and action codebooks. Based on this correspondence, when it is necessary to synthesize an action sequence for a certain action, the corresponding action codebooks are obtained from the action codebook set and combined according to the corresponding action codebook combination order to obtain the desired action sequence.

[0120] Based on the aforementioned set of action codebooks, when an action intention is recognized from the target speech, the action corresponding to the action intention is first determined, such as "guidance". Then, action codebooks that can be combined to obtain the action are selected from the aforementioned set of action codebooks. The combination method of these action codebooks to obtain the action can also be determined. By combining these action codebooks according to the combination method, the first action sequence corresponding to the action intention can be generated.
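Generating the first action sequence from a recognized intent then amounts to a table lookup followed by an ordered combination. In the sketch below, the table `CORRESPONDENCE`, the toy `CODEBOOK_SET`, and the function `first_action_sequence` are hypothetical; the source states only that the correspondence between actions and action codebooks, including the combination order, is known from the encoding process.

```python
from typing import Dict, List, Sequence, Tuple

Code = Tuple[float, ...]

# Hypothetical codebook set and action-to-codebook correspondence.
CODEBOOK_SET: List[Code] = [(0.0, 0.0), (0.2, 0.1), (0.6, 0.4), (0.9, 0.8)]
CORRESPONDENCE: Dict[str, List[int]] = {
    "guidance": [0, 2, 3, 2],  # ordered codebook indices for the "guidance" action
    "greeting": [0, 1, 0],
}


def first_action_sequence(intent: str,
                          correspondence: Dict[str, List[int]],
                          codebook_set: Sequence[Code]) -> List[Code]:
    """Combine the codebooks mapped to intent, in their recorded order."""
    if intent not in correspondence:
        raise KeyError(f"no action registered for intent {intent!r}")
    return [codebook_set[i] for i in correspondence[intent]]


guidance_sequence = first_action_sequence("guidance", CORRESPONDENCE, CODEBOOK_SET)
print(guidance_sequence)
```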

[0121] In one embodiment of this application, a prosodic feature extraction model is pre-trained to extract prosodic features from speech. By inputting the target speech into the pre-trained prosodic feature extraction model, the prosodic features of the target speech output by the model can be obtained.

[0122] This prosodic feature extraction model can be obtained by training a neural network model of any type and structure to perform prosodic feature extraction.

[0123] Specifically, the prosodic feature extraction model described above is trained using a first training method and/or a second training method. The first training method is used to train the prosodic feature extraction model to extract speech prosodic features from the input speech, and the second training method is used to train the prosodic feature extraction model to filter out voiceprint features and text features when extracting speech prosodic features from the input speech.

[0124] When training the prosodic feature extraction model, the first training method described above can be used to train it for prosodic feature extraction until the model can accurately extract the prosodic features of the input speech.

[0125] Alternatively, the second training method described above can be used. This method is mainly used to train the prosodic feature extraction model to improve the purity of the prosodic features extracted. During the training process, the prosodic feature extraction model becomes increasingly capable of identifying and filtering out voiceprint and text features from the prosodic features, thereby making the prosodic features extracted by the prosodic feature extraction model purer and more accurate.

[0126] Alternatively, the first and second training methods described above can be combined, i.e., the prosodic feature extraction model can be trained using both training methods.

[0127] As a preferred training scheme, this application embodiment employs a combination of the first and second training methods to train the prosodic feature extraction model. The simultaneous application of these two training methods to the prosodic feature extraction model enables it to extract more accurate prosodic features from the input speech and accelerates model convergence.

[0128] Specifically, since prosodic features are difficult to annotate and training samples with prosodic feature annotations are not easy to obtain, this application adopts a self-supervised speech prosodic feature extraction training scheme.

[0129] First, a batch of speech from different speakers is collected as a training set. At the same time, the corresponding text, text features, and voiceprint features are determined through automatic speech recognition.

[0130] Secondly, a prosodic feature extraction model is constructed.

[0131] The next step is to train the prosodic feature extraction model:

[0132] As shown in Figure 4, the sample spectrogram is input into the prosodic feature extraction model to obtain the prosodic features that the model extracts from the sample spectrogram.

[0133] The aforementioned sample spectrogram is the spectrogram of the sample speech used to train the prosodic feature extraction model. The prosodic feature extraction model extracts prosodic features from the sample spectrogram, then performs average pooling on the extracted prosodic features according to the word boundaries of the sample spectrogram, expanding them to the length of the word interval. This ensures that the prosodic feature of each extracted word is the same length as the interval it occupies. This results in the length of the extracted prosodic features being the same as the length of the spectrogram, and the length of the prosodic feature corresponding to each word being the same as the length of the spectrogram for each word. This setup facilitates the prediction of the second action sequence using the prosodic features of the speech, ensuring that the length is the same as the length of the speech and that it strictly matches each word of the speech.
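The word-level pooling and expansion can be sketched as follows: within each word interval the frame-level features are averaged, and the average is repeated across the interval so that the output keeps the input length. The function name `pool_by_word`, the half-open `(start, end)` frame intervals, and leaving frames outside any word unchanged are assumptions; the source does not describe how word boundaries are represented.

```python
from typing import List, Sequence, Tuple


def pool_by_word(features: Sequence[Sequence[float]],
                 boundaries: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """Average features inside each word interval and expand back to frame length."""
    pooled = [list(f) for f in features]  # frames outside any word stay as-is (assumed)
    for start, end in boundaries:
        if not 0 <= start < end <= len(features):
            raise ValueError(f"invalid word interval ({start}, {end})")
        span = end - start
        dim = len(features[start])
        mean = [sum(features[t][d] for t in range(start, end)) / span
                for d in range(dim)]
        for t in range(start, end):
            pooled[t] = list(mean)
    return pooled


# Five frames with one feature each; word 1 covers frames 0-1, word 2 covers frames 2-4.
frame_features = [[1.0], [3.0], [2.0], [4.0], [6.0]]
word_boundaries = [(0, 2), (2, 5)]
pooled = pool_by_word(frame_features, word_boundaries)
print(pooled)
```

Because the output has one vector per input frame, a second action sequence predicted from it can be aligned with the speech frame by frame.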

[0134] Next, the process will be divided into two branches, and the loss function will be calculated:

[0135] The first approach involves using the extracted prosodic features, as well as the voiceprint and text features of the sample spectrograms, to reconstruct the spectrograms and obtain the reconstructed spectrograms. Based on the sample spectrograms and the reconstructed spectrograms, the spectrogram reconstruction loss is calculated.

[0136] The second approach involves using the extracted prosodic features to reconstruct voiceprint features and text features, resulting in reconstructed voiceprint features and reconstructed text features. Based on the voiceprint features and reconstructed voiceprint features, a first adversarial loss is calculated, and based on the text features and reconstructed text features, a second adversarial loss is calculated.

[0137] The operational parameters of the prosodic feature extraction model are corrected with the goal of the spectrogram reconstruction loss being less than a preset first loss threshold and the first adversarial loss and the second adversarial loss being greater than a preset second loss threshold.
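The training goal stated above can be sketched as a stopping check together with one possible scalar objective. Only the threshold conditions come from the source. The mean squared error used for reconstruction, the function names, and the combined objective (reconstruction loss minus weighted adversarial losses, a common way to minimise one term while maximising others) are assumptions, not the embodiment's stated formulation.

```python
from typing import Sequence


def mse(target: Sequence[float], estimate: Sequence[float]) -> float:
    """Mean squared error, used here as an assumed reconstruction loss."""
    if len(target) != len(estimate):
        raise ValueError("length mismatch")
    return sum((t - e) ** 2 for t, e in zip(target, estimate)) / len(target)


def training_goal_met(recon_loss: float, adv_voiceprint_loss: float,
                      adv_text_loss: float, first_threshold: float,
                      second_threshold: float) -> bool:
    """Reconstruction loss below the first threshold, both adversarial losses above the second."""
    return (recon_loss < first_threshold
            and adv_voiceprint_loss > second_threshold
            and adv_text_loss > second_threshold)


def combined_objective(recon_loss: float, adv_voiceprint_loss: float,
                       adv_text_loss: float, adv_weight: float = 1.0) -> float:
    """Assumed objective for the extractor: minimise reconstruction, maximise adversarial losses."""
    return recon_loss - adv_weight * (adv_voiceprint_loss + adv_text_loss)


recon = mse([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
done = training_goal_met(recon, 0.9, 0.8, first_threshold=0.5, second_threshold=0.5)
objective = combined_objective(0.25, 0.5, 0.25)
print(recon, done, objective)
```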

[0138] The first branch mentioned above corresponds to the first training method mentioned above. This training method trains the speech prosodic feature extraction function of the prosodic feature extraction model in a self-supervised manner, so that it can extract accurate speech prosodic features and ensure that speech can be reconstructed based on the extracted speech prosodic features, that is, ensure the correctness of the extracted speech prosodic features.

[0139] The second branch mentioned above corresponds to the second training method, which is adversarial training. This method allows the prosodic feature extraction model to filter out the voiceprint and text features extracted from the speech when extracting prosodic features from the input speech, thereby making the extracted prosodic features purer.

[0140] After the above training, the prosodic feature extraction model can accurately extract pure prosodic features from the input speech.

[0141] In one embodiment of this application, an action sequence prediction model is pre-trained to predict the action codebook contained in actions that match the prosodic features of speech, and to compose an action codebook sequence using the predicted action codebook.

[0142] Specifically, for the continuous action sequence of the target object in the speaking state, as described in the above embodiment, the action codebook contained in the action sequence and the speech prosodic features of the action sequence can be determined. The action codebook and speech prosodic features corresponding to the same action sequence are combined into training samples to train the action sequence prediction model. This enables the action sequence prediction model to predict the corresponding action codebook based on the speech prosodic features, and to form an action codebook sequence using the predicted action codebook. This action codebook sequence is the action sequence corresponding to the speech prosodic features of the input model.

[0143] Based on the above action sequence prediction model, when predicting the second action sequence corresponding to the speech prosody features based on the speech prosody features of the target speech, the speech prosody features of the target speech are input into the above action sequence prediction model to obtain the action sequence predicted by the model. This action sequence can be used as the second action sequence that matches the speech prosody features of the target speech.

[0144] Exemplary device

[0145] Accordingly, embodiments of this application also provide a voice-based action generation device. As shown in Figure 5, the device includes:

[0146] The first action prediction unit 100 is used to determine the action intention contained in the target speech and to determine a first action sequence that matches the action intention;

[0147] The second action prediction unit 110 is used to extract speech prosodic features from the target speech and predict a second action sequence that matches the speech prosodic features based on the speech prosodic features.

[0148] The action synthesis unit 120 is used to fuse the first action sequence and the second action sequence to generate an action sequence that matches the target speech.

[0149] As an optional implementation, determining the action intent contained in the target speech includes:

[0150] The target speech is subjected to action intent classification processing based on preset action intent category labels to determine the action intent contained in the target speech.

[0151] As an optional implementation, the step of performing action intent classification processing on the target speech based on preset action intent category labels to determine the action intent contained in the target speech includes:

[0152] For each preset action intent category label, the target speech is subjected to binary classification processing;

[0153] Based on the binary classification results corresponding to each preset action intention category label, the action intention contained in the target speech is determined;

[0154] The binary classification process includes determining whether the target speech contains the action intent corresponding to the action intent category label.
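The per-label binary classification can be sketched as one yes/no decision per preset label, with the positive labels collected as the detected intents. In the sketch below the keyword matchers are toy stand-ins for trained binary classifiers, and operating on a transcript rather than on audio is an assumption. The names `keyword_classifier` and `detect_intents`, the label set, and the 0.5 threshold are hypothetical.

```python
from typing import Callable, Dict, List

BinaryClassifier = Callable[[str], float]  # returns a score in [0, 1]


def keyword_classifier(keywords: List[str]) -> BinaryClassifier:
    """Toy stand-in for a trained binary classifier: 1.0 if any keyword occurs."""
    def score(text: str) -> float:
        lowered = text.lower()
        return 1.0 if any(k in lowered for k in keywords) else 0.0
    return score


def detect_intents(text: str, classifiers: Dict[str, BinaryClassifier],
                   threshold: float = 0.5) -> List[str]:
    """Run one binary decision per preset label and keep the positive labels."""
    return [label for label, clf in classifiers.items() if clf(text) >= threshold]


classifiers = {
    "guidance": keyword_classifier(["this direction", "over there"]),
    "greeting": keyword_classifier(["hello", "welcome"]),
}
sentence = ("I know this place you mentioned, just keep walking in "
            "this direction and you'll get there")
intents = detect_intents(sentence, classifiers)
print(intents)
```

Because each label is decided independently, a single utterance can yield zero, one, or several action intents.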

[0155] As an optional implementation, determining the first action sequence matching the action intent includes:

[0156] From a pre-built set of action codebooks, select action codebooks that match the action intent to generate a first action sequence;

[0157] The action codebooks in the action codebook set are used to combine to obtain arbitrary action sequences.

[0158] As an optional implementation, the action codebook set is obtained through the following processing:

[0159] The acquired action sequence is split to obtain the action sequence unit corresponding to each action sequence. The acquired action sequence includes the sequence of actions in the semantic action library and the continuous action sequence of the target object in the speaking state.

[0160] By encoding the action sequence units corresponding to each action sequence and using the encoding results to recover the action sequence, the action code corresponding to each action sequence unit is determined.

[0161] The action codes corresponding to all action sequence units are deduplicated, and the deduplicated action codes are used as action codebooks to form an action codebook set.

[0162] As an optional implementation, speech prosodic features are extracted from the target speech, including:

[0163] The target speech is input into a pre-trained prosodic feature extraction model to obtain the speech prosodic features of the target speech;

[0164] The prosodic feature extraction model is trained through a first training method and/or a second training method. The first training method is used to train the prosodic feature extraction model to extract speech prosodic features from the input speech, and the second training method is used to train the prosodic feature extraction model to filter out voiceprint features and text features when extracting speech prosodic features from the input speech.

[0165] As an optional implementation, the training process of the prosodic feature extraction model includes:

[0166] Input the sample spectrogram into the prosodic feature extraction model to obtain the prosodic features extracted by the prosodic feature extraction model from the sample spectrogram;

[0167] The reconstructed spectrogram is obtained by using the prosodic features, as well as the voiceprint and text features of the sample spectrogram.

[0168] Based on the sample spectrogram and the reconstructed spectrogram, the spectrogram reconstruction loss is calculated;

[0169] The prosodic features are used to reconstruct voiceprint features and text features, resulting in reconstructed voiceprint features and reconstructed text features.

[0170] Based on the voiceprint features and the reconstructed voiceprint features, a first adversarial loss is calculated, and based on the text features and the reconstructed text features, a second adversarial loss is calculated.

[0171] The operational parameters of the prosodic feature extraction model are corrected with the goal of the spectrogram reconstruction loss being less than a preset first loss threshold and the first adversarial loss and the second adversarial loss being greater than a preset second loss threshold.

[0172] As an optional implementation, a second action sequence matching the speech prosodic features is predicted based on the speech prosodic features, including:

[0173] The speech prosody features are input into a pre-trained action sequence prediction model to obtain an action codebook sequence that matches the speech prosody features, and the obtained action codebook sequence is used as a second action sequence.

[0174] The action sequence prediction model is used to predict the action codebook contained in the actions that match the speech prosodic features, and to use the predicted action codebook to form an action codebook sequence. The action codebook is the action code corresponding to the action sequence unit used to form a continuous action sequence.

[0175] As an optional implementation, the first action prediction unit is further configured to determine the position range of the action intention in the target speech;

[0176] The first action sequence and the second action sequence are fused to generate an action sequence that matches the target speech, including:

[0177] The action sequence in the second action sequence that is located in the first position interval corresponding to the position interval is replaced with the first action sequence, and the replaced second action sequence is decoded to obtain an action sequence that matches the target speech.
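The replacement-based fusion can be sketched on sequences of action codes. The sketch assumes that the position range has already been converted to a half-open interval of unit indices and that the first action sequence has exactly that length, so the fused sequence keeps the length of the second sequence. The lookup table used for decoding is a toy stand-in for the trained decoder, and all names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def fuse(second_codes: Sequence[int], first_codes: Sequence[int],
         interval: Tuple[int, int]) -> List[int]:
    """Replace the codes of the second sequence inside interval with the first sequence."""
    start, end = interval
    if not 0 <= start < end <= len(second_codes):
        raise ValueError("interval lies outside the second action sequence")
    if len(first_codes) != end - start:
        raise ValueError("first action sequence must fill the interval exactly (assumed)")
    return list(second_codes[:start]) + list(first_codes) + list(second_codes[end:])


def decode(codes: Sequence[int], decoder_table: Dict[int, List[str]]) -> List[str]:
    """Toy decoder: expand each action code into the frames of its unit."""
    frames: List[str] = []
    for code in codes:
        frames.extend(decoder_table[code])
    return frames


# Prosody-driven codes for six units; the intent occupies units 2 and 3.
second_sequence = [1, 1, 2, 1, 1, 2]
first_sequence = [8, 9]
fused = fuse(second_sequence, first_sequence, interval=(2, 4))

decoder_table = {
    1: ["beat_a", "beat_b"],
    2: ["rest", "rest"],
    8: ["raise_arm", "point"],
    9: ["point", "lower_arm"],
}
frames = decode(fused, decoder_table)
print(fused, frames)
```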

[0178] The voice-based action generation device provided in this embodiment belongs to the same concept as the voice-based action generation method provided in the above embodiments of this application. It can execute the voice-based action generation method provided in any of the above embodiments of this application and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment can be found in the specific processing content of the voice-based action generation method provided in the above embodiments of this application, and will not be repeated here.

[0179] Exemplary electronic devices

[0180] Another embodiment of this application also provides an electronic device. As shown in Figure 6, the device includes:

[0181] Memory 200 and processor 210;

[0182] The memory 200 is connected to the processor 210 and is used to store programs;

[0183] The processor 210 is configured to implement the voice-based action generation method disclosed in any of the above embodiments by running the program stored in the memory 200.

[0184] Specifically, the aforementioned electronic device may also include: a bus, a communication interface 220, an input device 230, and an output device 240.

[0185] The processor 210, memory 200, communication interface 220, input device 230, and output device 240 are interconnected via a bus. Among them:

[0186] A bus can include a pathway for transmitting information between various components of a computer system.

[0187] The processor 210 can be a general-purpose processor, such as a central processing unit (CPU) or a microprocessor, or an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of the present invention. It can also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.

[0188] Processor 210 may include a main processor, as well as a baseband chip, modem, etc.

[0189] The memory 200 stores a program that executes the technical solution of this invention, and may also store an operating system and other key business functions. Specifically, the program may include program code, which includes computer operation instructions. More specifically, the memory 200 may include read-only memory (ROM), other types of static storage devices capable of storing static information and instructions, random access memory (RAM), other types of dynamic storage devices capable of storing information and instructions, disk storage, flash memory, etc.

[0190] Input device 230 may include a device for receiving user input data and information, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor.

[0191] Output device 240 may include devices that allow information to be output to a user, such as a display screen, printer, speaker, etc.

[0192] The communication interface 220 may include a device that uses any transceiver to communicate with other devices or communication networks, such as Ethernet, Radio Access Network (RAN), Wireless Local Area Network (WLAN), etc.

[0193] The processor 210 executes the program stored in the memory 200 and calls other devices, which can be used to implement any of the steps of the voice-based action generation method provided in the above embodiments of this application.

[0194] Exemplary computer program products and storage media

[0195] In addition to the methods and devices described above, embodiments of this application may also be computer program products, which include computer program instructions that, when executed by a processor, cause the processor to perform the steps in the speech-based action generation method described in the "Exemplary Methods" section of this specification.

[0196] The computer program product can be written in any combination of one or more programming languages to perform the operations of the embodiments of this application. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on the user's computing device, partially on the user's computing device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.

[0197] Furthermore, embodiments of this application may also be storage media storing a computer program which, when executed by a processor, causes the processor to perform the steps of the voice-based action generation method described in the "Exemplary Methods" section above.

[0198] For the foregoing method embodiments, in order to simplify the description, they are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, because according to this application, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to this application.

[0199] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For parts that are similar or identical across embodiments, the embodiments can be cross-referenced. Because the apparatus embodiments are basically similar to the method embodiments, their description is relatively brief; for the relevant parts, refer to the descriptions in the method embodiments.

[0200] The steps in the methods of the various embodiments of this application can be reordered, merged, or deleted according to actual needs, and the technical features described in each embodiment can be replaced or combined.

[0201] The modules and sub-modules in the various embodiments of the present application's devices and terminals can be merged, divided, and deleted according to actual needs.

[0202] It should be understood that the disclosed terminals, devices, and methods can be implemented in other ways, given the several embodiments provided in this application. For example, the terminal embodiments described above are merely illustrative. For instance, the division of modules or sub-modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or other forms.

[0203] The modules or submodules described as separate components may or may not be physically separate. The components that constitute a module or submodule may or may not be physical modules or submodules; that is, they may be located in one place or distributed across multiple network modules or submodules. Some or all of the modules or submodules can be selected to achieve the purpose of this embodiment's solution, depending on actual needs.

[0204] Furthermore, the functional modules or sub-modules in the various embodiments of this application can be integrated into one processing module, or each module or sub-module can exist physically separately, or two or more modules or sub-modules can be integrated into one module. The integrated modules or sub-modules described above can be implemented in hardware or in the form of software functional modules or sub-modules.

[0205] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0206] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software unit executed by a processor, or a combination of both. The software unit can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0207] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0208] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A speech-based action generation method, characterized in that it comprises: determining the action intent contained in the target speech, and determining a first action sequence that matches the action intent; inputting the target speech into a pre-trained prosodic feature extraction model to obtain the speech prosodic features of the target speech, and predicting, based on the speech prosodic features, a second action sequence that matches the speech prosodic features; and fusing the first action sequence and the second action sequence to generate an action sequence that matches the target speech; wherein the training process of the prosodic feature extraction model includes: inputting the sample spectrogram into the prosodic feature extraction model to obtain the prosodic features extracted by the prosodic feature extraction model from the sample spectrogram; reconstructing a spectrogram using the prosodic features together with the voiceprint features and text features of the sample spectrogram to obtain a reconstructed spectrogram; calculating a spectrogram reconstruction loss based on the sample spectrogram and the reconstructed spectrogram; reconstructing voiceprint features and text features from the prosodic features to obtain reconstructed voiceprint features and reconstructed text features; calculating a first adversarial loss based on the voiceprint features and the reconstructed voiceprint features, and a second adversarial loss based on the text features and the reconstructed text features; and correcting the operational parameters of the prosodic feature extraction model with the goal that the spectrogram reconstruction loss is less than a preset first loss threshold and the first adversarial loss and the second adversarial loss are greater than a preset second loss threshold.

2. The method according to claim 1, characterized in that determining the action intent contained in the target speech includes: performing action intent classification processing on the target speech based on preset action intent category labels to determine the action intent contained in the target speech.

3. The method according to claim 2, characterized in that performing action intent classification processing on the target speech based on preset action intent category labels to determine the action intent contained in the target speech includes: performing binary classification processing on the target speech for each preset action intent category label; and determining the action intent contained in the target speech based on the binary classification results corresponding to the preset action intent category labels; wherein the binary classification processing includes determining whether the target speech contains the action intent corresponding to the action intent category label.

4. The method according to claim 1, characterized in that determining the first action sequence that matches the action intent comprises: selecting, from a pre-built action codebook set, action codebooks that match the action intent to generate the first action sequence; wherein the action codebooks in the action codebook set can be combined to obtain arbitrary action sequences.

5. The method according to claim 4, characterized in that the action codebook set is obtained through the following processing: splitting the acquired action sequences to obtain the action sequence units corresponding to each action sequence, wherein the acquired action sequences include the action sequences in a semantic action library and continuous action sequences of a target object in a speaking state; determining the action code corresponding to each action sequence unit by encoding the action sequence units corresponding to each action sequence and using the encoding results to recover the action sequence; and deduplicating the action codes corresponding to all action sequence units, and using the deduplicated action codes as action codebooks to form the action codebook set.
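
The split, encode, and deduplicate steps of claim 5 can be sketched as follows. The claim does not specify the encoder, so this sketch substitutes simple grid quantization for a learned encoder and decoder, and represents each frame as a single float; the unit length, the step size, and all function names are assumptions.

```python
def split_into_units(sequence, unit_length):
    """Split a frame sequence into fixed-length units, dropping a short tail."""
    return [
        tuple(sequence[i:i + unit_length])
        for i in range(0, len(sequence) - unit_length + 1, unit_length)
    ]


def encode_unit(unit, step):
    """Stand-in encoder (assumption): quantize each frame value to a grid."""
    return tuple(round(value / step) for value in unit)


def decode_unit(code, step):
    """Recover an approximate unit from its code, mirroring encode_unit."""
    return tuple(index * step for index in code)


def build_codebook_set(sequences, unit_length=2, step=0.5):
    """Encode every unit of every sequence and keep each distinct code once."""
    codebooks = []
    seen = set()
    for sequence in sequences:
        for unit in split_into_units(sequence, unit_length):
            code = encode_unit(unit, step)
            if code not in seen:  # deduplicate while preserving first-seen order
                seen.add(code)
                codebooks.append(code)
    return codebooks
```

Because similar units map to the same code, the deduplicated set is much smaller than the raw data, yet sequences of its entries can still be decoded back into continuous motion.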

6. The method according to claim 1, characterized in that predicting, based on the speech prosodic features, the second action sequence that matches the speech prosodic features comprises: inputting the speech prosodic features into a pre-trained action sequence prediction model to obtain an action codebook sequence that matches the speech prosodic features, and using the obtained action codebook sequence as the second action sequence; wherein the action sequence prediction model is used to predict the action codebooks contained in the actions that match the speech prosodic features, and to form an action codebook sequence from the predicted action codebooks; and wherein an action codebook is the action code corresponding to an action sequence unit used to form a continuous action sequence.

7. The method according to claim 1, characterized in that the method further comprises: determining the position interval of the action intent in the target speech; and fusing the first action sequence and the second action sequence to generate an action sequence that matches the target speech comprises: replacing the portion of the second action sequence that is located in a first position interval corresponding to the position interval with the first action sequence, and decoding the second action sequence after replacement to obtain the action sequence that matches the target speech.
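
The replace-then-decode fusion of claim 7 can be sketched as follows. The sketch assumes that both sequences are lists of action codebook identifiers, that the first position interval is given as a start index in codebook units, and that the first action sequence exactly fills that interval; the lookup-table decoder and all names are hypothetical.

```python
def fuse_action_sequences(second_sequence, first_sequence, start):
    """Overwrite part of the prosody-driven sequence with the intent-driven one.

    `start` is the index in `second_sequence` where the first position
    interval begins. The interval length equals len(first_sequence).
    """
    end = start + len(first_sequence)
    if start < 0 or end > len(second_sequence):
        raise ValueError("first action sequence does not fit in the interval")
    fused = list(second_sequence)  # copy so the input is left unchanged
    fused[start:end] = first_sequence
    return fused


def decode(codebook_sequence, decoder_table):
    """Stand-in decoder (assumption): look up the frames for each codebook."""
    return [frame for code in codebook_sequence for frame in decoder_table[code]]
```

For example, with a second sequence `[3, 3, 5, 7, 3]`, a first sequence `[10, 11]`, and `start=2`, the fused codebook sequence is `[3, 3, 10, 11, 3]`, which is then decoded as a whole so that the semantic gesture and the surrounding rhythmic motion come out of the same decoder.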

8. A speech-based action generation device, characterized by comprising: a first action prediction unit, used to determine an action intent contained in a target speech and to determine a first action sequence that matches the action intent; a second action prediction unit, used to input the target speech into a pre-trained prosodic feature extraction model to obtain speech prosodic features of the target speech, and to predict, based on the speech prosodic features, a second action sequence that matches the speech prosodic features; and an action synthesis unit, used to fuse the first action sequence and the second action sequence to generate an action sequence that matches the target speech; wherein the training process of the prosodic feature extraction model comprises: inputting a sample spectrogram into the prosodic feature extraction model to obtain the prosodic features extracted by the model from the sample spectrogram; obtaining a reconstructed spectrogram by using the prosodic features together with the voiceprint features and text features of the sample spectrogram; calculating a spectrogram reconstruction loss based on the sample spectrogram and the reconstructed spectrogram; reconstructing voiceprint features and text features from the prosodic features to obtain reconstructed voiceprint features and reconstructed text features; calculating a first adversarial loss based on the voiceprint features and the reconstructed voiceprint features, and a second adversarial loss based on the text features and the reconstructed text features; and correcting the operational parameters of the prosodic feature extraction model with the goal that the spectrogram reconstruction loss is less than a preset first loss threshold and that the first adversarial loss and the second adversarial loss are both greater than a preset second loss threshold.

9. An electronic device, characterized by comprising: a memory and a processor; wherein the memory is connected to the processor and is used to store a program; and the processor is configured to implement the speech-based action generation method according to any one of claims 1 to 7 by running the program in the memory.

10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the speech-based action generation method according to any one of claims 1 to 7.