A multi-modal speech interaction method and system for humanoid robots

CN122245316A · Pending · Publication Date: 2026-06-19 · SHENZHEN YING ZHAOJIA TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN YING ZHAOJIA TECHNOLOGY CO LTD
Filing Date
2026-04-29
Publication Date
2026-06-19

Figure CN122245316A_ABST

Abstract

This application provides a multimodal voice interaction method and system for humanoid robots, belonging to the field of voice interaction technology. The method includes: acquiring a user's voice stream, facial image sequence, and gesture image sequence in a target voice interaction scenario; extracting facial expression features and gesture motion features from the facial image sequence and the gesture image sequence, respectively, and determining the intent correction factor of the user's non-verbal modality for speech semantics; using the intent correction factor to semantically adjust the dialogue response text generated from the user's voice stream, and synthesizing the adjusted dialogue response text into response speech; and identifying, based on the speech pause intervals determined in the user's voice stream, the micro-movement triggering sequence for the playback of the response speech, and performing interactive feedback based on the micro-movement triggering sequence. The technical solution provided by this application can achieve cognitive correction of the verbal modality by the non-verbal modality in multimodal voice interaction of humanoid robots.

Description

Technical Field

[0001] This application relates to the field of voice interaction technology, and more specifically, to a multimodal voice interaction method and system for humanoid robots.

Background Technology

[0002] Voice interaction aims to enable machines to communicate through natural speech, just like humans. Its core chain includes automatic speech recognition, natural language understanding, and speech synthesis. Early systems were based on template matching and limited vocabulary, resulting in limited accuracy. The deep learning revolution has brought significant breakthroughs: end-to-end models have greatly improved recognition accuracy in noisy environments, attention mechanisms and Transformer architectures have made semantic understanding more intelligent, and the speech generated by neural network vocoders has become close to the quality of real people. Today, voice is moving from simple command control to emotion recognition and anthropomorphic dialogue.

[0003] In existing voice interaction, the device first uses front-end signal processing technologies such as beamforming and noise suppression to locate and pick up the user's voice in complex environments. Then, automatic speech recognition converts the audio stream into text. Next, natural language understanding analyzes the text intent and extracts key information. The system performs corresponding operations or queries the knowledge base based on the analysis results. Finally, natural language generation and speech synthesis technologies convert the response content into natural speech feedback to the user, achieving smooth human-computer interaction in near-field or far-field scenarios. However, in voice interaction for humanoid robots, traditional human-computer voice interaction systems rely solely on a single voice modality for semantic understanding, failing to effectively integrate non-verbal cues such as facial expressions and gestures. This leads to ambiguity or deviation in the interpretation of the user's true intent. Especially when the voice content is inconsistent with non-verbal expressions, it is difficult to dynamically correct the semantic understanding results, resulting in inappropriate voice interaction feedback. Therefore, how to achieve cognitive correction of the language modality by the non-verbal modality in multimodal voice interaction for humanoid robots has become a challenge for the industry.

Summary of the Invention

[0004] This application provides a multimodal voice interaction method and system for humanoid robots, which can achieve cognitive correction of the verbal modality by the non-verbal modality in multimodal voice interaction of humanoid robots.

[0005] In a first aspect, this application provides a multimodal voice interaction method for humanoid robots, comprising the following steps:
acquiring a user's voice stream, a user's facial image sequence, and a user's gesture image sequence in a target voice interaction scenario;
extracting facial expression features from the user's facial image sequence, extracting gesture motion features from the user's gesture image sequence, and determining, based on the facial expression features and the gesture motion features, the intent correction factor of the user's non-verbal modality for speech semantics;
generating dialogue response text from the user's voice stream, semantically adjusting the dialogue response text using the intent correction factor to generate adjusted dialogue response text, and synthesizing the adjusted dialogue response text into response speech; and
determining the speech pause intervals in the user's voice stream, identifying, based on the speech pause intervals, the micro-movement triggering sequence of the humanoid robot during playback of the response speech, and controlling the humanoid robot, based on the micro-movement triggering sequence, to provide interactive feedback on non-semantic sounds within the speech pause intervals.

[0006] In some embodiments, extracting facial expression features from the user's facial image sequence specifically includes:
performing face region detection on each user facial image in the user's facial image sequence to obtain the face bounding box region corresponding to each user facial image;
locating facial feature points within the face bounding box region of each user facial image to obtain the facial feature point coordinate set corresponding to each user facial image, and, based on all the facial feature point coordinate sets, aligning the face bounding box region of each user facial image to a preset standard face template coordinate system by affine transformation to obtain an aligned normalized face image sequence; and
extracting reference frame difference features and intra-frame texture deformation features from each user facial image in the normalized face image sequence, and concatenating the reference frame difference features and intra-frame texture deformation features of each user facial image in frame order to obtain the facial expression features.

[0007] In some embodiments, extracting gesture motion features from the user gesture image sequence specifically includes: For each user gesture image in the user gesture image sequence, hand region detection is performed to obtain the hand bounding box region corresponding to each user gesture image; Hand skeleton keypoints are detected based on the bounding box region of the hand corresponding to each user's gesture image, and the coordinate set of the hand skeleton keypoints corresponding to each user's gesture image is obtained. The coordinate set of the hand skeleton keypoints corresponding to each user's gesture image is arranged in the frame order to form a spatiotemporal sequence of hand skeleton keypoints. Inter-frame difference calculation is performed on the coordinates of the same hand joint point in adjacent user gesture images in the spatiotemporal sequence of the hand bone key points to obtain the inter-frame displacement vector sequence of each hand joint point; The inter-frame displacement vector sequences of all hand joints are spatially normalized according to the joint topological adjacency relationship to obtain the gesture motion features.

[0008] In some embodiments, determining the intent correction factor of the user's non-verbal modality for speech semantics based on the facial expression features and the gesture motion features specifically includes:
time-aligning the facial expression features and the gesture motion features to obtain a synchronized non-verbal feature pair sequence;
concatenating the facial expression features and gesture motion features at each time step of the non-verbal feature pair sequence to obtain the fused non-verbal representation vector at each time step;
subjecting the fused non-verbal representation vector at each time step to cross-modal attention interaction with the speech semantic features of the corresponding time step to obtain the intent offset at each time step; and
pooling and aggregating the intent offsets at all time steps along the time dimension to obtain the intent correction factor of the user's non-verbal modality for speech semantics.

[0009] In some embodiments, generating dialogue response text from the user's voice stream specifically includes: Speech recognition is performed on the user's voice stream to obtain the user's text sequence; The user text sequence is segmented to obtain a word sequence, and the word sequence is mapped to word vectors to obtain a word embedding vector sequence. The word embedding vector sequence is subjected to contextual semantic encoding to obtain the speech semantic features. The speech semantic features are then input into a preset dialogue response generator to generate dialogue response text.

[0010] In some embodiments, determining the speech pause intervals in the user's voice stream specifically includes:
performing voice activity detection on the user's voice stream to obtain a spoken segment sequence and a silent segment sequence;
performing non-lexical speech recognition on each spoken segment in the spoken segment sequence to obtain the non-lexical speech segments within each spoken segment and their start and end timestamps, the non-lexical speech segments including filled pauses;
extending the start and end timestamps of each non-lexical speech segment forward and backward to the nearest silence boundary in the adjacent silent segment sequence to obtain the extended speech pause candidate intervals; and
extracting from the user's voice stream the audio segment corresponding to each speech pause candidate interval, performing semantic coherence verification on the speech content before and after the audio segment, and, if the verification passes, determining the speech pause candidate interval as a speech pause interval.
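The boundary extension step in [0010] can be illustrated with a minimal sketch. The application does not say which edge of a silent segment counts as the "nearest silence boundary"; the sketch assumes the literal reading, namely the closest silence end at or before the filler and the closest silence start at or after it. The `Segment` type and the function name `extend_to_silence` are hypothetical, and the voice activity detection, non-lexical recognition, and semantic coherence verification steps are outside its scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the beginning of the voice stream
    end: float    # seconds from the beginning of the voice stream


def extend_to_silence(filler: Segment, silences: list[Segment]) -> Segment:
    """Extend a non-lexical (filler) segment to the nearest silence boundaries.

    Assumption: the boundary before the filler is the latest silence end that
    does not exceed the filler start, and the boundary after the filler is the
    earliest silence start that is not before the filler end.
    """
    ends_before = [s.end for s in silences if s.end <= filler.start]
    starts_after = [s.start for s in silences if s.start >= filler.end]
    start = max(ends_before) if ends_before else filler.start
    end = min(starts_after) if starts_after else filler.end
    return Segment(start, end)


if __name__ == "__main__":
    silences = [Segment(0.0, 0.5), Segment(2.0, 2.4)]
    print(extend_to_silence(Segment(1.2, 1.6), silences))
```

Each returned candidate would still need to pass the semantic coherence verification before being accepted as a speech pause interval.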

[0011] In some embodiments, a microphone array integrated into the head of a humanoid robot is used to acquire the user's voice stream in the target voice interaction scenario.

[0012] In a second aspect, this application provides a multimodal voice interaction system for humanoid robots, used to execute the above multimodal voice interaction method for humanoid robots, the system comprising:
an acquisition module, used to acquire the user's voice stream, the user's facial image sequence, and the user's gesture image sequence in the target voice interaction scenario;
a processing module, used to extract facial expression features from the user's facial image sequence, extract gesture motion features from the user's gesture image sequence, and determine the intent correction factor of the user's non-verbal modality for speech semantics based on the facial expression features and the gesture motion features;
the processing module being further configured to generate dialogue response text from the user's voice stream, semantically adjust the dialogue response text using the intent correction factor to generate adjusted dialogue response text, and synthesize the adjusted dialogue response text into response speech; and
an execution module, used to determine the speech pause intervals in the user's voice stream, identify the micro-movement triggering sequence of the humanoid robot during playback of the response speech based on the speech pause intervals, and control the humanoid robot, based on the micro-movement triggering sequence, to provide interactive feedback on non-semantic sounds within the speech pause intervals.

[0013] In a third aspect, this application provides a computer device including a memory and a processor, the memory storing code, and the processor being configured to acquire the code and execute the above multimodal voice interaction method for humanoid robots.

[0014] In a fourth aspect, this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above multimodal voice interaction method for humanoid robots.

[0015] The technical solutions provided by the embodiments disclosed in this application have the following beneficial effects: The multimodal voice interaction method and system for humanoid robots provided in this application first acquires the user's voice stream, user facial image sequence, and user gesture image sequence in the target voice interaction scenario; secondly, facial expression features are extracted from the user's facial image sequence, and gesture motion features are extracted from the user's gesture image sequence, and the user's non-verbal modality intention correction factor for speech semantics is determined based on the facial expression features and the gesture motion features; then, a dialogue response text is generated through the user's voice stream, and the dialogue response text is semantically adjusted using the intention correction factor to generate an adjusted dialogue response text, which is then synthesized into a response voice; finally, the voice pause intervals in the user's voice stream are determined, and the micro-movement triggering sequence of the humanoid robot during the playback of the response voice is identified based on the voice pause intervals, and the humanoid robot is controlled to provide interactive feedback on non-semantic sounds within the voice pause intervals based on the micro-movement triggering sequence.

[0016] Therefore, this application can achieve cognitive correction of the verbal modality by the non-verbal modality in multimodal voice interaction of humanoid robots. First, by simultaneously acquiring the user's voice stream, facial image sequence, and gesture image sequence in the target scenario, verbal information is collected together with non-verbal information such as emotions, attitudes, and actions. This provides time-aligned coverage of the multimodal data and addresses, at the data level, the insufficient information available from a single modality. Second, by extracting facial expression features and gesture motion features and determining the intent correction factor of the non-verbal modality for speech semantics, the user's emotion, attitude, and intent tendency can be quantified and fused, so that ambiguous or contradictory spoken expressions are cognitively corrected during the semantic understanding stage. This improves the accuracy and reliability of intent recognition and avoids ambiguity or deviation in interpreting the user's true intent, in particular the case where the speech content is inconsistent with non-verbal expressions, the semantic understanding results are difficult to correct dynamically, and inappropriate voice interaction feedback is generated. Third, by generating initial response text from the voice stream, semantically adjusting it with the intent correction factor, and synthesizing the response speech, the content, semantic bias, and emotional attitude of the robot's response can align with the user's actual needs. This achieves cognitive correction of the verbal modality by the non-verbal modality, avoids inappropriate feedback and misunderstandings caused by relying solely on voice, and makes the interaction more reasonable and human-like. Fourth, by determining the speech pause intervals, constructing the micro-movement triggering sequence, and controlling the robot to execute non-verbal interactive feedback, the micro-movements of the humanoid robot can be closely matched with the rhythm of speech playback and the pauses in the user's communication. This mitigates the disconnect between action and voice, stiff interaction, and insufficient anthropomorphism, and achieves more natural and human-like humanoid robot voice interaction through multimodal collaboration. In summary, the technical solution provided in this application can achieve cognitive correction of the verbal modality by the non-verbal modality in multimodal voice interaction of humanoid robots.

Description of the Attached Figures

[0017] Figure 1 is a schematic diagram of an application scenario architecture for a multimodal voice interaction method for humanoid robots according to some embodiments of this application;
Figure 2 is an exemplary flowchart of a multimodal voice interaction method for humanoid robots according to some embodiments of this application;
Figure 3 is an exemplary flowchart illustrating the determination of facial expression features according to some embodiments of this application;
Figure 4 is a schematic diagram of the structure of a multimodal voice interaction system for humanoid robots according to some embodiments of this application;
Figure 5 is a schematic diagram of the structure of a computer device that implements a multimodal voice interaction method for humanoid robots according to some embodiments of this application.

Detailed Implementation

[0018] To better understand the technical solution of this application, the technical solution of this application will be described in detail below with reference to the accompanying drawings and specific embodiments.

[0019] Refer to Figure 1, which is a schematic diagram of an application scenario architecture of a multimodal voice interaction method for humanoid robots according to some embodiments of this application. The application scenario architecture includes a user and a humanoid robot. When the user interacts with the humanoid robot via voice, the humanoid robot obtains the interaction information sent by the user and provides interactive feedback to the user based on the interaction information.

[0020] Refer to Figure 2, which is an exemplary flowchart of a multimodal voice interaction method for humanoid robots according to some embodiments of this application. The method mainly includes the following steps:
In step S101, the user's voice stream, the user's facial image sequence, and the user's gesture image sequence in the target voice interaction scenario are acquired.

[0021] In specific implementation, the user's voice stream in the target voice interaction scenario is acquired through a microphone array integrated into the head of the humanoid robot, and the user's facial image sequence is acquired through a camera integrated into the head of the humanoid robot. The user's facial image sequence includes user facial images at different time points. The user's gesture image sequence is acquired through an infrared gesture sensor integrated into the head of the humanoid robot. The user's gesture image sequence includes user gesture images at different time points. To ensure the time alignment of multimodal data, the user's voice stream, user's facial image sequence, and user's gesture image sequence are synchronized through a software timestamp mechanism.
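The software timestamp synchronization mentioned in [0021] is not detailed in the application. One common approach, shown here only as a sketch under that assumption, pairs each frame of one stream with the nearest-in-time frame of another and discards pairs whose skew exceeds a tolerance. The function names and the `max_skew` tolerance of 20 ms are hypothetical.

```python
from bisect import bisect_left


def nearest_index(timestamps: list[float], t: float) -> int:
    """Return the index of the entry in sorted `timestamps` closest to `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier frame.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(face_ts: list[float], gesture_ts: list[float],
          max_skew: float = 0.02) -> list[tuple[int, int]]:
    """Pair face frames with gesture frames by nearest timestamp.

    Both lists must be sorted in ascending order (seconds). Pairs whose
    timestamps differ by more than `max_skew` are dropped.
    """
    if not gesture_ts:
        return []
    pairs = []
    for face_idx, t in enumerate(face_ts):
        gesture_idx = nearest_index(gesture_ts, t)
        if abs(gesture_ts[gesture_idx] - t) <= max_skew:
            pairs.append((face_idx, gesture_idx))
    return pairs


if __name__ == "__main__":
    print(align([0.00, 0.04, 0.08], [0.01, 0.05, 0.20]))
```

The same pairing could serve as the moment-by-moment matching used later when facial expression features and gesture motion features are time-aligned.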

[0022] In step S102, facial expression features are extracted from the user's facial image sequence, hand gesture motion features are extracted from the user's hand gesture image sequence, and the user's non-verbal modality intention correction factor for speech semantics is determined based on the facial expression features and the hand gesture motion features.

[0023] In some embodiments, refer to Figure 3, which is an exemplary flowchart illustrating the determination of facial expression features according to some embodiments of this application. In this embodiment, the extraction of facial expression features from the user's facial image sequence can be achieved using the following steps:
In step S1021, face region detection is performed on each user facial image in the user's facial image sequence to obtain the face bounding box region corresponding to each user facial image;
In step S1022, facial feature points are located within the face bounding box region of each user facial image to obtain the facial feature point coordinate set corresponding to each user facial image, and, based on all the facial feature point coordinate sets, the face bounding box region of each user facial image is affinely transformed and aligned to the preset standard face template coordinate system to obtain the aligned normalized face image sequence;
In step S1023, the reference frame difference features and intra-frame texture deformation features are extracted from each user facial image in the normalized face image sequence, and the reference frame difference features and intra-frame texture deformation features of each user facial image are concatenated in frame order to obtain the facial expression features.
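The alignment in step S1022 can be illustrated with a restricted case of an affine transformation. The sketch below estimates only scale, rotation, and translation from two landmark correspondences (for example, the two eye centers), whereas the application computes its transformation from the full facial feature point coordinate set. Representing 2-D points as complex numbers is a convenience of this sketch, and the template coordinates in the usage example are hypothetical.

```python
Point = tuple[float, float]


def similarity_from_two_points(src_a: Point, src_b: Point,
                               dst_a: Point, dst_b: Point) -> tuple[complex, complex]:
    """Return (a, b) such that z -> a * z + b maps src_a to dst_a and src_b to dst_b.

    `a` encodes uniform scale and rotation, `b` encodes translation.
    """
    sa, sb = complex(*src_a), complex(*src_b)
    da, db = complex(*dst_a), complex(*dst_b)
    if sa == sb:
        raise ValueError("source landmarks must be distinct")
    a = (db - da) / (sb - sa)
    b = da - a * sa
    return a, b


def apply_transform(transform: tuple[complex, complex], point: Point) -> Point:
    """Map one landmark into the template coordinate system."""
    a, b = transform
    z = a * complex(*point) + b
    return (z.real, z.imag)


if __name__ == "__main__":
    # Hypothetical eye centers in the detected face and in a 128 x 128 template.
    t = similarity_from_two_points((10, 20), (30, 20), (44, 52), (84, 52))
    print(apply_transform(t, (20, 20)))
```

A full affine fit over all landmarks would add shear and non-uniform scale and is typically solved by least squares; the two-point form is shown only because it can be checked by hand.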

[0024] In specific implementation, first, the Viola-Jones face detection algorithm is applied to each user facial image in the user's facial image sequence. Using Haar-like features and a cascaded AdaBoost classifier, a sliding window is scanned over the entire image; Haar-like features are extracted for each window and input into the cascaded classifier. When the confidence score output by the cascaded classifier for a window is greater than a confidence threshold, a target face is determined to exist within that window. From all windows that meet this condition, the rectangular region that completely encloses the face and has the smallest pixel area is selected as the final face region, and the pixel coordinates of its upper-left and lower-right corners are output to form the face bounding box region. The face bounding box region is the smallest-area rectangular region that completely encloses the user's face. Second, based on the face bounding box region, the Active Shape Model (ASM) 68-point facial feature point localization algorithm is used to locate key points such as the inner and outer corners of the eyes, the left and right endpoints of the eyebrows, the tip and wings of the nose, the corners of the mouth and the contours of the upper and lower lips, and the outer contour of the face, and the facial feature point coordinate set corresponding to each user facial image is output. The facial feature point coordinate set is a data set consisting of the two-dimensional pixel coordinates of the 68 facial key points. An affine transformation matrix is calculated from the facial feature point coordinate set; the image within the face bounding box of each user facial image is uniformly scaled to 128 × 128 pixels and then rotated and translated so that the facial key points align with the preset standard face template. After differences in pose, scale, and position have been eliminated, the normalized face image sequence is obtained, consisting of consecutive face frame images with a uniform size of 128 × 128 pixels and standardized poses. Finally, using the first frame of the normalized face image sequence as the reference user facial image, the grayscale difference between corresponding pixels of each subsequent user facial image and the reference user facial image is calculated to obtain the reference frame difference features. At the same time, Gaussian gradient filtering with a standard deviation of 1.5 is applied to each single normalized frame to extract local texture changes and obtain the intra-frame texture deformation features. The reference frame difference features and intra-frame texture deformation features of the same user facial image are concatenated along the channel dimension, and the results are then concatenated in time frame order to obtain facial expression features that represent the dynamic changes of the face.
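The final feature extraction stage of [0024] can be sketched in plain Python on small grayscale images stored as nested lists. The sketch assumes that "Gaussian gradient filtering" means a Gaussian-derivative gradient magnitude, and it includes the reference frame itself, whose difference channel is zero by construction. The kernel radius of 4, the replicated-edge handling, and all function names are assumptions of this sketch.

```python
import math

Image = list[list[float]]  # grayscale image stored as rows of pixel values


def gaussian_kernels(sigma: float = 1.5, radius: int = 4) -> tuple[list[float], list[float]]:
    """Return 1-D (smoothing, first-derivative) Gaussian kernels."""
    offsets = range(-radius, radius + 1)
    g = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(g)
    smooth = [v / total for v in g]
    deriv = [-k / (sigma * sigma) * v for k, v in zip(offsets, smooth)]
    return smooth, deriv


def filter_rows(img: Image, kernel: list[float]) -> Image:
    """Filter every row with `kernel`, replicating edge pixels."""
    r = len(kernel) // 2
    w = len(img[0])
    return [
        [sum(kernel[k + r] * row[min(max(x + k, 0), w - 1)] for k in range(-r, r + 1))
         for x in range(w)]
        for row in img
    ]


def transpose(img: Image) -> Image:
    return [list(col) for col in zip(*img)]


def gradient_magnitude(img: Image, sigma: float = 1.5) -> Image:
    """Gaussian-derivative gradient magnitude (intra-frame texture channel)."""
    smooth, deriv = gaussian_kernels(sigma)
    # Smooth along y, differentiate along x.
    gx = filter_rows(transpose(filter_rows(transpose(img), smooth)), deriv)
    # Smooth along x, differentiate along y.
    gy = transpose(filter_rows(transpose(filter_rows(img, smooth)), deriv))
    return [[math.hypot(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]


def expression_features(frames: list[Image]) -> list[list[Image]]:
    """Return features[t][c]: channel 0 is the difference to frame 0, channel 1 the texture."""
    reference = frames[0]
    features = []
    for frame in frames:
        diff = [[p - q for p, q in zip(row, ref_row)]
                for row, ref_row in zip(frame, reference)]
        features.append([diff, gradient_magnitude(frame)])  # channel concatenation
    return features
```

Appending the per-frame channel pairs in order corresponds to the concatenation along the time frame order; a production system would more likely run the same operations on array data with an image processing library.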

[0025] It should be noted that, in this application, facial expression features refer to high-dimensional feature data describing continuous changes in facial expressions. In multimodal voice interaction, the user's true intentions often cannot be fully expressed by voice content alone. Facial expressions can directly reflect the user's emotional state, attitude, and true psychological intentions. They are the most intuitive and timely non-verbal supplement and verification information for voice semantics. Therefore, extracting facial expression features can accurately capture changes in the user's emotions and attitudes, providing a key basis for non-verbal modal correction of voice semantics, effectively solving the problems of ambiguous voice semantics, contradictory expressions, or unclear intentions, and improving the accuracy of the robot's understanding of user intentions and its human-like interaction capabilities.

[0026] In some embodiments, extracting gesture motion features from the user gesture image sequence is achieved through the following steps: For each user gesture image in the user gesture image sequence, hand region detection is performed to obtain the hand bounding box region corresponding to each user gesture image; Hand skeleton keypoints are detected based on the bounding box region of the hand corresponding to each user's gesture image, and the coordinate set of the hand skeleton keypoints corresponding to each user's gesture image is obtained. The coordinate set of the hand skeleton keypoints corresponding to each user's gesture image is arranged in the frame order to form a spatiotemporal sequence of hand skeleton keypoints. Inter-frame difference calculation is performed on the coordinates of the same hand joint point in adjacent user gesture images in the spatiotemporal sequence of the hand bone key points to obtain the inter-frame displacement vector sequence of each hand joint point; The inter-frame displacement vector sequences of all hand joints are spatially normalized according to the joint topological adjacency relationship to obtain the gesture motion features.

[0027] In specific implementation, firstly, for each user gesture image in the user gesture image sequence, a hand detection algorithm combining YCrCb skin color segmentation and convex hull detection is used. The user gesture image is first converted from the RGB color space to the YCrCb color space. When the Cr component of a pixel is in the range of 133-173 and the Cb component is in the range of 77-127, the pixel is determined to be a skin color pixel. The hand region is then located through connected component analysis, and convex hull detection is used to determine the outer boundary of the hand. The rectangular bounding box region, which can completely enclose the hand and fingers and has the smallest pixel area, is selected as the effective hand region. The pixel coordinates of this rectangle are output to form the hand bounding box region. The hand bounding box region refers to the rectangular region that can completely enclose the user's palm and fingers and has the smallest area. Secondly, based on the determined hand bounding box region, the OpenPose 21-point hand skeletal keypoint detection algorithm is used to detect the joint positions of the wrist center point, metacarpophalangeal joints, proximal interphalangeal joints, distal interphalangeal joints, and fingertips. The set of hand skeletal keypoint coordinates corresponding to each user gesture image is output. The skeletal keypoint coordinate set is a data set consisting of two-dimensional pixel coordinates of 21 hand joint keypoints. The hand skeletal keypoint coordinate sets corresponding to each user gesture image are arranged sequentially according to the acquisition time, forming a spatiotemporal sequence of hand skeletal keypoints that reflects continuous changes in hand movements. This spatiotemporal sequence is continuous hand joint coordinate data arranged in chronological order. 
Then, the horizontal and vertical differences of the coordinates of the same hand skeletal keypoint in two adjacent frames of user gesture images are calculated to obtain the displacement of the corresponding joint between adjacent frames. All joint displacements are combined in frame order to form an inter-frame displacement vector sequence, which is continuous data representing the direction and amplitude of hand joint movement. Finally, the inter-frame displacement vector sequences of all hand joints are normalized according to the hand skeletal topology, using min-max scale normalization and center position normalization, mapping the displacement values to the [-1, 1] interval to eliminate the influence of differences in hand size among different users. This ultimately yields gesture motion features that represent the shape and trend of hand movements.
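Three numeric steps of [0027] can be sketched together: the Cr/Cb skin test, the inter-frame displacement, and a normalization to [-1, 1]. The application does not give the RGB to YCrCb conversion, so the widely used full-range formulas are assumed. The normalization shown is a simplification chosen for the sketch, not the application's exact procedure: it subtracts the wrist displacement as the "center" and divides by the largest absolute component, and treating joint index 0 as the wrist is likewise an assumption.

```python
Point = tuple[float, float]


def is_skin(r: float, g: float, b: float) -> bool:
    """Classify an 8-bit RGB pixel with the thresholds Cr in [133, 173], Cb in [77, 127]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128.0  # assumed full-range conversion
    cb = (b - y) * 0.564 + 128.0
    return 133.0 <= cr <= 173.0 and 77.0 <= cb <= 127.0


def interframe_displacements(seq: list[list[Point]]) -> list[list[Point]]:
    """seq[t][j] is joint j at frame t; returns (dx, dy) between frames t and t + 1."""
    return [
        [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(prev, curr)]
        for prev, curr in zip(seq, seq[1:])
    ]


def normalize(disps: list[list[Point]], wrist: int = 0) -> list[list[Point]]:
    """Center on the wrist displacement, then scale all components into [-1, 1]."""
    centered = [
        [(dx - frame[wrist][0], dy - frame[wrist][1]) for dx, dy in frame]
        for frame in disps
    ]
    scale = max((abs(c) for frame in centered for d in frame for c in d), default=0.0)
    if scale == 0.0:
        return centered  # no relative motion at all
    return [[(dx / scale, dy / scale) for dx, dy in frame] for frame in centered]


if __name__ == "__main__":
    seq = [[(0.0, 0.0), (10.0, 0.0)], [(2.0, 0.0), (16.0, 4.0)]]
    print(normalize(interframe_displacements(seq)))
```

Max-absolute scaling is used here because it keeps zero displacement at zero and preserves direction signs; a strict min-max mapping to [-1, 1] would shift the zero point.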

[0028] It should be noted that the gesture motion features in this application are high-dimensional feature data describing the dynamic changes of gestures. In multimodal voice interaction, users often use gestures to emphasize, supplement, negate, or provide directional explanations of the voice content. Gesture motion features can intuitively reflect the user's interaction intention, attitude tendency, and command direction. They can make up for the lack of information in expressing intentions such as emphasis, selection, and rejection in a single voice modality. They can effectively eliminate the ambiguity and vagueness of voice semantics, enhance the completeness and accuracy of understanding interaction intentions, and make the feedback of humanoid robots more in line with the user's real needs.

[0029] In some embodiments, determining the intent correction factor of the user's non-verbal modality for speech semantics based on the facial expression features and the gesture motion features is achieved through the following steps:
the facial expression features and the gesture motion features are time-aligned to obtain a synchronized non-verbal feature pair sequence;
the facial expression features and gesture motion features at each time step of the non-verbal feature pair sequence are concatenated to obtain the fused non-verbal representation vector at each time step;
the fused non-verbal representation vector at each time step is subjected to cross-modal attention interaction with the speech semantic features of the corresponding time step to obtain the intent offset at each time step;
the intent offsets at all time steps are pooled and aggregated along the time dimension to obtain the intent correction factor of the user's non-verbal modality for speech semantics.

[0030] In specific implementation, first, the facial expression features and gesture motion features are matched moment by moment according to a unified timestamp, so that the features at the same time point correspond one-to-one, yielding a non-verbal feature pair sequence that is fully synchronized in time. The non-verbal feature pair sequence is the paired data of facial expression features and gesture motion features after time alignment. Second, channel-level feature concatenation is performed on the facial expression features and gesture motion features at each time step in the non-verbal feature pair sequence: the two feature vectors are joined end to end along the feature dimension and fused into a single vector, yielding the fused non-verbal representation vector for each time step, which is a single high-dimensional feature vector that integrates facial and gesture information. Then, the fused non-verbal representation vector of each time step and the speech semantic features of the same time step are input into a scaled dot-product cross-modal attention layer. The speech semantic features are high-dimensional feature vectors, extracted from the text converted from the user's speech, that represent the user's core intent, contextual logic, and expressive meaning. They are obtained as follows: the user's speech stream is framed, windowed, and spectrally transformed to obtain acoustic features; the acoustic features are converted into a user text sequence by an automatic speech recognition model; the text sequence is segmented and mapped to word vectors to obtain a word embedding vector sequence; and a context encoding network extracts global features from the word embedding vector sequence, yielding speech semantic features that fully express the meaning of the user's speech. In the attention layer, the correlation between the two types of features is first calculated using vector dot products, attention weights are then obtained by normalization with the Softmax function, and the two types of features are weighted and fused to output an intent offset, which measures the magnitude of the non-verbal modality's adjustment to the speech semantics. Finally, the intent offsets at all time steps are globally average-pooled along the time dimension, so that the arithmetic mean of the intent offsets over the time steps is aggregated into a single value, yielding an intent correction factor that comprehensively represents the corrective effect of the non-verbal modality.
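The four steps of paragraph [0030] can be illustrated with a minimal, dependency-free sketch. It rests on several assumptions that are not stated in this application: the fused non-verbal vectors act as queries and the speech semantic vectors act as both keys and values; both have already been projected to the same dimension; and the pooled result is kept as a vector rather than reduced to a scalar. The function names `fuse`, `intent_offsets`, and `intent_correction_factor` are hypothetical.

```python
import math


def fuse(face_feats, gesture_feats):
    """Concatenate time-aligned face and gesture vectors step by step."""
    return [f + g for f, g in zip(face_feats, gesture_feats)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def intent_offsets(nonverbal, semantic):
    """Scaled dot-product cross-attention (assumed roles: non-verbal vectors
    are queries, speech semantic vectors are keys and values)."""
    dim = len(semantic[0])
    offsets = []
    for query in nonverbal:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in semantic
        ]
        weights = softmax(scores)
        # Attention-weighted sum of the value vectors.
        offsets.append(
            [sum(w * v[i] for w, v in zip(weights, semantic)) for i in range(dim)]
        )
    return offsets


def intent_correction_factor(offsets):
    """Global average pooling of the per-step offsets along the time axis."""
    steps = len(offsets)
    return [sum(o[i] for o in offsets) / steps for i in range(len(offsets[0]))]
```

If a scalar factor is wanted, as the phrase "a single value" may suggest, one further mean over the pooled vector would produce it; that choice is left open here.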

[0031] It should be noted that the intent correction factor in this application refers to the parameter used to weight and adjust the speech semantic understanding results. In multimodal speech interaction, the user's true intent often shows inconsistencies between speech semantics and facial expressions and gestures. Relying solely on the speech modality is prone to comprehension bias and cannot accurately capture the user's potential attitude and true needs. Therefore, determining the intent correction factor is to quantify and integrate the non-verbal information reflected by facial expressions and gestures to form a weighted adjustment parameter that can directly affect speech semantics, dynamically correct speech semantic comprehension biases, eliminate semantic ambiguity and contradictory expressions, and enable humanoid robots to break through the limitations of a single speech modality and accurately identify the user's true intent.

[0032] In step S103, a dialogue response text is generated through the user's voice stream, and the intention correction factor is used to semantically adjust the dialogue response text to generate an adjusted dialogue response text, which is then synthesized into a response voice.

[0033] In some embodiments, generating dialogue response text from the user's voice stream is achieved through the following steps: Speech recognition is performed on the user's voice stream to obtain the user's text sequence; The user text sequence is segmented to obtain a word sequence, and the word sequence is mapped to word vectors to obtain a word embedding vector sequence. The word embedding vector sequence is subjected to contextual semantic encoding to obtain the speech semantic features. The speech semantic features are then input into a preset dialogue response generator to generate dialogue response text.

[0034] In specific implementation, first, the user's speech stream is divided into 25 ms frames with a 10 ms frame shift and Hanning-windowed, and a 40-dimensional Mel-spectrum feature is obtained through the Fast Fourier Transform. This Mel-spectrum feature is input into the DeepSpeech end-to-end automatic speech recognition model; after acoustic features are extracted by convolutional and recurrent layers, joint decoding with an n-gram language model converts the audio signal into the user text sequence, which is the string of text data corresponding character by character to the user's speech content. Then, the user text sequence is split using a Chinese word segmentation algorithm based on dictionary matching and a Hidden Markov Model (HMM), dividing the continuous text into independent semantic units and yielding a word sequence composed of multiple word units, which are the smallest semantic units formed after text segmentation. Next, for word vector mapping, the word sequence is input into a Word2Vec pre-trained word vector model, and each word unit is converted through a lookup table into a fixed 300-dimensional numerical vector, yielding a word embedding vector sequence that represents the lexical semantics of the text. Subsequently, this sequence is input into a Bidirectional Long Short-Term Memory (BiLSTM) network for context encoding, which extracts global semantic information from the text and outputs speech semantic features that fully represent the user's intended meaning. These speech semantic features are high-dimensional feature data describing the core meaning of the user's speech. Finally, the speech semantic features are input into a dialogue response generation model based on Seq2Seq and attention mechanisms, which combines contextual information and knowledge base content to generate, word by word, the dialogue response text to be output by the robot.
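As an illustration of the first step only (25 ms framing, 10 ms frame shift, Hanning window), the following sketch shows how the frame length and shift follow from a sampling rate. The 16 kHz rate in the usage note and the function names are assumptions; the Mel filter bank, DeepSpeech, and the later stages are not reproduced.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a sample list into overlapping frames; a trailing partial
    frame is dropped."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


def hann_window(length):
    """Symmetric Hann (Hanning) window coefficients."""
    if length == 1:
        return [1.0]
    return [
        0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def apply_window(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]
```

At an assumed 16 kHz sampling rate, each frame holds 400 samples and consecutive frames start 160 samples apart, so one second of audio yields 98 complete frames.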

[0035] It should be noted that the dialogue response text in this application is the initial text response content generated by the humanoid robot based on speech semantics.

[0036] In some embodiments, the semantic adjustment of the dialogue response text using the intent correction factor to generate an adjusted dialogue response text, and then the synthesis of the adjusted dialogue response text into a response speech, is achieved through the following steps: The dialogue response text is semantically encoded to obtain a response semantic vector; The response semantic vector is weighted element-wise with the intent correction factor to obtain the adjusted response semantic vector. Text decoding is performed on the adjusted response semantic vector to obtain the adjusted dialogue response text; The adjusted dialogue response text is converted into an acoustic feature sequence, and the acoustic feature sequence is converted into a response speech waveform to obtain the response speech.

[0037] In specific implementation, first, the dialogue response text is input into a TextCNN-based semantic encoding model. Semantic information is extracted through three convolutional, max-pooling, and fully connected layers and converted into a fixed 512-dimensional vector representation. This yields the response semantic vector, a numerical vector describing the semantics of the dialogue response text. Second, the response semantic vector is multiplied element-wise by the intent correction factor along the same dimension. The correction factor adjusts the numerical distribution of the vector to change the semantic tendency of the response text, yielding the adjusted response semantic vector, that is, the response semantic vector after correction by non-verbal information. Third, the adjusted response semantic vector is input into an LSTM-based text decoding network, where word-by-word probability prediction and text index mapping restore the vector to readable text, outputting an adjusted dialogue response text that matches the user's true intent. This adjusted dialogue response text is the robot's final text response after semantic correction. Finally, the adjusted dialogue response text is input into the TTS text preprocessing module, where text normalization, polyphonic character annotation, prosodic boundary prediction, and tone annotation are performed in sequence. The Tacotron2 acoustic parameter prediction model then converts the text into an acoustic feature sequence containing fundamental frequency, Mel spectrum, and energy information; this acoustic feature sequence is continuous feature data representing the pronunciation attributes of the speech. The acoustic feature sequence is input into the WaveNet vocoder, where autoregressive waveform generation and inverse Fourier transform convert the acoustic features into a continuous speech waveform. After amplitude smoothing and noise reduction, a directly playable response speech is obtained.
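The element-wise weighting step can be sketched as follows. Because the text describes the intent correction factor both as a single aggregated value and as a quantity multiplied "along the same dimension", this sketch accepts either a scalar or a vector of matching length; that dual handling and the function name are assumptions, not something the application specifies.

```python
def adjust_response_vector(response_vec, correction):
    """Weight the response semantic vector element-wise by the correction
    factor. A scalar factor is broadcast to every element (assumption)."""
    if isinstance(correction, (int, float)):
        correction = [correction] * len(response_vec)
    if len(correction) != len(response_vec):
        raise ValueError("correction factor must match the vector dimension")
    return [r * c for r, c in zip(response_vec, correction)]
```

A factor of 1.0 in every position leaves the response unchanged, so values away from 1.0 are what shift the semantic tendency of the decoded text.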

[0038] It should be noted that the response voice in this application is anthropomorphic voice audio data output by the robot. Furthermore, in multimodal voice interaction, the original dialogue response text generated solely by voice cannot integrate the true intent reflected by non-verbal information such as facial expressions and gestures, which can easily lead to semantic bias, inconsistent attitude, or inappropriate feedback. Therefore, determining the dialogue response text after adjustment by the intent correction factor is to dynamically correct and optimize the semantics of the original response with quantified non-verbal information, so that the final output text content, tone, and expression tendency fully match the user's true emotions, attitudes, and actual needs, eliminating the comprehension errors caused by single-modal voice, making the robot's response more accurate, more context-appropriate, and more human-like, thereby improving the naturalness and reliability of the overall interaction.

[0039] In step S104, the voice pause intervals in the user's voice stream are determined, and the micro-movement triggering sequence of the humanoid robot during the playback of the response voice is identified based on the voice pause intervals. Then, based on the micro-movement triggering sequence, the humanoid robot is controlled to provide interactive feedback on non-semantic sounds within the voice pause intervals.

[0040] In some embodiments, determining speech pause intervals in the user's speech stream is achieved through the following steps: speech activity detection is performed on the user's speech stream to obtain a spoken segment sequence and a silent segment sequence; non-lexical speech recognition is performed on each spoken segment in the spoken segment sequence to obtain the non-lexical speech segments within each spoken segment and their start and end timestamps, the non-lexical speech segments including filler pauses; the start and end timestamps of each non-lexical speech segment are extended forward and backward, respectively, to the nearest silence boundary in the adjacent silent segment sequence to obtain each extended speech pause candidate interval; and the audio segment corresponding to the speech pause candidate interval is extracted from the user's speech stream, and semantic coherence verification is performed on the speech content before and after the audio segment. If the verification passes, the speech pause candidate interval is determined as the speech pause interval.

[0041] In specific implementation, first, a speech activity detection algorithm based on short-time energy and zero-crossing rate thresholds is applied to the user's speech stream. When the short-time energy is less than the energy threshold and the zero-crossing rate is less than the zero-crossing rate threshold, the audio segment is determined to be a silent segment; otherwise, it is determined to be a valid speech segment. On this basis, a spoken segment sequence composed of the valid speech segments and a silent segment sequence composed of the silent segments are output. The spoken segment sequence is the set of audio segments containing valid user speech, and the silent segment sequence is the set of audio segments without user speech. Second, a non-lexical speech recognition algorithm based on Dynamic Time Warping (DTW) template matching is applied to each audio segment in the spoken segment sequence: the similarity between the audio segment and a preset pause template is calculated, and when the DTW distance is less than a distance threshold, the segment is determined to be a filler pause or other non-semantic sound, yielding the non-lexical speech segments and their start and end timestamps. Non-lexical speech segments are audio segments with no actual semantic meaning, such as modal particles. Then, the start timestamp of the non-lexical speech segment is extended forward to the start boundary of the adjacent silent segment, and the end timestamp is extended backward to the end boundary of the adjacent silent segment, forming a speech pause candidate interval that includes the pause and the surrounding silent areas. The speech pause candidate interval is an audio range preliminarily screened as a potentially effective pause. Finally, the audio segment corresponding to the speech pause candidate interval is extracted from the user's speech stream, and the semantic probability of the speech content before and after the audio segment is calculated using a 3-gram language model. When the semantic probability is greater than the semantic probability threshold, the preceding and following parts are determined to be semantically coherent, and the candidate interval is determined as the speech pause interval.
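The first three steps of paragraph [0041] can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: frames are lists of samples, the DTW comparison runs on one-dimensional feature sequences with an absolute-difference cost, and "adjacent" silence means a silent segment whose boundary touches the non-lexical segment within a small tolerance. All function names and the tolerance are hypothetical, and the 3-gram coherence check is not shown.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def is_silent(frame, energy_thr, zcr_thr):
    """Silent only when both measures fall below their thresholds."""
    return (
        short_time_energy(frame) < energy_thr
        and zero_crossing_rate(frame) < zcr_thr
    )


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def extend_to_silence(segment, silences, tol=1e-6):
    """Extend (start, end) outward over the silent segments that touch it."""
    seg_start, seg_end = segment
    new_start, new_end = seg_start, seg_end
    for s_start, s_end in silences:
        if abs(s_end - seg_start) <= tol:  # silence immediately before
            new_start = s_start
        if abs(s_start - seg_end) <= tol:  # silence immediately after
            new_end = s_end
    return (new_start, new_end)
```

A segment would be treated as a filler pause when `dtw_distance(segment_features, template_features)` falls below the distance threshold; the threshold values themselves are not given in this application.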

[0042] It should be noted that, in this application, the speech pause interval refers to an effective pause audio segment in the user's speech that has no actual semantic meaning. In multimodal speech interaction, the speech pause interval is a key non-semantic signal for the user to express thinking, hesitation, emphasis, or tone transition. This kind of communication rhythm information cannot be reflected by the speech content alone. Therefore, determining the speech pause interval is to accurately locate the natural pause position and duration in the speech, providing a temporal basis for the robot's micro-motion feedback. This enables the humanoid robot to match the natural communication habits of humans and make timely feedback such as nodding and blinking during the speech playback process, enhancing the fluency and anthropomorphism of the interaction, while avoiding conflicts between actions and speech, and improving the realism and comfort of the overall dialogue.

[0043] In some embodiments, identifying the micro-movement triggering sequence of the humanoid robot during the playback of the response voice based on the voice pause interval is achieved through the following steps: Obtain the length and position of the voice pause interval, wherein the position includes the start and end timestamps of the voice pause interval in the user's voice stream; Calculate the relative time position ratio of the voice pause interval based on the start and end timestamps and the total duration of the user's voice stream; Obtain the total duration of the response voice, and determine the micro-action triggering time based on the relative time position ratio and the total duration of the response voice; The duration of the micro-action is obtained based on the interval length, and the micro-action trigger time and the duration of the micro-action are combined into a micro-action trigger sequence.

[0044] In specific implementation, first, the duration of the determined voice pause interval and the start and end timestamps of the pause interval in the user's voice stream are read. The duration of the voice pause interval is the length in time of the paused audio segment. Second, the start timestamp of the voice pause interval is divided by the total duration of the user's voice stream to obtain the relative time position ratio of the voice pause interval in the user's voice, that is, the proportion of the way through the entire voice stream at which the pause begins. Then, the total duration of the response voice is obtained, and the relative time position ratio is multiplied by the total duration of the response voice to calculate the micro-action trigger time, which is the time point during playback of the response voice at which the humanoid robot begins to execute the micro-action. Finally, the voice pause interval duration is looked up in a preset mapping table of interval durations to micro-action durations to obtain the corresponding micro-action duration, and the micro-action trigger time and micro-action duration are combined to form a complete micro-action trigger sequence. The mapping table is a set of correspondence data pre-stored in the humanoid robot system, used to directly map the duration of the user's voice pause interval to a reasonable duration for the robot's micro-movements such as nodding, blinking, and slightly turning its head, ensuring that the rhythm of the movements is consistent with the rhythm of the voice pauses. This mapping table is formulated based on the movement rhythm of everyday human communication, following the rule that short pauses correspond to short movements and long pauses correspond to long movements, which avoids movements that are too fast or too long and would make the interaction feel discordant, and keeps the anthropomorphic feedback natural and smooth. For example, when the duration of the voice pause interval is 0.2 seconds to 0.5 seconds, the mapped micro-action lasts for 0.2 seconds, corresponding to a rapid blink; when the duration is 0.5 seconds to 1.0 seconds, the mapped micro-action lasts for 0.5 seconds, corresponding to a slight nod; when the duration is 1.0 seconds to 1.5 seconds, the mapped micro-action lasts for 1.0 seconds, corresponding to a slow head turn; and when the duration is greater than 1.5 seconds, the mapped micro-action lasts for 1.2 seconds, corresponding to a combination of nodding and blinking, which will not be elaborated further here.
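The arithmetic in paragraph [0044] can be sketched as follows, using the example mapping values given in the text. The handling of range boundaries (lower bound inclusive, and 1.5 seconds falling in the third range) and of pauses shorter than 0.2 seconds (no micro-action) is an assumption, since the text does not specify it; the function and key names are hypothetical.

```python
def micro_action_for_pause(pause_s):
    """Map a pause duration in seconds to (micro-action duration, label),
    following the example table. Returns None below 0.2 s (assumption)."""
    if pause_s < 0.2:
        return None
    if pause_s < 0.5:
        return (0.2, "rapid blink")
    if pause_s < 1.0:
        return (0.5, "slight nod")
    if pause_s <= 1.5:
        return (1.0, "slow head turn")
    return (1.2, "nod and blink")


def build_trigger_entry(pause_start, pause_end, user_total_s, reply_total_s):
    """Project one user pause onto the response voice timeline."""
    mapped = micro_action_for_pause(pause_end - pause_start)
    if mapped is None:
        return None
    # Relative position of the pause start within the user's speech.
    ratio = pause_start / user_total_s
    duration, label = mapped
    return {
        "trigger_time": ratio * reply_total_s,
        "duration": duration,
        "action": label,
    }
```

For a pause from 2.0 s to 2.5 s in a 4.0 s utterance and a 6.0 s response, the ratio is 0.5, so the trigger time is 3.0 s and the 0.5 s pause maps to a 0.5 s slight nod.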

[0045] It should be noted that the micro-action triggering sequence in this application is an action execution plan that includes the action start time and duration. In multimodal voice interaction, in order to ensure that the micro-movements of the humanoid robot's limbs are precisely coordinated with the rhythm of voice playback and the position of the user's voice pauses, and to avoid the disconnect, conflict or inappropriate timing of actions and voice, determining the micro-action triggering sequence can synchronously map the time position and duration information of the user's voice pauses onto the playback time axis of the robot's response voice, clarify the start time and duration of the micro-action, and ensure that the robot performs human-like feedback such as nodding, blinking, and slightly turning its head at the appropriate time, so as to significantly improve the smoothness, realism and humanization of the interaction.

[0046] In some embodiments, controlling the humanoid robot to provide interactive feedback on non-semantic sounds within the speech pause interval based on the micro-action trigger sequence is achieved through the following steps: the response voice is input into the audio playback unit of the humanoid robot for playback, and the current playback time is monitored in real time during playback; when the current playback time reaches the micro-action trigger time recorded in the micro-action trigger sequence, the micro-action instruction sequence corresponding to the micro-action duration recorded in the micro-action trigger sequence is retrieved from the preset micro-action instruction library; and the micro-action instruction sequence is sent to the joint actuator array of the humanoid robot for execution, thereby completing the interactive feedback on non-semantic sounds.

[0047] In specific implementation, firstly, the response voice is transmitted to the humanoid robot's audio playback unit for real-time playback, while a clock module continuously monitors the current playback time to obtain the current playback moment. Then, when the current playback moment is exactly consistent with the micro-action trigger moment recorded in the micro-action trigger sequence, a micro-action instruction sequence of the corresponding duration is retrieved from a preset micro-action instruction library based on the micro-action duration recorded in the micro-action trigger sequence. The micro-action instruction sequence is a combination of servo angle and time instructions that control the humanoid robot to complete micro-actions such as nodding, blinking, and slight head turning. Finally, the micro-action instruction sequence is sent to the humanoid robot's joint actuator array through the UART serial communication interface. The joint actuator array drives the servo to rotate to a specified angle and maintain it for the corresponding duration according to the instructions, thereby completing the interactive feedback of non-semantic sounds within the time period corresponding to the voice pause. The joint actuator array is a combination of motion control components consisting of robot head and neck servos and drive circuits.
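The monitoring and dispatch logic of paragraph [0047] can be sketched with a simulated playback clock. This is an illustration under assumptions, not the application's implementation: playback position advances in fixed ticks instead of being read from a clock module, a trigger fires once the position reaches or passes its trigger time (rather than on exact equality, which a discrete clock could miss), and `send` is a hypothetical callback standing in for the UART write to the joint actuator array. The instruction library contents are placeholders.

```python
def run_playback(reply_total_s, trigger_sequence, instruction_library,
                 send, tick_s=0.01):
    """Step through the response voice timeline and dispatch micro-action
    instructions. trigger_sequence holds (trigger_time, duration) pairs;
    instruction_library maps a duration to an instruction sequence.
    Returns the number of micro-actions dispatched."""
    pending = sorted(trigger_sequence)
    index = 0
    total_ticks = int(round(reply_total_s / tick_s))
    for step in range(total_ticks + 1):
        now = step * tick_s  # simulated current playback moment
        while index < len(pending) and now >= pending[index][0]:
            _, duration = pending[index]
            send(instruction_library[duration])
            index += 1
    return index
```

Triggers whose time lies beyond the end of the response voice are never dispatched in this sketch, which keeps micro-actions inside the playback window.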

[0048] It should be noted that, in this application, non-semantic voice interactive feedback refers to the responses made by humanoid robots to users through physical actions such as head posture, facial expressions, and micro-movements during multimodal voice interaction. Specifically, this includes anthropomorphic micro-movements synchronized with the voice rhythm, such as nodding, blinking, slightly turning the head, and eye contact. This type of feedback can convey a state of attention, understanding, recognition, or listening in conjunction with the voice output, and is an important way to improve the naturalness and anthropomorphism of the interaction.

[0049] Furthermore, in another aspect, in some embodiments, this application provides a multimodal voice interaction system for humanoid robots. Referring to Figure 4, which is a schematic diagram of the structure of a multimodal voice interaction system for humanoid robots according to some embodiments of this application, the system includes an acquisition module 201, a processing module 202, and an execution module 203, which are described below. The acquisition module 201 is mainly used to acquire the user's voice stream, the user's facial image sequence, and the user's gesture image sequence in the target voice interaction scenario. The processing module 202 is mainly used to extract facial expression features from the user's facial image sequence, extract gesture motion features from the user's gesture image sequence, and determine the intent correction factor of the user's non-verbal modality for speech semantics based on the facial expression features and the gesture motion features. The processing module 202 is further configured to generate dialogue response text from the user's voice stream, use the intent correction factor to semantically adjust the dialogue response text to generate adjusted dialogue response text, and then synthesize the adjusted dialogue response text into a response voice. The execution module 203 is mainly used to determine the speech pause intervals in the user's speech stream, identify the micro-movement triggering sequence of the humanoid robot during the playback of the response voice based on the speech pause intervals, and then control the humanoid robot to perform interactive feedback on non-semantic sounds within the speech pause intervals based on the micro-movement triggering sequence.

[0050] In addition, this application also provides a computer device, the computer device including a memory and a processor, the memory storing code, and the processor being configured to acquire the code and execute the above-described multimodal voice interaction method for humanoid robots.

[0051] In some embodiments, referring to Figure 5, which is a schematic diagram of the structure of a computer device implementing a multimodal voice interaction method for humanoid robots according to some embodiments of this application, the multimodal voice interaction method for humanoid robots in the above embodiments can be implemented by the computer device shown in Figure 5. The computer device includes at least one processor 301, a communication bus 302, a memory 303, and at least one communication interface 304.

[0052] The processor 301 can be a general-purpose central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more devices used to control the execution of the multimodal voice interaction method for humanoid robots in this application.

[0053] The communication bus 302 can be used to transmit information between the aforementioned components.

[0054] The memory 303 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions; a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions; an electrically erasable programmable read-only memory (EEPROM); a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.); a magnetic disk or other magnetic storage device; or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto. The memory 303 may exist independently and be connected to the processor 301 via the communication bus 302. The memory 303 may also be integrated with the processor 301.

[0055] The memory 303 stores program code for executing the scheme of this application, and its execution is controlled by the processor 301. The processor 301 executes the program code stored in the memory 303. The program code may include one or more software modules. The multimodal voice interaction method for humanoid robots in the above embodiments can be implemented by the processor 301 together with one or more software modules in the program code in the memory 303.

[0056] Communication interface 304 uses any transceiver-like device for communicating with other devices or communication networks, such as Ethernet, radio access network (RAN), wireless local area networks (WLAN), etc.

[0057] In a specific implementation, as one example, a computer device may include multiple processors, each of which may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. Here, a processor may refer to one or more devices, circuits, and / or processing cores used to process data (e.g., computer program instructions).

[0058] The aforementioned computer device can be a general-purpose computer device or a special-purpose computer device. In specific implementations, the computer device can be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, or an embedded device. This application does not limit the type of computer device.

[0059] In addition, this application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described multimodal voice interaction method for humanoid robots.

[0060] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.

[0061] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.

Claims

1. A multimodal voice interaction method for humanoid robots, characterized in that it includes the following steps: acquiring a user voice stream, a user facial image sequence, and a user gesture image sequence in a target voice interaction scenario; extracting facial expression features from the user facial image sequence, extracting gesture motion features from the user gesture image sequence, and determining, based on the facial expression features and the gesture motion features, the intent correction factor of the user's non-verbal modality for speech semantics; generating a dialogue response text from the user voice stream, semantically adjusting the dialogue response text using the intent correction factor to generate an adjusted dialogue response text, and synthesizing the adjusted dialogue response text into a response voice; and determining the speech pause intervals in the user voice stream, identifying, based on the speech pause intervals, the micro-movement triggering sequence of the humanoid robot during playback of the response voice, and controlling the humanoid robot, based on the micro-movement triggering sequence, to provide interactive feedback on non-semantic sounds within the speech pause intervals.

2. The method as described in claim 1, characterized in that, Extracting facial expression features from the user's facial image sequence specifically includes: Perform face region detection on each user face image in the user face image sequence to obtain the face bounding box region corresponding to each user face image; Facial feature points are located based on the bounding box region of the face image of each user, and the coordinate set of facial feature points corresponding to each user's face image is obtained. Based on the coordinate set of all facial feature points, the bounding box region of the face image of each user is aligned to the preset standard face template coordinate system by performing an affine transformation on the bounding box region of the face image of each user to obtain the aligned normalized face image sequence. For each user's facial image in the normalized face image sequence, the reference frame difference features and intra-frame texture deformation features are extracted respectively. The reference frame difference features and intra-frame texture deformation features of each user's facial image are then concatenated along the frame order to obtain facial expression features.

3. The method as described in claim 1, characterized in that, Extracting gesture motion features from the user gesture image sequence specifically includes: For each user gesture image in the user gesture image sequence, hand region detection is performed to obtain the hand bounding box region corresponding to each user gesture image; Hand skeleton keypoints are detected based on the bounding box region of the hand corresponding to each user's gesture image, and the coordinate set of the hand skeleton keypoints corresponding to each user's gesture image is obtained. The coordinate set of the hand skeleton keypoints corresponding to each user's gesture image is arranged in the frame order to form a spatiotemporal sequence of hand skeleton keypoints. Inter-frame difference calculation is performed on the coordinates of the same hand joint point in adjacent user gesture images in the spatiotemporal sequence of the hand bone key points to obtain the inter-frame displacement vector sequence of each hand joint point; The inter-frame displacement vector sequences of all hand joints are spatially normalized according to the joint topological adjacency relationship to obtain the gesture motion features.

4. The method as described in claim 1, characterized in that determining the intent correction factor of the user's non-verbal modality for speech semantics based on the facial expression features and the gesture motion features specifically includes: time-aligning the facial expression features and the gesture motion features to obtain a synchronized non-verbal feature pair sequence; concatenating the facial expression features and gesture motion features at each time step in the non-verbal feature pair sequence to obtain the fused non-verbal representation vector at each time step; subjecting the fused non-verbal representation vector at each time step to cross-modal attention interaction with the speech semantic features of the corresponding time step to obtain the intent offset at each time step; and pooling and aggregating the intent offsets at all time steps along the time dimension to obtain the intent correction factor of the user's non-verbal modality for speech semantics.

5. The method as described in claim 1, characterized in that generating dialogue response text from the user's voice stream specifically includes: performing speech recognition on the user's voice stream to obtain a user text sequence; performing word segmentation on the user text sequence to obtain a word sequence, and mapping the word sequence to word vectors to obtain a word embedding vector sequence; performing contextual semantic encoding on the word embedding vector sequence to obtain the speech semantic features; and inputting the speech semantic features into a preset dialogue response generator to generate the dialogue response text.

6. The method as described in claim 1, characterized in that determining the speech pause intervals in the user's voice stream specifically includes: performing voice activity detection on the user's voice stream to obtain a sequence of spoken segments and a sequence of silent segments; performing non-lexical speech recognition on each spoken segment in the sequence of spoken segments to obtain the non-lexical speech segments within each spoken segment and their start and end timestamps, the non-lexical speech segments including filled pauses; extending the start and end timestamps of each non-lexical speech segment backward and forward, respectively, to the nearest silence boundary in the adjacent sequence of silent segments to obtain each extended speech pause candidate interval; and extracting, from the user's voice stream, the audio segment corresponding to each speech pause candidate interval, and performing semantic coherence verification on the speech content before and after the audio segment, wherein, if the verification passes, the speech pause candidate interval is determined to be a speech pause interval.
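
By way of non-limiting illustration, the boundary extension step of claim 6 can be sketched as follows. The sketch takes one literal reading of "nearest silence boundary": the start timestamp moves back to the closest silent-segment boundary at or before it, and the end timestamp moves forward to the closest silent-segment boundary at or after it. The function name `extend_to_silence` and the tuple-based interval format are hypothetical, and the subsequent semantic coherence verification is not shown.

```python
from bisect import bisect_left, bisect_right

def extend_to_silence(seg, silences):
    """seg: (start, end) of a non-lexical segment, in seconds.
    silences: list of (start, end) silent segments from voice activity detection.
    """
    # Every start and end of a silent segment counts as a silence boundary.
    bounds = sorted({t for s in silences for t in s})
    start, end = seg
    i = bisect_right(bounds, start)  # number of boundaries at or before start
    j = bisect_left(bounds, end)     # index of first boundary at or after end
    # Keep the original timestamp when no boundary exists on that side.
    new_start = bounds[i - 1] if i > 0 else start
    new_end = bounds[j] if j < len(bounds) else end
    return (new_start, new_end)
```

For example, with silent segments `[(0.0, 0.5), (2.0, 2.4)]`, the non-lexical segment `(1.2, 1.6)` is extended to the candidate interval `(0.5, 2.0)`.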

7. The method as described in claim 1, characterized in that the user's voice stream in the target voice interaction scenario is acquired by a microphone array integrated into the head of a humanoid robot.

8. A multimodal voice interaction system for humanoid robots, used to execute the multimodal voice interaction method for humanoid robots as described in any one of claims 1 to 7, characterized in that the system includes: an acquisition module, used to acquire the user's voice stream, the user facial image sequence, and the user gesture image sequence in the target voice interaction scenario; a processing module, used to extract facial expression features from the user facial image sequence, extract gesture motion features from the user gesture image sequence, and determine the intent correction factor of the user's non-verbal modality on speech semantics based on the facial expression features and the gesture motion features, the processing module being further used to generate dialogue response text from the user's voice stream, perform semantic adjustment on the dialogue response text using the intent correction factor to generate adjusted dialogue response text, and then synthesize the adjusted dialogue response text into a response speech; and an execution module, used to determine the speech pause intervals in the user's voice stream, identify the micro-movement triggering sequence of the humanoid robot during the playback of the response speech based on the speech pause intervals, and then control the humanoid robot to perform interactive feedback on non-semantic sounds within the speech pause intervals based on the micro-movement triggering sequence.

9. A computer device, characterized in that the computer device includes a memory and a processor, the memory storing code, and the processor being configured to invoke the code to execute the multimodal voice interaction method for humanoid robots as described in any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the multimodal voice interaction method for humanoid robots as described in any one of claims 1 to 7.