Multimodal input-based speaker-independent real-time gesture generation system
A multimodal system generates real-time gestures based on audio and text without speaker IDs, addressing integration and responsiveness issues, facilitating adaptable and context-aware gesture generation.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: KOREA ELECTRONICS TECH INST
- Filing Date: 2025-11-05
- Publication Date: 2026-06-18
AI Technical Summary
Existing co-speech gesture generation systems rely on speaker IDs, making them difficult to integrate with data from new speakers and unsuitable for rapid interaction environments.
The invention provides a multimodal input-based system that generates gestures in real time from audio, text, and poses without a speaker ID. It employs neural networks to extract the speaker's style and speech patterns, which enables Zero-Shot Learning for style adaptation and real-time synchronization.
The system enables flexible, real-time gesture generation that adapts to various speakers and contexts, supporting non-verbal communication in fields such as education and therapy.
Description
Multimodal input-based speaker-independent real-time gesture generation system
[0001] The present invention relates to a system for generating gestures synchronized with speech in the field of human-computer interaction (HCI), and more specifically, to a system for generating gestures in real time using multimodal data such as audio, text, and poses without a speaker ID.
[0002] Existing Co-speech Gesture Generation systems primarily utilize speaker IDs to generate personalized gestures.
[0003] However, this method is dependent on specific data and has the problem of being difficult to integrate with data from new speakers.
[0004] In addition, due to limitations in real-time responsiveness, it is not suitable for environments requiring rapid interaction.
[0005] The present invention has been devised to solve the above-mentioned problems, and the objective of the present invention is to provide a multimodal input-based speaker-independent real-time gesture generation system and method that generates various speaker gestures in real time based on multimodal data such as audio, text, and poses without a speaker ID (individual identity information of the speaker).
[0006] A method for generating a speaker-independent real-time gesture based on a multimodal input according to an embodiment of the present invention for achieving the above objective comprises: a step in which a system acquires multimodal data; a step in which a system generates a speaker ID embedding that reflects the speaker's style and speech pattern from the acquired multimodal data; and a step in which a system inserts the generated speaker ID embedding into a multimodal gesture generation model to generate a gesture based on the multimodal data.
[0007] The system receives multimodal data including audio data, text data, and gesture data as input, wherein the audio data reflects the speaker's tone and style, the text data includes text content reflecting the speaker's style, and the gesture data may include the speaker's style and gesture patterns.
[0008] In addition, speaker ID embeddings are generated through a neural network that extracts features from audio and text data, and can be converted into information that can be associated with gesture patterns to generate gestures that match the speaker's characteristics.
[0009] The multimodal gesture generation model can be trained to generate gestures using, as training data, speaker ID embeddings generated by extracting features from audio and text data, or speaker ID embeddings associated with gesture patterns.
[0010] In addition, the trained multimodal gesture generation model can utilize Zero-Shot Learning techniques to reflect the speaker's unique style through speaker ID embeddings, while enabling style modification when training on new speakers or data from new datasets.
[0011] In addition, the trained multimodal gesture generation model can generate gestures in real time that are synchronized with the content of input audio or text data based on speaker ID embeddings.
[0012] Additionally, the system may include a processing engine that analyzes the content of audio data or text data in real time based on speaker ID embeddings and generates gestures.
[0013] To process audio data and text data in parallel, the processing engine can implement an audio data processing pipeline that analyzes audio data and a text data processing pipeline that analyzes text data as parallel pipelines.
[0014] Additionally, the system receives multimodal data including audio data and text data as input, the audio data reflects the speaker's tone and style, and the text data includes text content reflecting the speaker's style, and the system can generate gestures through a Speaker ID embedding generated based on the audio data and text data even when gesture data is not input.
[0015] Meanwhile, a multimodal input-based speaker-independent real-time gesture generation system according to another embodiment of the present invention comprises: an input unit for acquiring multimodal data; and a processor for generating a speaker ID embedding that reflects the speaker's style and speech pattern from the acquired multimodal data, and inserting the generated speaker ID embedding into a multimodal gesture generation model to generate a gesture based on the multimodal data.
[0016] Additionally, a multimodal input-based speaker-independent real-time gesture generation method according to another embodiment of the present invention comprises: a step in which a system generates a speaker ID embedding that reflects the speaker's style based on features generated through multimodal data including audio data and text data; and a step in which the system inserts the generated speaker ID embedding into a multimodal gesture generation model to generate a gesture suitable for the speaker's characteristics based on the multimodal data.
[0017] And, according to another embodiment of the present invention, a multimodal input-based speaker-independent real-time gesture generation system comprises: an embedding generation unit that generates a speaker ID embedding that reflects the speaker's style based on features generated through multimodal data including audio data and text data; and a gesture generation unit that inserts the generated speaker ID embedding into a multimodal gesture generation model to generate a gesture suitable for the speaker's characteristics based on the multimodal data.
[0018] As described above, according to embodiments of the present invention, by generating various speaker gestures in real time based on multimodal data such as audio, text, and poses without a speaker ID, it is possible to provide gestures capable of situation recognition and personalized style variation.
[0019] In addition, according to embodiments of the present invention, it is possible to generate gestures that are not dependent on the speaker and to generate various styles of gestures in real time, thereby enabling speaker-independent and context-appropriate gesture generation, which can provide more flexible responsiveness in real-time interaction environments and can be utilized for non-verbal communication analysis in fields such as education and therapy.
[0020] FIG. 1 is a drawing provided to describe a multimodal input-based speaker-independent real-time gesture generation system according to an embodiment of the present invention.
[0021] FIG. 2 is a drawing provided for a more detailed description of a processor according to one embodiment of the present invention.
[0022] FIG. 3 is a flowchart provided for describing a multimodal input-based speaker-independent real-time gesture generation method according to an embodiment of the present invention.
[0023] FIG. 4 is a flowchart provided for a more detailed description of a multimodal input-based speaker-independent real-time gesture generation method according to one embodiment of the present invention.
[0024] The present invention will be described in more detail below with reference to the drawings. To clearly explain the invention, parts unrelated to the description have been omitted from the drawings, and in the drawings, the width, length, thickness, etc., of the components may be exaggerated for convenience.
[0025] FIG. 1 is a drawing provided to describe a multimodal input-based speaker-independent real-time gesture generation system according to one embodiment of the present invention.
[0026] The multimodal input-based speaker-independent real-time gesture generation system according to the present embodiment (hereinafter collectively referred to as the "system") is provided to generate various speaker gestures in real time based on multimodal data such as audio, text, and poses without a speaker ID (individual identity information of the speaker).
[0027] To this end, the system may include an input unit (100), a processor (200), and a storage unit (300).
[0028] The input unit (100) may include a communication module connected to a network and can acquire multimodal data. Here, the multimodal data may consist of audio data and text data, and may include gesture data in some cases.
[0029] For example, the input unit (100) may receive multimodal data consisting of audio data, text data and gesture data as input, or multimodal data consisting only of audio data and text data as input.
[0030] In this case, the audio data included in the multimodal data may reflect the speaker's tone and style.
[0031] In addition, the text data included in the multimodal data may include text content that reflects the speaker's style.
[0032] In addition, gesture data included in multimodal data may include the speaker's style and gesture patterns.
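The two input configurations described above (audio, text, and gesture data, or audio and text data only) can be pictured as a record with one optional field. The following is a minimal sketch, not the applicant's data format: the class name `MultimodalInput`, its field names, and the list-based representations of audio samples and pose frames are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MultimodalInput:
    """Hypothetical container for the data acquired by the input unit (100)."""
    audio: List[float]                           # audio samples (tone, speaking style)
    text: str                                    # transcript (wording style)
    gesture: Optional[List[List[float]]] = None  # optional pose frames

    def modalities(self) -> List[str]:
        """List the modalities that are actually present."""
        present = ["audio", "text"]
        if self.gesture is not None:
            present.append("gesture")
        return present


full = MultimodalInput(audio=[0.0, 0.1], text="hello", gesture=[[0.0, 1.0]])
audio_text_only = MultimodalInput(audio=[0.0, 0.1], text="hello")
print(full.modalities())             # ['audio', 'text', 'gesture']
print(audio_text_only.modalities())  # ['audio', 'text']
```

Making the gesture field optional mirrors the statement that gesture data is included only in some cases.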
[0033] The storage unit (300) is provided to store programs and data necessary for the operation of the processor (200).
[0034] The processor (200) can process all necessary matters to generate various speaker gestures in real time based on multimodal data such as audio, text, and poses without a speaker ID.
[0035] Specifically, the processor (200) can generate a speaker ID embedding that reflects the speaker's style and speech pattern from multimodal data obtained through the input unit (100), and insert the generated speaker ID embedding into a multimodal gesture generation model to generate a gesture based on the multimodal data.
[0036] Here, speaker ID embeddings can be generated through a neural network that extracts features from audio and text data, and can be converted into information that can be associated with gesture patterns to generate gestures that match the speaker's characteristics.
[0037] FIG. 2 is a drawing provided for a more detailed description of a processor (200) according to one embodiment of the present invention.
[0038] Referring to FIG. 2, the processor (200) may include an embedding generation unit (210) and a gesture generation unit (220).
[0039] The embedding generation unit (210) can generate a speaker ID embedding that reflects the speaker's style and speech pattern from multimodal data obtained through the input unit (100).
[0040] Specifically, the embedding generation unit (210) can generate a speaker ID embedding that reflects the speaker's style and speech pattern through a neural network that extracts features from audio data and text data, and convert this into information that can be linked to a gesture pattern.
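As a rough illustration of the embedding generation unit (210), the sketch below maps audio and text features to a fixed-length style vector. Everything in it is an assumption: the hand-written feature functions, the single layer with fixed random weights, and the embedding size of 4 stand in for the trained neural network, whose architecture the description does not specify.

```python
import math
import random
from typing import List

EMBED_DIM = 4  # assumed embedding size, chosen only for illustration


def audio_features(samples: List[float]) -> List[float]:
    """Toy audio features: mean energy and zero-crossing rate."""
    n = max(len(samples), 1)
    energy = sum(s * s for s in samples) / n
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    return [energy, crossings / n]


def text_features(text: str) -> List[float]:
    """Toy text features: word count and mean word length."""
    words = text.split()
    mean_len = sum(len(w) for w in words) / max(len(words), 1)
    return [float(len(words)), mean_len]


def make_linear(in_dim: int, out_dim: int, seed: int) -> List[List[float]]:
    """Fixed random weights standing in for a trained layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


W = make_linear(4, EMBED_DIM, seed=0)


def speaker_embedding(samples: List[float], text: str) -> List[float]:
    """Map concatenated audio and text features to a unit-length style vector."""
    x = audio_features(samples) + text_features(text)
    h = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]
    norm = math.sqrt(sum(v * v for v in h)) or 1.0
    return [v / norm for v in h]


emb = speaker_embedding([0.1, -0.2, 0.3, -0.1], "well hello there")
print(len(emb))  # 4
```

The point of the sketch is only that the vector is computed from the utterance itself, so no speaker identifier is needed as input.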
[0041] The gesture generation unit (220) can generate a gesture based on multimodal data by inserting a speaker ID embedding into a multimodal gesture generation model.
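One common way to insert an embedding into a generation model is to concatenate it with the per-frame content features before the model's layers. The sketch below assumes that approach. The description does not say how the insertion is performed, and the class `ToyGestureGenerator`, its dimensions, and its untrained random weights are hypothetical.

```python
import random
from typing import List

Vector = List[float]


class ToyGestureGenerator:
    """Stand-in for the multimodal gesture generation model (untrained)."""

    def __init__(self, content_dim: int, style_dim: int, pose_dim: int, seed: int = 0):
        rng = random.Random(seed)
        in_dim = content_dim + style_dim
        self.weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                        for _ in range(pose_dim)]

    def generate(self, content_frames: List[Vector], style: Vector) -> List[Vector]:
        """Produce one pose vector per frame, conditioned on the style embedding."""
        poses = []
        for frame in content_frames:
            x = frame + style  # assumed insertion method: concatenation
            poses.append([sum(w * v for w, v in zip(row, x)) for row in self.weights])
        return poses


gen = ToyGestureGenerator(content_dim=2, style_dim=3, pose_dim=4)
frames = [[0.2, 0.1], [0.4, 0.0], [0.1, 0.3]]
calm = gen.generate(frames, style=[1.0, 0.0, 0.0])
lively = gen.generate(frames, style=[0.0, 1.0, 0.0])
print(len(calm), len(calm[0]))  # 3 4
```

With the same content frames, changing only the style vector changes the generated poses, which is the behavior the speaker ID embedding is meant to provide.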
[0042] A multimodal gesture generation model can be trained to generate gestures by utilizing speaker ID embeddings generated by extracting features from audio and text data, or speaker ID embeddings associated with gesture patterns, as training data.
[0043] In addition, the trained multimodal gesture generation model can generate gestures in real time that are synchronized with the content of input audio or text data based on speaker ID embeddings.
[0044] That is, the gesture generation unit (220) can train the multimodal gesture generation model by utilizing the speaker ID embedding generated through the embedding generation unit (210) or the speaker ID embedding associated with the gesture pattern as training data for the multimodal gesture generation model.
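To make the training step concrete, the following sketch fits a small linear model whose inputs are content features concatenated with a style embedding. It is an illustration under stated assumptions only: the description names neither a loss function nor an optimizer, so squared error and plain stochastic gradient descent are used here, and the synthetic data and all function names are hypothetical.

```python
import random
from typing import List, Tuple

Vector = List[float]
Example = Tuple[Vector, Vector, float]  # (content features, style embedding, target pose value)


def predict(weights: Vector, content: Vector, style: Vector) -> float:
    """Linear model over the concatenated content and style inputs."""
    return sum(w * v for w, v in zip(weights, content + style))


def train(examples: List[Example], epochs: int = 200, lr: float = 0.1) -> Vector:
    """Fit the weights with plain stochastic gradient descent on squared error."""
    dim = len(examples[0][0]) + len(examples[0][1])
    weights = [0.0] * dim
    for _ in range(epochs):
        for content, style, target in examples:
            x = content + style
            error = predict(weights, content, style) - target
            weights = [w - lr * error * v for w, v in zip(weights, x)]
    return weights


def mean_squared_error(weights: Vector, examples: List[Example]) -> float:
    """Average squared prediction error over the examples."""
    return sum((predict(weights, c, s) - t) ** 2 for c, s, t in examples) / len(examples)


# Synthetic data: the target depends on both the content and the style embedding.
rng = random.Random(0)
true_w = [0.5, -0.3, 0.8, 0.2]
data = []
for _ in range(20):
    content = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    style = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((content, style, predict(true_w, content, style)))

before = mean_squared_error([0.0] * 4, data)
after = mean_squared_error(train(data), data)
print(after < before)  # True
```

Because the style embedding is part of the model input during training, the fitted weights capture how style changes the output, rather than memorizing a per-speaker identifier.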
[0045] In this case, the trained multimodal gesture generation model can utilize Zero-Shot Learning techniques to reflect the speaker's unique style through speaker ID embeddings, while enabling style modification when learning from new speakers or new datasets.
[0046] Afterwards, the gesture generation unit (220) can generate a gesture synchronized with the content of the input audio data or text data in real time by applying a speaker ID embedding to the trained multimodal gesture generation model.
[0047] To this end, the gesture generation unit (220) may include a processing engine that analyzes the content of audio data or text data in real time based on speaker ID embeddings and generates a gesture.
[0048] The processing engine is implemented with a structure optimized for real-time interaction, capable of analyzing the speaker's utterance in real time and generating gestures immediately.
[0049] To this end, the processing engine can have an audio data processing pipeline that analyzes audio data and a text data processing pipeline that analyzes text data implemented in parallel so that audio data and text data are processed in parallel.
[0050] That is, the gesture generation unit (220) can process and analyze input audio data or text data in parallel by implementing an audio data processing pipeline that analyzes audio data and a text data processing pipeline that analyzes text data in parallel, and generate a gesture synchronized with the content of the audio data or text data in real time based on this.
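The parallel arrangement of the two pipelines can be sketched with Python's standard `concurrent.futures` module. This is an illustration under assumptions, not the disclosed processing engine: the function names and the toy analyses (peak amplitude, word count) are hypothetical placeholders for the real audio and text analysis.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def audio_pipeline(samples: List[float]) -> Dict[str, float]:
    """Toy audio analysis: peak amplitude as a stand-in for prosody features."""
    return {"peak": max((abs(s) for s in samples), default=0.0)}


def text_pipeline(text: str) -> Dict[str, float]:
    """Toy text analysis: word count as a stand-in for semantic features."""
    return {"words": float(len(text.split()))}


def analyze_chunk(samples: List[float], text: str,
                  pool: ThreadPoolExecutor) -> Dict[str, float]:
    """Submit both pipelines to the pool, then merge their results."""
    audio_future = pool.submit(audio_pipeline, samples)
    text_future = pool.submit(text_pipeline, text)
    return {**audio_future.result(), **text_future.result()}


with ThreadPoolExecutor(max_workers=2) as pool:
    result = analyze_chunk([0.1, -0.6, 0.3], "nice to meet you", pool)
print(result)  # {'peak': 0.6, 'words': 4.0}
```

In CPython, threads do not run pure-Python computation simultaneously, so a real engine would more likely rely on a process pool or on inference libraries that release the interpreter lock. The submit-then-merge structure stays the same in either case.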
[0051] The gestures generated by the gesture generation unit (220) can reflect various speaker styles, and the unit provides scalability for adapting to various situations.
[0052] Through this, the system can support not only real-time co-speech gesture generation but also customized gesture generation tailored to specific environments or individual styles.
[0053] FIG. 3 is a flowchart provided to describe a multimodal input-based speaker-independent real-time gesture generation method according to one embodiment of the present invention.
[0054] The multimodal input-based speaker-independent real-time gesture generation method according to the present embodiment can be executed by the system described above with reference to FIGS. 1 and 2.
[0055] Referring to FIG. 3, when the system acquires multimodal data consisting of audio data and text data (gesture data is optional) (S310), it generates a speaker ID embedding that reflects the speaker's style and speech pattern from the acquired multimodal data (S320), and inserts the generated speaker ID embedding into a multimodal gesture generation model (S330) to generate a gesture based on the multimodal data (S340).
[0056] That is, the system can receive audio, text, and gesture data (optional) as input as described above, wherein each data may include the speaker's tone of voice, style, text content, and gesture pattern.
[0057] And the system can generate speaker ID embeddings that reflect the speaker's style and speech patterns from the input audio and text data.
[0058] When speaker ID embeddings are generated through a neural network that extracts audio and text features, they can be converted into information that can be associated with gesture patterns, and the generated speaker ID embeddings are inserted into a multimodal gesture generation model to generate gestures based on multimodal data.
[0059] Through this, the system can receive audio and text data and generate gestures synchronized with speech in real time without relying on the speaker's individual identity information (speaker ID).
[0060] FIG. 4 is a flowchart provided for a more detailed description of a multimodal input-based speaker-independent real-time gesture generation method according to one embodiment of the present invention.
[0061] Referring to FIG. 4, when audio, text, and gesture data (optional) are input (S410), the system can extract features from the input audio data and text data or extract features from the audio data, text data, and gesture data (S420), and integrate the extracted features using a loading or attention mechanism (S430) to generate a speaker ID embedding that reflects the speaker's style and speech pattern (S440).
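The integration step (S430) can be illustrated with a simple attention-style weighting over the per-modality feature vectors. The sketch below shows only the attention option and assumes scaled dot-product scoring against a query vector. In a trained system the query would be learned, and the feature values here are made up.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features: Dict[str, Vector], query: Vector) -> Vector:
    """Weight each modality by its scaled dot product with the query, then sum."""
    names = list(features)
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, features[n])) / scale for n in names]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * features[n][i] for w, n in zip(weights, names)) for i in range(dim)]


# Per-modality feature vectors (S420); the values are made up for illustration.
features = {
    "audio": [1.0, 0.0],
    "text": [0.0, 1.0],
}
query = [0.0, 0.0]  # a zero query gives every modality the same weight
fused = attention_fuse(features, query)
print(fused)  # [0.5, 0.5]
```

Because the function iterates over whatever keys are present, an optional "gesture" entry can be added to the dictionary without changing the fusion code.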
[0062] As described above, the speaker ID embedding can be converted into information that can be associated with a gesture pattern, and the generated speaker ID embedding is inserted into a multimodal gesture generation model (S450) so that a gesture can be generated based on multimodal data (S460).
[0063] A multimodal gesture generation model can be trained to enable various style variations regardless of the speaker's ID, while learning the speaker's gesture style by utilizing speaker ID embeddings as training data.
[0064] Specifically, the multimodal gesture generation model reflects the speaker's unique style through speaker ID embeddings, while also being able to adapt to new speakers or datasets through Zero-Shot Learning techniques.
[0065] Through this, the multimodal gesture generation model can generate gestures in real time using only input data, even when given speakers or styles not included in the existing training.
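The difference between ID-based conditioning and the input-derived embedding described here can be shown with a small contrast. The sketch is illustrative only: the lookup table, the speaker names, and the toy encoder `style_from_input` are hypothetical and do not represent the disclosed model.

```python
from typing import Dict, List

Vector = List[float]

# ID-based conditioning: styles exist only for speakers seen during training.
style_table: Dict[str, Vector] = {"speaker_a": [0.9, 0.1], "speaker_b": [0.2, 0.8]}


def style_from_id(speaker_id: str) -> Vector:
    """Look up a stored style; raises KeyError for an unseen speaker."""
    return style_table[speaker_id]


def style_from_input(samples: List[float], text: str) -> Vector:
    """Toy encoder: derive a style vector from the utterance itself."""
    loudness = sum(abs(s) for s in samples) / max(len(samples), 1)
    words = text.split()
    verbosity = min(len(words) / 10.0, 1.0)
    return [loudness, verbosity]


try:
    style_from_id("new_speaker")
    lookup_ok = True
except KeyError:
    lookup_ok = False

new_style = style_from_input([0.5, -0.5, 0.5, -0.5], "hi there")
print(lookup_ok)  # False
print(new_style)  # [0.5, 0.2]
```

A lookup keyed by speaker ID has nothing to return for a speaker outside the training set, whereas an encoder that reads the input can always produce a style vector, which is what makes generation for unseen speakers possible.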
[0066] The features extracted from the multimodal data are used to generate style embeddings that reflect the speaker's speech style, tone, and manner of textual expression. Based on these embeddings, the system can automatically adjust gesture styles and dynamically generate gestures suited to various speakers.
[0067] So far, a preferred embodiment of a multimodal input-based speaker-independent real-time gesture generation system has been described in detail.
[0068] According to embodiments of the present invention, by generating various speaker gestures in real time based on multimodal data such as audio, text, and poses without a speaker ID, it is possible to provide gestures capable of situation recognition and personalized style variation.
[0069] In addition, according to embodiments of the present invention, it is possible to generate gestures that are not dependent on the speaker and to generate various styles of gestures in real time, thereby enabling speaker-independent and context-appropriate gesture generation, which can provide more flexible responsiveness in real-time interaction environments and can be utilized for non-verbal communication analysis in fields such as education and therapy.
[0070] Meanwhile, it goes without saying that the technical concept of the present invention may also be applied to a computer-readable recording medium containing a computer program that enables the device and method according to the present embodiment to perform their functions. Furthermore, the technical concept according to various embodiments of the present invention may be implemented in the form of computer-readable code recorded on a computer-readable recording medium. A computer-readable recording medium may be any data storage device that can be read by a computer and store data. For example, a computer-readable recording medium may be a ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, hard disk drive, etc. Additionally, computer-readable code or a program stored on a computer-readable recording medium may be transmitted through a network connected between computers.
[0071] Furthermore, although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described above. Various modifications are possible by those skilled in the art without departing from the essence of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or perspective of the present invention.
Claims
1. A multimodal input-based speaker-independent real-time gesture generation method comprising: a step in which a system acquires multimodal data; a step in which the system generates a speaker ID embedding that reflects the speaker's style and speech pattern from the acquired multimodal data; and a step in which the system inserts the generated speaker ID embedding into a multimodal gesture generation model to generate a gesture based on the multimodal data.
2. The method of claim 1, wherein the system receives multimodal data including audio data, text data, and gesture data as input, the audio data reflects the speaker's tone and style, the text data includes text content reflecting the speaker's style, and the gesture data includes the speaker's style and gesture patterns.
3. The method of claim 2, wherein the speaker ID embedding is generated through a neural network that extracts features from the audio data and the text data, and is converted into information that can be associated with gesture patterns so as to generate a gesture suited to the speaker's characteristics.
4. The method of claim 1, wherein the multimodal gesture generation model is trained to generate gestures using, as training data, speaker ID embeddings generated by extracting features from audio data and text data, or speaker ID embeddings associated with gesture patterns.
5. The method of claim 4, wherein the trained multimodal gesture generation model uses a Zero-Shot Learning technique to reflect the speaker's unique style through speaker ID embeddings, while enabling style modification when learning from new speakers or new datasets.
6. The method of claim 4, wherein the trained multimodal gesture generation model generates, in real time, a gesture synchronized with the content of input audio data or text data based on a speaker ID embedding.
7. The method of claim 6, wherein the system includes a processing engine that analyzes the content of audio data or text data in real time based on speaker ID embeddings and generates gestures.
8. The method of claim 7, wherein the processing engine implements, in parallel, an audio data processing pipeline for analyzing audio data and a text data processing pipeline for analyzing text data, so that audio data and text data are processed in parallel.
9. The method of claim 1, wherein the system receives multimodal data including audio data and text data as input, the audio data reflects the speaker's tone and style, the text data includes text content reflecting the speaker's style, and the system generates a gesture through a speaker ID embedding generated based on the audio data and the text data even when gesture data is not input.
10. A multimodal input-based speaker-independent real-time gesture generation system comprising: an input unit for acquiring multimodal data; and a processor that generates a speaker ID embedding reflecting the speaker's style and speech pattern from the acquired multimodal data, and inserts the generated speaker ID embedding into a multimodal gesture generation model to generate a gesture based on the multimodal data.
11. A multimodal input-based speaker-independent real-time gesture generation method comprising: a step in which a system generates a speaker ID embedding that reflects the speaker's style based on features generated through multimodal data including audio data and text data; and a step in which the system inserts the generated speaker ID embedding into a multimodal gesture generation model to generate a gesture suitable for the speaker's characteristics based on the multimodal data.
12. A multimodal input-based speaker-independent real-time gesture generation system comprising: an embedding generation unit that generates a speaker ID embedding reflecting the speaker's style based on features generated through multimodal data including audio data and text data; and a gesture generation unit that inserts the generated speaker ID embedding into a multimodal gesture generation model to generate a gesture suitable for the speaker's characteristics based on the multimodal data.