Dual processing-based gesture processing system and method utilizing speech or text
The dual-processing-based gesture processing system addresses the lack of contextual information in existing methods by encoding voice or text data, combining the encoding result with pose data to generate a gesture, and then reprocessing that gesture based on the input data, resulting in more natural and meaningful gestures.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- KOREA ELECTRONICS TECH INST
- Filing Date
- 2025-11-07
- Publication Date
- 2026-06-18
AI Technical Summary
Existing gesture processing technologies fail to adequately incorporate contextual information, leading to difficulties in generating natural and meaningful gestures, particularly in complex interactions.
The invention provides a dual-processing-based method that encodes voice or text data, combines the encoding result with pose data to generate a gesture, and then reprocesses that gesture to enhance its contextual consistency and naturalness.
By reprocessing the initial gesture output, the method aims to enhance contextual consistency, generate more natural and meaningful gestures, and improve the quality of gesture reconstruction.
Abstract
Description
Dual processing-based gesture processing system and method utilizing voice or text
[0001] The present invention relates to human-computer interaction (HCI) and gesture recognition technology, and more specifically, to a technology for generating, analyzing, and processing gestures using voice or text data.
[0002] Existing gesture processing technologies primarily generate gestures by encoding voice or text data and processing the pose data once.
[0003] For example, most methods generate gestures after processing pose data using the output of a pre-trained text or speech encoder.
[0004] However, this single-pass approach does not sufficiently reflect contextual information, which makes it difficult to maintain semantic consistency between gestures and linguistic data.
[0005] As a result, generating natural and meaningful gestures becomes difficult, and there are limitations in handling complex interactions.
[0006] The present invention has been devised to solve the above-mentioned problems, and the objective of the present invention is to provide a dual-processing-based gesture processing system and method utilizing voice or text data.
[0007] A dual-processing-based gesture processing method using voice or text according to an embodiment of the present invention for achieving the above objective comprises: a step in which a system acquires data; a step in which the system generates a gesture based on an encoding result of the acquired data; and a step in which the system reprocesses the generated gesture based on the acquired data to generate a final gesture.
[0008] In the step of acquiring data, voice data or text data may be acquired together with pose data.
[0009] In the step of generating a gesture, the gesture may be generated by combining the encoding result of the acquired voice data or text data with the pose data.
[0010] In the step of generating the final gesture, the final gesture may be generated by reprocessing the generated gesture based on the acquired voice data or text data.
[0011] The step of generating the final gesture may supplement the contextual meaning of the generated gesture to create a final gesture that is more natural than the generated gesture.
[0012] In addition, the dual-processing-based gesture processing method using voice or text according to the present embodiment may further include a step in which the system encodes the voice data or text data to convert the acquired voice data or text data into information that can be combined with the pose data.
[0013] The voice data may reflect the speaker's tone and style.
[0014] The text data may include text content that reflects the speaker's style.
[0015] The pose data may include the speaker's style and pose patterns.
[0016] Meanwhile, a dual-processing-based gesture processing system utilizing voice or text according to another embodiment of the present invention includes: an input unit for acquiring data; and a processor for generating a gesture based on the encoding result of the acquired data and reprocessing the generated gesture based on the acquired data to generate a final gesture.
[0017] A dual-processing-based gesture processing method utilizing voice or text according to another embodiment of the present invention comprises: a step in which a system encodes voice data or text data; a step in which the system combines the encoding result of the voice data or text data with pose data to generate a gesture; and a step in which the system reprocesses the generated gesture based on the acquired voice data or text data to generate a final gesture.
[0018] Additionally, a dual-processing-based gesture processing system utilizing voice or text according to another embodiment of the present invention comprises: a first gesture generation unit that encodes voice data or text data and combines the encoding result of the voice data or text data with pose data to generate a gesture; and a second gesture generation unit that reprocesses the generated gesture based on the acquired voice data or text data to generate a final gesture.
[0019] As explained above, according to the embodiments of the present invention, the result of a gesture generated in the first processing is reprocessed (second processing) to enhance contextual consistency and provide a natural and meaningful gesture.
[0020] FIG. 1 is a drawing provided for the description of a dual processing-based gesture processing system utilizing voice or text according to an embodiment of the present invention.
[0021] FIG. 2 is a drawing provided for a more detailed description of a processor according to one embodiment of the present invention.
[0022] FIG. 3 is a flowchart provided to describe a dual processing-based gesture processing method using voice or text according to an embodiment of the present invention.
[0023] The present invention will be described in more detail below with reference to the drawings. To clearly explain the invention, parts unrelated to the description have been omitted from the drawings, and in the drawings, the width, length, thickness, etc., of the components may be exaggerated for convenience.
[0024] FIG. 1 is a drawing provided to describe a dual processing-based gesture processing system utilizing voice or text according to an embodiment of the present invention.
[0025] The dual-processing-based gesture processing system utilizing voice or text according to the present embodiment (hereinafter referred to as the 'system') generates a gesture using voice or text data (primary processing) and then reprocesses the result (secondary processing), thereby enhancing contextual consistency and providing natural and meaningful gestures.
[0026] To this end, the system may include an input unit (100), a processor (200), and a storage unit (300).
[0027] The input unit (100) may include a communication module connected to a network or the like, through which it acquires input data. Here, the input data may consist of voice data and text data, and may additionally include pose data depending on the case.
[0028] For example, the input unit (100) may receive voice data, text data and pose data as input, or only voice data and text data as input.
[0029] In this case, the voice data included in the input data may reflect the speaker's tone and style.
[0030] And the text data included in the input data may include text content that reflects the speaker's style.
[0031] In addition, the pose data included in the input data may include the speaker's style and pose patterns.
[0032] The storage unit (300) is provided to store programs and data necessary for the operation of the processor (200).
[0033] The processor (200) handles all processing needed to provide a natural and meaningful gesture: it reprocesses (secondary processing) the result of a gesture generated (primary processing) using voice or text data in order to enhance contextual consistency.
[0034] Specifically, the processor (200) can encode input data obtained through the input unit (100), generate a gesture based on the encoding result, and reprocess the generated gesture based on the obtained input data to generate a final gesture.
[0035] FIG. 2 is a drawing provided for a more detailed description of a processor (200) according to one embodiment of the present invention.
[0036] Referring to FIG. 2, the processor (200) may include a first gesture generation unit (210) and a second gesture generation unit (220).
[0037] The first gesture generation unit (210) can encode input data obtained through the input unit (100) and generate a gesture based on the encoding result.
[0038] For example, the first gesture generation unit (210) can encode the voice data or text data to convert the acquired voice data or text data into information that can be combined with the pose data.
[0039] And the first gesture generation unit (210) can generate a gesture by combining the encoding result of voice data or text data and pose data.
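As an illustration of paragraphs [0038] and [0039], the following minimal Python sketch shows one way an encoding result might be made combinable with pose data. The character-bucket encoder, the concatenation-based combination, and the names encode_text and combine_with_pose are assumptions made for illustration only; the specification does not disclose a particular encoder or combination method, and a real system would presumably use a trained model in both places.

```python
from typing import List

Vector = List[float]


def encode_text(text: str, dim: int = 4) -> Vector:
    """Toy stand-in for a text encoder (assumption, not the patented encoder).

    Folds character codes into a fixed-length vector so that the result has
    the same shape regardless of the length of the input text.
    """
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def combine_with_pose(encoding: Vector, pose_frames: List[Vector]) -> List[Vector]:
    """Append the encoding to every pose frame.

    Concatenation is only one assumed way of 'combining'; the output is the
    conditioned representation that a gesture generator would consume.
    """
    return [frame + encoding for frame in pose_frames]


# Two pose frames with two joint values each (made-up numbers).
pose_frames = [[0.0, 0.1], [0.2, 0.3]]
encoding = encode_text("hello")
conditioned = combine_with_pose(encoding, pose_frames)
```

Each conditioned frame keeps its original pose values at the front and carries the same linguistic encoding at the back, so the number of frames is unchanged.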
[0040] The second gesture generation unit (220) can generate a final gesture by reprocessing the generated gesture based on the acquired input data.
[0041] For example, the second gesture generation unit (220) can generate a final gesture by reprocessing the generated gesture based on voice data or text data.
[0042] At this time, the second gesture generation unit (220) can supplement the contextual meaning of the generated gesture to generate a final gesture that is more natural than the generated gesture.
[0043] As a specific example, when the first gesture generation unit (210) encodes either voice data or text data and combines the encoding result with pose data to generate a gesture, the second gesture generation unit (220) can generate a final gesture by reprocessing the generated gesture based on the type of data (text data or voice data) that was not used in the encoding.
[0044] In another example, when the first gesture generation unit (210) encodes voice data and text data separately and combines each encoding result with pose data to generate a gesture, the second gesture generation unit (220) can reprocess the generated gesture based on both the voice data and the text data to generate a final gesture.
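The choice of reprocessing input described in paragraphs [0043] and [0044] can be sketched as a small selection rule. This is a hedged illustration: the function name pick_reprocessing_inputs, the dictionary layout, and the fallback when only one type of linguistic data is available are assumptions, not details taken from the specification.

```python
from typing import Dict, Optional, Set

# The two kinds of linguistic input named in the description.
LINGUISTIC = ("voice", "text")


def pick_reprocessing_inputs(
    inputs: Dict[str, Optional[object]],
    used_in_encoding: Set[str],
) -> Set[str]:
    """Decide which linguistic inputs the second stage should reprocess with.

    Prefers data that was NOT used in the first-stage encoding ([0043]).
    If every available type was already encoded, reuses all of them ([0044]).
    """
    present = {name for name in LINGUISTIC if inputs.get(name) is not None}
    unused = present - used_in_encoding
    return unused if unused else present


# Made-up inputs; "pose" is ignored by the selection rule.
inputs = {"voice": b"\x00\x01", "text": "nice to meet you", "pose": [[0.0, 0.1]]}
only_voice_encoded = pick_reprocessing_inputs(inputs, {"voice"})
both_encoded = pick_reprocessing_inputs(inputs, {"voice", "text"})
```

Under these assumptions, encoding only the voice data leaves the text data for the second stage, while encoding both leaves both available for reprocessing.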
[0045] Meanwhile, the processor (200) may further include a verification unit (not shown) that verifies gesture quality (e.g., contextual consistency) in addition to the first gesture generation unit (210) and the second gesture generation unit (220).
[0046] The verification unit (not shown) can verify the quality of the gesture generated through the first gesture generation unit (210) and determine whether to reprocess the gesture generated through the first gesture generation unit (210) based on the verification result.
[0047] That is, the verification unit (not shown) verifies the quality of the gesture generated through the first gesture generation unit (210); if the gesture satisfies a preset quality standard, the reprocessing by the second gesture generation unit (220) can be omitted and the gesture can be provided to the user as the final gesture.
[0048] Additionally, the verification unit (not shown) can verify the quality of the final gesture generated by reprocessing through the second gesture generation unit (220), and repeat the reprocessing according to the verification result.
[0049] For example, suppose the second gesture generation unit (220) generates a final gesture by reprocessing, based on voice data or text data, a gesture generated through the first gesture generation unit (210). If the verification unit (not shown) determines that the quality of the final gesture does not satisfy a preset quality standard, the final gesture can be updated by reprocessing the previously reprocessed gesture again based on the data (text data or voice data) that was not used in the previous reprocessing.
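The verification behavior of paragraphs [0046] to [0049] amounts to a quality-gated loop, which the following Python sketch illustrates. The quality score, the threshold, the halving "reprocess" step, and the max_rounds cap are all hypothetical stand-ins; the specification names no metric and no iteration limit.

```python
from typing import Callable, List, Sequence

Gesture = List[float]


def refine_until_ok(
    initial: Gesture,
    modalities: Sequence[str],
    reprocess: Callable[[Gesture, str], Gesture],
    quality: Callable[[Gesture], float],
    threshold: float,
    max_rounds: int = 4,  # assumed safety cap, not from the specification
) -> Gesture:
    """Reprocess a gesture until it meets the quality standard.

    If the first-stage gesture already passes, it is returned untouched.
    Otherwise each round switches to the next modality, so a round never
    reuses the data type of the round before it when two types are given.
    If the cap is reached, the last result is returned without a final check.
    """
    gesture = list(initial)
    for round_index in range(max_rounds):
        if quality(gesture) >= threshold:
            break
        modality = modalities[round_index % len(modalities)]
        gesture = reprocess(gesture, modality)
    return gesture


calls: List[str] = []  # records which modality each round used


def toy_reprocess(gesture: Gesture, modality: str) -> Gesture:
    """Placeholder second stage: damps the motion by half."""
    calls.append(modality)
    return [value * 0.5 for value in gesture]


def toy_quality(gesture: Gesture) -> float:
    """Placeholder score: smaller average motion counts as higher quality."""
    return 1.0 - sum(abs(v) for v in gesture) / len(gesture)


final = refine_until_ok(
    [0.8, -0.8], ["voice", "text"], toy_reprocess, toy_quality, threshold=0.75
)
```

In this toy run the initial gesture fails the check, is reprocessed once with voice data and once with text data, and then passes, so no third round is performed.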
[0050] FIG. 3 is a flowchart provided to describe a dual processing-based gesture processing method using voice or text according to an embodiment of the present invention.
[0051] The dual processing-based gesture processing method using voice or text according to the present embodiment can be executed by the system described above with reference to FIGS. 1 and 2.
[0052] Referring to FIG. 3, the system can encode input data (voice data and/or text data) obtained through the input unit (100) (S310), and combine the encoding result with pose data to generate a gesture (S320).
[0053] The system can then generate a final gesture by reprocessing the generated gesture based on the input data (voice data and/or text data) (S330).
[0054] Through these two processing steps, the system can maintain a natural connection between voice or text data and gestures, improve the quality of gesture reconstruction, and ensure consistent gesture generation.
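The order of steps S310 to S330 can be sketched end to end as follows. Only the ordering and the data flow come from the description; the three toy stages (an emphasis score as the "encoding", amplitude scaling as "generation", and damping for questions as "reprocessing") are invented placeholders for what would presumably be learned models.

```python
from typing import List


def encode(text: str) -> float:
    """S310 (toy): score emphasis as the share of uppercase or '!' characters."""
    if not text:
        return 0.0
    marked = sum(1 for ch in text if ch.isupper() or ch == "!")
    return marked / len(text)


def generate(encoding: float, pose: List[float]) -> List[float]:
    """S320 (toy): combine the encoding with pose data by scaling amplitude."""
    return [value * (1.0 + encoding) for value in pose]


def reprocess(gesture: List[float], text: str) -> List[float]:
    """S330 (toy): revisit the input text and soften the gesture for questions."""
    factor = 0.5 if text.strip().endswith("?") else 1.0
    return [value * factor for value in gesture]


def run_pipeline(text: str, pose: List[float]) -> List[float]:
    """Run the first pass (S310, S320) and then the second pass (S330)."""
    encoding = encode(text)
    initial_gesture = generate(encoding, pose)
    return reprocess(initial_gesture, text)


question = run_pipeline("ok?", [1.0, 2.0])
shout = run_pipeline("HI", [1.0])
```

The point of the sketch is that the second pass receives both the first-pass gesture and the original input, so it can use context that the single encoding value did not carry.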
[0055] So far, a dual-processing-based gesture processing system and method utilizing voice or text have been described in detail with reference to preferred embodiments.
[0056] According to an embodiment of the present invention, a gesture is generated using voice or text data (first processing) and the result is then reprocessed (second processing), whereby contextual consistency can be enhanced and natural and meaningful gestures can be provided.
[0057] Meanwhile, it goes without saying that the technical concept of the present invention may also be applied to a computer-readable recording medium containing a computer program that enables the device and method according to the present embodiment to perform their functions. Furthermore, the technical concept according to various embodiments of the present invention may be implemented in the form of computer-readable code recorded on a computer-readable recording medium. A computer-readable recording medium may be any data storage device that can be read by a computer and store data. For example, a computer-readable recording medium may be a ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, hard disk drive, etc. Additionally, computer-readable code or a program stored on a computer-readable recording medium may be transmitted through a network connected between computers.
[0058] Furthermore, although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described above. Various modifications may be made by those skilled in the art without departing from the gist of the invention as set forth in the claims, and such modifications should not be understood separately from the technical spirit or outlook of the present invention.
Claims
1. A dual-processing-based gesture processing method utilizing voice or text, comprising: a step in which a system acquires data; a step in which the system generates a gesture based on an encoding result of the acquired data; and a step in which the system reprocesses the generated gesture based on the acquired data to generate a final gesture.
2. The method of claim 1, wherein the step of acquiring data acquires voice data or text data, and pose data.
3. The method of claim 2, wherein the step of generating a gesture generates the gesture by combining the encoding result of the acquired voice data or text data with the pose data.
4. The method of claim 2, wherein the step of generating the final gesture generates the final gesture by reprocessing the generated gesture based on the acquired voice data or text data.
5. The method of claim 4, wherein the step of generating the final gesture supplements the contextual meaning of the generated gesture to generate a final gesture that is more natural than the generated gesture.
6. The method of claim 1, further comprising a step in which the system encodes voice data or text data to convert the acquired voice data or text data into information that can be combined with pose data.
7. The method of claim 1, wherein the voice data reflects the speaker's tone and style.
8. The method of claim 1, wherein the text data includes text content that reflects the speaker's style.
9. The method of claim 1, wherein the pose data includes the speaker's style and pose patterns.
10. A dual-processing-based gesture processing system utilizing voice or text, comprising: an input unit for acquiring data; and a processor that generates a gesture based on an encoding result of the acquired data and reprocesses the generated gesture based on the acquired data to generate a final gesture.
11. A dual-processing-based gesture processing method utilizing voice or text, comprising: a step in which a system encodes voice data or text data; a step in which the system combines the encoding result of the voice data or text data with pose data to generate a gesture; and a step in which the system reprocesses the generated gesture based on the acquired voice data or text data to generate a final gesture.
12. A dual-processing-based gesture processing system utilizing voice or text, comprising: a first gesture generation unit that encodes voice data or text data and combines the encoding result of the voice data or text data with pose data to generate a gesture; and a second gesture generation unit that generates a final gesture by reprocessing the generated gesture based on the acquired voice data or text data.