Video virtual human sign language system for digital television

A digital television and virtual human technology, applied in the field of video virtual human sign language systems, which achieves natural gesture movement, balances imaging quality against running efficiency, and saves manpower and material resources

Publication Date: 2012-06-13 (Inactive)
SUN YAT SEN UNIV
Cites: 2 | Cited by: 16

AI-Extracted Technical Summary

Problems solved by technology

In addition, existing systems focus on the generation of gestures while ignoring the facial expressions that accompany them, leaving the resulting sign language expression incomplete.


Abstract

The invention discloses a video virtual human sign language system for digital television. The system demultiplexes the program source stream and decodes the audio, video, and other data information, where the other data information includes subtitle text. The subtitle text is input into a virtual sign language generation module, which retrieves the corresponding sign language data from a sign language database according to each text entry and draws the graphics to generate sign language frames, with appropriate smoothing applied between different gestures. The sign language frames and the program's audio information are then synchronized, superimposed, and output. The system saves manpower and material resources and produces standardized results; at the same time, content-based smoothing makes gesture movements natural, and coordination between facial expressions and gestures is introduced, making sign language expression more accurate and realistic.


Examples

  • Experimental program (1)

Example Embodiment

[0032] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.
[0033] The embodiment of the present invention provides a digital-TV-oriented video virtual human sign language system, which saves manpower and material resources and offers accurate, standardized presentation. It is described in detail below.
[0034] The purpose of the present invention is to solve the above-mentioned defects in the prior art and to provide a virtual-human-based sign language system with better effect. The main problems to be solved are: (1) synchronization of sign language frames with program information; (2) smoothing of gesture movement; (3) drawing of facial expressions coordinated with gestures; (4) system integration and modularization.
[0035] The technical scheme adopted by the present invention is as follows: first, the program source stream is demultiplexed, and the audio, video, and other data information are decoded, where the other data information includes subtitle text. The subtitle text is input into the virtual human sign language generation module, which retrieves the corresponding sign language data from the sign language database according to each text entry, then draws the graphics to generate sign language frames, applying appropriate smoothing between different gestures. The sign language frames are synchronized with the program's audio information, superimposed, and output. The overall system diagram is shown in Figure 1.
[0036] The sign language generation module is the core module of the system. It comprises a text parsing module, a gesture generation module, an expression generation module, a gesture-and-expression synthesis module, a frame sequence smoothing and simplification module, and a synchronization module. The input of the text parsing module is the subtitle text sequence; the parser segments each subtitle sentence into words, and the resulting words are looked up in the sign language database to obtain the corresponding gesture data and expression data. The present invention adopts the H-Anim (Humanoid Animation) standard to model the virtual human: a gesture can be represented by a 56-element vector, with the abstract schematic of the hand and arm shown in Figure 2, and a sign language movement can be represented by a vector function from time to the gesture set. The face object is represented by a three-dimensional mesh model, where the facial definition parameters (FDP) describe the shape, texture, and other characteristics of the face, and the facial animation parameters (FAP) describe the motion state of the face. Gesture drawing and facial expression drawing are based on the OpenGL library, which is convenient to implement, algorithmically mature, and highly portable. The sequence of sign language frames formed after drawing is not the final result: because different gestures differ in position and orientation, some of them greatly, smoothing is performed between frames. Considering the correlation among the 56 elements of the gesture vector, the dimensionality can be further reduced and adapted dynamically, which decreases the amount of data and improves drawing speed. Since the sign language frame sequence is to be superimposed and fused with the program video frames, speed matching and synchronization between them is essential: the time information parsed by the text parsing module marks the start time and end time of each subtitle, and the sign language frames can be adjusted and synchronized according to these two times. At the same time, the synchronization between the program video frame sequence and the sign language frames feeds back into the smoothing and simplification of the sign language frame sequence.
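As an illustration of this representation, the following is a minimal sketch assuming a plain array of joint angles for the 56-element gesture vector and a simple linear blend as the time-to-gesture function; the type and function names are illustrative, not from the patent.

```cpp
#include <algorithm>
#include <array>
#include <functional>

using GestureVector = std::array<float, 56>;  // joint angles of the hand and arm

// A sign language movement: a vector function from time to the gesture set.
using SignMovement = std::function<GestureVector(double)>;

// Example: a linear blend between a start and an end gesture over `duration`
// seconds (a placeholder motion model, not the patent's actual trajectories).
SignMovement makeLinearMovement(const GestureVector& from,
                                const GestureVector& to, double duration) {
    return [=](double t) {
        double s = duration > 0 ? std::clamp(t / duration, 0.0, 1.0) : 1.0;
        GestureVector g{};
        for (std::size_t i = 0; i < g.size(); ++i)
            g[i] = from[i] + static_cast<float>(s) * (to[i] - from[i]);
        return g;
    };
}
```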
[0037] The process of generating sign language frames is shown in Figure 3; the specific steps are as follows, with a code skeleton of the whole loop after the list:
[0038] Step 1: The text parsing module obtains the subtitle text sequence from the subtitle text channel, parses the current subtitle text, and directly obtains the start time and end time of the subtitle for synchronization; it generates gesture data and expression data by matching against the sign language database, then goes to Step 2.
[0039] Step 2: Draw with OpenGL according to the gesture data and expression data to generate a sequence of sign language frames, then go to Step 3.
[0040] Step 3: Insert a corresponding number of smoothing frames according to the difference between gestures in adjacent frames (i.e., perform smoothing), and use the information redundancy between gestures to simplify the sequence, then go to Step 4.
[0041] Step 4: Synchronize the sign language frames with the program information using the time information, adjusting the frame rate of the sign language frames; the time information also serves as feedback to adjust the smoothing and simplification.
[0042] Step 5: Output the sequence of sign language frames as the input of the video overlay; end.
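The following interface-level skeleton shows how Steps 1 through 5 might chain together. The module functions are stubs with invented names standing in for the modules described above; this is a sketch of the control flow, not the patent's implementation.

```cpp
#include <string>
#include <utility>
#include <vector>

// Placeholder types standing in for the real module outputs.
struct GestureData {};
struct ExpressionData {};
struct SignFrame { double timestamp = 0; /* plus the rendered image */ };

// Stubbed module interfaces; each stands for a module named in Steps 1-4.
std::pair<std::vector<GestureData>, std::vector<ExpressionData>>
parseAndLookup(const std::string& /*subtitleText*/) { return {}; }      // Step 1
std::vector<SignFrame> drawFrames(const std::vector<GestureData>&,
                                  const std::vector<ExpressionData>&) { // Step 2
    return {};
}
void smoothAndSimplify(std::vector<SignFrame>&) {}                      // Step 3
void syncToSubtitle(std::vector<SignFrame>&, double, double) {}         // Step 4

// The five-step loop: parse and look up, draw, smooth/simplify, sync, output.
std::vector<SignFrame> generateSignFrames(const std::string& text,
                                          double startTime, double endTime) {
    auto [gestures, expressions] = parseAndLookup(text);
    auto frames = drawFrames(gestures, expressions);
    smoothAndSimplify(frames);
    syncToSubtitle(frames, startTime, endTime);
    return frames;  // Step 5: handed to the video overlay stage
}
```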
[0043] The functions of the text parsing module include text editing input, text segmentation, and conversion of Chinese words to sign language codes. The text editing input edits and preprocesses the input Chinese sentences so that they conform to the subsequent text segmentation. Text segmentation divides sentences into words, with punctuation marks separated out as words. The system's word segmentation first applies the maximum matching method, then uses the first-pass result to invoke word rules by locating the ambiguity flags of entries and performs ambiguity correction. The basic thesaurus contains the Chinese words corresponding to the sign language words that the synthesis system can synthesize; the gesture library contains the hand shape data of those sign language words; and the facial expression library stores the mapping between facial expression data and sign language words. Unless indicated separately, the gesture library and the facial expression library are collectively referred to as the gesture library. The mapping among Chinese words, sign language words, gestures, and expressions is shown in Figure 4. A minimal sketch of the first-pass segmentation follows.
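This sketch implements forward maximum matching only; it treats the input as single-byte text for simplicity, whereas real Chinese segmentation would need multi-byte-aware handling and the ambiguity-correction pass described above, both omitted here.

```cpp
#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Forward maximum matching: greedily take the longest dictionary word
// starting at the current position; unmatched characters fall out as
// single-character words.
std::vector<std::string> maxMatchSegment(const std::string& sentence,
                                         const std::set<std::string>& dict,
                                         std::size_t maxWordLen) {
    std::vector<std::string> words;
    std::size_t i = 0;
    while (i < sentence.size()) {
        std::size_t len = std::min(maxWordLen, sentence.size() - i);
        while (len > 1 && dict.count(sentence.substr(i, len)) == 0) --len;
        words.push_back(sentence.substr(i, len));
        i += len;
    }
    return words;
}

int main() {
    std::set<std::string> dict{"sign", "language", "system"};
    for (const auto& w : maxMatchSegment("signlanguagesystem", dict, 8))
        std::cout << w << '\n';  // sign / language / system
}
```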
[0044] One problem to be solved by the present invention is the synchronization of sign language frames with program information. A convenient and feasible method is to insert the start time and end time of each subtitle into the subtitle sequence, which saves more time and labor than a karaoke-style lyric-timing method and is also more accurate. In fact, subtitle production is already part of the recording process for many programs and already includes the start and end time of each sequence, so this problem is relatively easy to solve. The other kind of synchronization is determined by the characteristics of sign language itself. Sign language is a body language that expresses meaning through the movements of the hands and arms together with changes in expression, and its pace differs greatly from that of the program video, so mechanically superimposing the sign language frame sequence onto the program video sequence would inevitably lead to inconsistency in meaning. The invention adopts a frame timing strategy based on context content: the time interval between frames is determined by the degree of gesture change. When the change between two frames is large, the time interval between them is also large; conversely, if the movement between two frames changes little, the interval between them should be small. In addition, smoothing is performed between frames with large changes, inserting an appropriate number of smoothing frames to make the action coherent, as the sketch below illustrates.
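A tiny sketch of this timing rule, assuming a Euclidean distance on the 56-element gesture vector and illustrative scaling constants; the patent does not specify the metric or the constants.

```cpp
#include <array>
#include <cmath>

using GestureVector = std::array<float, 56>;

// Euclidean distance between two gesture configurations.
double gestureDistance(const GestureVector& a, const GestureVector& b) {
    double sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = static_cast<double>(a[i]) - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Larger gesture change -> longer interval between the two frames;
// small change -> shorter interval, clamped to a sensible range.
double frameIntervalMs(const GestureVector& a, const GestureVector& b,
                       double minMs = 20.0, double maxMs = 120.0,
                       double msPerUnit = 2.0) {
    double ms = minMs + msPerUnit * gestureDistance(a, b);
    return ms > maxMs ? maxMs : ms;
}
```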
[0045] The smoothness of the virtual human's gesture movement directly affects its intelligibility. The particularity of virtual human gesture movement is that it is an animation sequence spliced together from meta-animation data, and there can be large differences in gesture between two adjacent sign language words, or between different root words of the same sign language word. Without smoothing, the span between some actions is too large and the motion too fast, leading to visual blur. The solution is to interpolate frames for smoothing based on the size of the difference between the two actions. To generate the inserted frames, the Hermite interpolation algorithm can be used to interpolate the joint angle vector. The number of inserted frames depends on the gap between the two gestures: the larger the gap, the more frames are inserted; conversely, the smaller the gap, the fewer frames are inserted.
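A sketch of this smoothing step, using the standard cubic Hermite basis on the joint-angle vector. The zero endpoint tangents (an ease-in/ease-out choice) and the frame-count policy are assumptions, not taken from the patent.

```cpp
#include <array>
#include <vector>

using GestureVector = std::array<float, 56>;

// Cubic Hermite: p(s) = h00*p0 + h10*m0 + h01*p1 + h11*m1, s in [0,1].
GestureVector hermite(const GestureVector& p0, const GestureVector& p1,
                      const GestureVector& m0, const GestureVector& m1,
                      float s) {
    float s2 = s * s, s3 = s2 * s;
    float h00 = 2*s3 - 3*s2 + 1, h10 = s3 - 2*s2 + s;
    float h01 = -2*s3 + 3*s2,    h11 = s3 - s2;
    GestureVector out;
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = h00*p0[i] + h10*m0[i] + h01*p1[i] + h11*m1[i];
    return out;
}

// Insert n smoothing frames between two gestures; n would be chosen from
// the gesture distance (the larger the gap, the more frames).
std::vector<GestureVector> smoothingFrames(const GestureVector& from,
                                           const GestureVector& to, int n) {
    GestureVector zero{};  // zero tangents: ease in and ease out
    std::vector<GestureVector> frames;
    for (int k = 1; k <= n; ++k)
        frames.push_back(hermite(from, to, zero, zero,
                                 static_cast<float>(k) / (n + 1)));
    return frames;
}
```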
[0046] Sign language is a relatively stable expression system in which gestures are primary and facial expressions are supplementary, so gestures alone inevitably produce incomplete expression. The present invention therefore generates not only the gesture actions of sign language but also facial expressions, adopting the MPEG-4-based facial animation method. MPEG-4 is an object-based multimedia compression standard; since people occupy a very important position in multimedia, MPEG-4 defines an international standard for the three-dimensional facial animation format. MPEG-4 defines facial definition parameters (FDP) and facial animation parameters (FAP): the FDP defines the shape, texture, and other features of the face, while the FAP describes the movement of the face. The FDP definition requires determining 84 facial feature points (FP), which describe the position and shape of the main parts of the face, including the eyes, eyebrows, mouth, tongue, and teeth. MPEG-4 also defines 68 FAPs, including two high-level FAPs: the viseme FAP and the expression FAP. For the viseme FAP, some basic, distinct lip shapes can be predefined, and other lip shapes can be obtained as linear combinations of these basic shapes. The expression FAP works on the same principle: various rich expressions can be produced as linear combinations of several basic expressions. Apart from the high-level FAPs, each ordinary FAP defines the motion of a small area of the face. FAP values are expressed in facial animation parameter units (FAPU); using FAPU as the unit allows the same FAP parameters to be applied to different models while producing the same lip movements and expressions, rather than lip movements and expressions that vary with the model. A sketch of the expression blending follows.
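A minimal sketch of the linear combination of basic expressions (happiness, sadness, anger, fear, disgust, surprise), assuming 68-element FAP vectors in FAPU units; the weights are illustrative.

```cpp
#include <array>
#include <vector>

constexpr int kNumFaps = 68;
using FapVector = std::array<float, kNumFaps>;  // FAP values in FAPU units

// A new expression as a weighted sum of basic expression FAP vectors.
FapVector blendExpressions(const std::vector<FapVector>& basics,
                           const std::vector<float>& weights) {
    FapVector out{};  // zero-initialized
    for (std::size_t e = 0; e < basics.size() && e < weights.size(); ++e)
        for (int i = 0; i < kNumFaps; ++i)
            out[i] += weights[e] * basics[e][i];
    return out;
}
```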
[0047] The generation of facial expressions involves the setting of the facial definition parameters (FDP). The present invention uses the Xface tool to set the FDP for the three-dimensional face model. After the influence areas and deformation functions are defined, for a given input FAP parameter stream, the displacement of each vertex of the 3D face model at a given moment can be calculated according to the MPEG-4 animation driving method, and the face animation is finally drawn and rendered, roughly as sketched below.
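The per-vertex driving step might look roughly like the following, where the influence areas, weights, and directions that an Xface-style FDP configuration would supply are represented by an invented FapInfluence structure; this is a simplified sketch, not Xface's actual API.

```cpp
#include <vector>

struct Vec3 { float x = 0, y = 0, z = 0; };

// One vertex affected by a given FAP, with its motion direction and the
// deformation-function weight at that vertex (illustrative structure).
struct FapInfluence {
    int vertexIndex;  // vertex in the 3D face mesh
    Vec3 direction;   // motion direction for this FAP
    float weight;     // deformation-function weight for this vertex
};

// Displace every vertex in the FAP's influence area by a weighted amount
// scaled by the FAP value and its FAPU.
void applyFap(std::vector<Vec3>& vertices,
              const std::vector<FapInfluence>& influenceArea,
              float fapValue, float fapu) {
    for (const auto& inf : influenceArea) {
        Vec3& v = vertices[inf.vertexIndex];
        float d = fapValue * fapu * inf.weight;
        v.x += d * inf.direction.x;
        v.y += d * inf.direction.y;
        v.z += d * inf.direction.z;
    }
}
```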
[0048] The generation of facial expressions also includes the extraction of facial animation parameters (FAP). In order to drive the 3D virtual human with natural expressions, the FAP parameters of the basic expressions, such as happiness, sadness, anger, fear, disgust, and surprise, must be obtained. In theory, all facial expressions can be synthesized from these basic expressions.
[0049] Through the setting of facial definition parameters and facial animation parameters, combined with the sign language data, an expression suited to the current gesture can be selected, further enhancing the accuracy of the conveyed meaning.
[0050] In addition, in the video overlay part, a video overlay algorithm is implemented based on the RGB values ​​of the pixels. The process of video overlay can be described as: scan the main video image, position the pointer to the position that needs to be superimposed; scan the pixel value of the overlay image one by one, if it is a background pixel (use black as the background), skip it, if not, use This pixel value replaces the pixel value corresponding to the preset position in the main video; until the entire image is scanned. The real-time superposition of the video can be realized by repeating the above-mentioned superposition process for each image in the video.
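A minimal sketch of the black-keyed overlay, assuming 8-bit interleaved RGB frames (the Image layout is an assumption, not the patent's format).

```cpp
#include <cstdint>
#include <vector>

struct Image {
    int width = 0, height = 0;
    std::vector<std::uint8_t> rgb;  // width * height * 3, row-major
};

// Copy every non-black pixel of the sign language frame onto the main
// video frame at offset (x0, y0); black pixels are treated as background.
void overlay(Image& mainFrame, const Image& signFrame, int x0, int y0) {
    for (int y = 0; y < signFrame.height; ++y) {
        for (int x = 0; x < signFrame.width; ++x) {
            const std::uint8_t* s = &signFrame.rgb[(y * signFrame.width + x) * 3];
            if (s[0] == 0 && s[1] == 0 && s[2] == 0) continue;  // background
            int mx = x0 + x, my = y0 + y;
            if (mx < 0 || my < 0 || mx >= mainFrame.width || my >= mainFrame.height)
                continue;  // clip to the main frame
            std::uint8_t* d = &mainFrame.rgb[(my * mainFrame.width + mx) * 3];
            d[0] = s[0]; d[1] = s[1]; d[2] = s[2];
        }
    }
}
```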
[0051] The present invention modularizes the sign language system and packages it as middleware, which facilitates porting and allows it to run on different system platforms. Considering the rendering performance of different hardware platforms, the present invention adjusts to the hardware's capability: when hardware performance is low, the number of triangular facets representing the virtual human is reduced appropriately, sacrificing image quality for speed; conversely, when the platform hardware allows, the number of triangular facets can be increased to obtain higher imaging quality. One possible form of such an adjustment is sketched below.
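This sketch picks a triangle budget for the virtual human from a measured frame rate; the thresholds and budgets are invented for illustration, as the patent does not specify the policy.

```cpp
// Adapt the virtual human's triangle budget to measured performance:
// lower the budget when rendering struggles, raise it when there is headroom.
int chooseTriangleBudget(double measuredFps, int currentBudget,
                         int minBudget = 2000, int maxBudget = 20000) {
    if (measuredFps < 20.0)       // struggling: trade quality for speed
        currentBudget = currentBudget * 3 / 4;
    else if (measuredFps > 50.0)  // headroom: spend it on imaging quality
        currentBudget = currentBudget * 5 / 4;
    if (currentBudget < minBudget) currentBudget = minBudget;
    if (currentBudget > maxBudget) currentBudget = maxBudget;
    return currentBudget;
}
```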
[0052] In summary, the present invention generates a sign language frame sequence from the subtitle text and superimposes it onto the program video sequence. The generation of the sign language frame sequence considers not only gestures but also facial expressions, making sign language expression more accurate and richer. Appropriate smoothing is applied within the frame sequence so that frames with large differences in action transition smoothly; at the same time, the correlation within the gesture vector is used for simplification, and the number of facets can be adjusted to improve running efficiency. Finally, the modular design and the system middleware facilitate porting the system.
[0053] The beneficial effects brought by the technical solution of the present invention are as follows:
[0054] 1) Compared with manual recording, the virtual human sign language system saves manpower and material resources and is accurate and standardized;
[0055] 2) Content-based smoothing makes the movements between gestures natural, and the coordination of facial expressions with gestures is introduced, making sign language expression more accurate and realistic;
[0056] 3) The number of triangular facets of the virtual human is adjusted intelligently according to platform performance, balancing imaging quality and running efficiency;
[0057] 4) The modular design and middleware facilitate porting the entire system.
[0058] The present invention adopts the MPEG-4-based facial animation method for generating facial expressions; besides this, methods such as interpolation, parameterization, free-form deformation, muscle models, elastic meshes, and finite element methods can also be tried, and each has its own advantages.
[0059] There are likewise many ways to realize video overlay. In addition to overlaying based on RGB values, overlays based on luminance, alpha value, hue, and so on can also be used.
[0060] It should be noted that the information exchange, execution process, and so on between the above-mentioned devices and units in the system are based on the same concept as the method embodiments of the present invention; for details, please refer to the descriptions in the method embodiments, which are not repeated here.
[0061] Those of ordinary skill in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by instructing the relevant hardware through a program, and the program can be stored in a computer-readable storage medium; the storage medium may include read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
[0062] A digital-TV-oriented video virtual human sign language system provided by the embodiments of the present invention has been described in detail above. The principles and implementations of the present invention are explained with specific examples; the descriptions of the above embodiments are intended only to help in understanding the method of the present invention and its core idea. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
