[0032] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.
[0033] An embodiment of the present invention provides a video virtual human sign language system for digital TV, which saves manpower and material resources and offers accurate, standardized expression, among other advantages. The system is described in detail below.
[0034] The purpose of the present invention is to overcome the above-mentioned defects of the prior art and to provide a more effective sign language system based on a virtual human. The main problems to be solved are: (1) synchronization of sign language frames with program information; (2) smoothing of gesture movements; (3) rendering of facial expressions coordinated with gestures; (4) system integration and modularization.
[0035] The technical scheme adopted by the present invention is as follows: first, the program source code stream is demultiplexed, and the audio, video and other data are decoded, where the other data include the subtitle text; the subtitle text is input into the virtual human sign language generation module, which retrieves the corresponding sign language data from the sign language database according to each text entry, renders the graphics to generate sign language frames, and applies appropriate smoothing between different gestures; finally, the sign language frames are superimposed in synchrony with the program's audio information and output. The overall system is shown in Figure 1.
[0036] The sign language generation module is the core module of the system. It includes a text parsing module, a gesture generation module, an expression generation module, a gesture and expression synthesis module, a frame sequence smoothing and simplification module, and a synchronization module. The input of the text parsing module is the subtitle text sequence; the parser performs word segmentation on each subtitle sentence, and the resulting words are looked up in the sign language database to obtain the corresponding gesture data and expression data. The present invention adopts the H-Anim (Humanoid Animation) standard to model the virtual human, so that a gesture can be represented by a 56-element vector; an abstract schematic diagram of the hand and arm is shown in Figure 2. A sign language movement can then be represented as a vector-valued function from time to the gesture space. The face object is represented by a three-dimensional mesh model, in which the facial definition parameters (FDP) describe the shape, texture and other characteristics of the face, and the facial animation parameters (FAP) describe its motion state. Gesture rendering and facial expression rendering are based on the OpenGL library, which is easy to implement, algorithmically mature and highly portable. The sequence of sign language frames produced by rendering is not the final result: because adjacent gestures differ in position and orientation, sometimes considerably, smoothing must be performed between frames. Furthermore, considering the correlation among the 56 components of the gesture vector, the dimensionality can be further reduced and adapted dynamically, which decreases the amount of data and increases rendering speed. The sign language frame sequence must be superimposed and fused with the program video frames, so rate matching and synchronization between them are necessary; the time information parsed by the text parsing module marks the start time and end time of each subtitle, and the sign language frames can be adjusted and synchronized according to these two times. At the same time, the synchronization between the program video frame sequence and the sign language frames also serves as feedback that influences the smoothing and simplification of the sign language frame sequence.
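By way of illustration only, a gesture vector and a time-indexed sign language movement could be represented as in the following C++ sketch; the assignment of vector components to specific joints and the uniform sampling scheme are assumptions of the sketch, since the description above fixes only the 56-element dimension:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// One gesture as a 56-element joint-angle vector, following the
// H-Anim-based model described above. Which component maps to which
// joint is illustrative; only the dimension (56) comes from the text.
using GestureVector = std::array<float, 56>;

// A sign language movement as a mapping from time to gesture, stored
// here as uniformly sampled keyframes (uniform sampling is an assumption).
struct SignMovement {
    std::vector<GestureVector> frames;   // sampled gesture vectors
    double frameIntervalMs = 40.0;       // sampling period, e.g. 25 fps

    // Gesture at time t (milliseconds), nearest-sample lookup.
    // Precondition: frames is non-empty.
    const GestureVector& at(double tMs) const {
        std::size_t i = static_cast<std::size_t>(tMs / frameIntervalMs);
        if (i >= frames.size()) i = frames.size() - 1;
        return frames[i];
    }
};
```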
[0037] The process of generating sign language frames is shown in Figure 3; the specific steps are as follows:
[0038] Step 1: The text parsing module obtains the subtitle text sequence from the subtitle text channel, parses the current subtitle text, and directly obtains the start time and end time of the subtitle for synchronization; gesture data and expression data are generated by matching against the sign language database; go to Step 2.
[0039] Step 2: Render the gesture data and expression data with OpenGL to generate a sequence of sign language frames; go to Step 3.
[0040] Step 3: Insert an appropriate number of smoothing frames according to the difference between gestures in adjacent frames, i.e., perform smoothing, and use the information redundancy between gestures for simplification; go to Step 4.
[0041] Step 4: Synchronize the sign language frames with the program information using the time information, adjusting the frame rate of the sign language frames; the time information also serves as feedback to adjust the smoothing and simplification; go to Step 5.
[0042] Step 5: Output the sequence of sign language frames as the input of the video overlay; end.
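As a minimal numeric illustration of the timing adjustment in Step 4, assuming uniformly spaced frames (the description does not mandate uniform spacing), the per-frame display interval can be derived from the subtitle's start and end times:

```cpp
#include <cstddef>

// Derive a uniform per-frame display interval (ms) so that frameCount
// sign language frames span the subtitle's [startMs, endMs] window.
// Uniform spacing is an assumption of this sketch; in the system above
// the interval also varies with gesture change (see paragraph [0044]).
double frameIntervalMs(double startMs, double endMs, std::size_t frameCount) {
    if (frameCount < 2 || endMs <= startMs) return 0.0;  // degenerate input
    return (endMs - startMs) / static_cast<double>(frameCount - 1);
}
```

For example, a subtitle lasting 2000 ms rendered as 51 frames yields a 40 ms interval, i.e. 25 frames per second.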
[0043] The functions of the text parsing module include text editing and input, text segmentation, and conversion of Chinese words to sign language codes. Text editing and input edits and preprocesses the input Chinese sentences so that they conform to the requirements of the subsequent text segmentation. Text segmentation divides sentences into words, with punctuation marks separated as individual tokens; the system's word segmentation first applies the maximum matching method, then examines the first-pass result for ambiguity flags on entries, invokes the corresponding word rules, and performs ambiguity correction. The basic lexicon contains the Chinese words corresponding to the sign language words that the synthesis system can synthesize; the gesture library contains the hand shape data of those sign language words; and the mapping between facial expression data and sign language words is stored in the facial expression library. Unless indicated otherwise, the gesture library and the facial expression library are collectively referred to as the gesture library. The mapping from Chinese words to sign language words, and further to gestures and expressions, is shown in Figure 4.
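The first segmentation pass could look like the following sketch of forward maximum matching; the lexicon contents, the byte-oriented indexing (real Chinese text requires multi-byte-aware handling) and the omission of the rule-based ambiguity-correction pass are all simplifications:

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Forward maximum matching: at each position, take the longest prefix
// found in the lexicon, falling back to a single unit when nothing
// matches. The second, rule-based ambiguity-correction pass described
// above is omitted here.
std::vector<std::string> maxMatchSegment(const std::string& sentence,
                                         const std::set<std::string>& lexicon,
                                         std::size_t maxWordLen) {
    std::vector<std::string> words;
    std::size_t pos = 0;
    while (pos < sentence.size()) {
        std::size_t len = std::min(maxWordLen, sentence.size() - pos);
        while (len > 1 && lexicon.count(sentence.substr(pos, len)) == 0)
            --len;  // shrink until a lexicon hit or a single unit remains
        words.push_back(sentence.substr(pos, len));
        pos += len;
    }
    return words;
}
```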
[0044] One problem solved by the present invention is the synchronization of sign language frames with program information. A convenient and feasible method is to insert the start time and end time of each subtitle into the subtitle sequence; compared with methods of the "karaoke lyrics" type, this saves time and labor and is also more accurate. In fact, subtitle production is already part of the recording process of many programs and already includes the start time and end time of each sequence, so this problem is relatively easy to solve. The other aspect of synchronization is determined by the characteristics of sign language itself. Sign language is a body language that expresses meaning through movements of the hands and arms and changes of expression, and its pace differs considerably from that of speech, so mechanically superimposing the sign language frame sequence on the program video sequence would inevitably lead to inconsistency in meaning. The present invention therefore adopts a frame refresh strategy based on context content: the time interval between frames is determined by the degree of gesture change. When the change between two frames is large, the time interval between them is also large; conversely, if the movement between two frames changes little, the time between them should be small. In addition, smoothing is performed between frames with large changes, and an appropriate number of smoothing frames are inserted to make the action coherent.
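A sketch of this content-based interval rule follows; measuring gesture change as the Euclidean distance between consecutive 56-dimensional gesture vectors, and the constants baseMs and msPerUnit, are assumptions of the illustration:

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using GestureVector = std::array<float, 56>;

// Inter-frame display interval that grows with the amount of gesture
// change between two consecutive frames, per the strategy above.
double intervalMs(const GestureVector& a, const GestureVector& b,
                  double baseMs = 20.0, double msPerUnit = 5.0) {
    double sq = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sq += d * d;
    }
    return baseMs + msPerUnit * std::sqrt(sq);  // bigger change, longer gap
}
```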
[0045] The smoothness of the virtual human's gesture movement directly affects its intelligibility. What makes virtual human gesture movement special is that it is an animation sequence spliced together from meta-animation data, and there can be large differences in gesture between two adjacent sign language words, or between different root gestures of the same sign language word. Without smoothing, the span between some actions is too large and the motion too fast, producing visual blur. The solution is to interpolate frames for smoothing according to the size of the difference between the two actions. The inserted frames can be generated by applying the Hermite interpolation algorithm to the joint angle vector. The number of inserted frames depends on the gap between the two gestures: the larger the gap, the more frames are inserted; conversely, the smaller the gap, the fewer frames are inserted.
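A minimal sketch of the interpolation step follows. Cubic Hermite interpolation is applied per joint angle; the choice of zero endpoint tangents (giving an ease-in/ease-out transition) is an assumption, as the description specifies the Hermite algorithm but not the tangents:

```cpp
#include <array>
#include <cstddef>
#include <vector>

using GestureVector = std::array<float, 56>;

// Generate `insertedFrames` smoothing frames between gestures g0 and g1
// by cubic Hermite interpolation of each joint angle. With zero tangents
// only the basis functions h00 and h01 contribute.
std::vector<GestureVector> hermiteSmooth(const GestureVector& g0,
                                         const GestureVector& g1,
                                         int insertedFrames) {
    std::vector<GestureVector> frames;
    for (int k = 1; k <= insertedFrames; ++k) {
        double t = static_cast<double>(k) / (insertedFrames + 1);
        double h00 = 2 * t * t * t - 3 * t * t + 1;   // Hermite basis
        double h01 = -2 * t * t * t + 3 * t * t;
        GestureVector g;
        for (std::size_t i = 0; i < g.size(); ++i)
            g[i] = static_cast<float>(h00 * g0[i] + h01 * g1[i]);
        frames.push_back(g);
    }
    return frames;
}
```

The caller would choose `insertedFrames` in proportion to the distance between g0 and g1, matching the rule that a larger gap receives more frames.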
[0046] Sign language is a relatively stable expression system composed of gestures supplemented by facial expressions and body posture, so gestures alone inevitably produce incomplete expression. The present invention therefore generates not only the gesture actions of sign language but also facial expressions. The present invention adopts an MPEG-4-based facial animation method. MPEG-4 is an object-based multimedia compression standard; since people occupy a very important position in multimedia, MPEG-4 defines an international standard format for three-dimensional facial animation. MPEG-4 defines facial definition parameters (FDP) and facial animation parameters (FAP): the FDP define the shape, texture and other features of the face, while the FAP describe its movement. The FDP definition requires 84 facial feature points (FP), which describe the position and shape of the main parts of the face, including the eyes, eyebrows, mouth, tongue and teeth. MPEG-4 also includes 68 FAPs, among them two high-level FAPs: the viseme FAP and the expression FAP. For the viseme FAP, a set of basic, distinct lip shapes is predefined, and other lip shapes are obtained as linear combinations of these basic shapes. The expression FAP follows the same principle: several basic expressions can be linearly combined to produce a wide range of expressions. Apart from the high-level FAPs, each ordinary FAP defines the motion of a small region of the face. FAP values are expressed in facial animation parameter units (FAPU); the purpose of using the FAPU is that the same FAP parameters applied to different models produce the same lip movements and expressions, rather than lip movements and expressions that vary from model to model.
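The linear-combination principle of the two high-level FAPs can be illustrated as follows; representing each basic lip shape or expression as a vector of low-level FAP values (in FAPU) and blending with explicit weights is a simplification of the MPEG-4 mechanism, not its bitstream syntax:

```cpp
#include <cstddef>
#include <vector>

// Blend predefined basic shapes (each a vector of low-level FAP values,
// expressed in FAPU) into a new lip shape or expression by a weighted
// linear combination, as described for the viseme and expression FAPs.
std::vector<double> blendShapes(const std::vector<std::vector<double>>& basics,
                                const std::vector<double>& weights) {
    std::vector<double> out(basics.empty() ? 0 : basics[0].size(), 0.0);
    for (std::size_t b = 0; b < basics.size() && b < weights.size(); ++b)
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] += weights[b] * basics[b][i];
    return out;
}
```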
[0047] The generation of facial expressions involves the setting of the facial definition parameters (FDP). The present invention uses the Xface tool to set the FDP for the three-dimensional face model. After the influence regions and deformation functions are defined, for a given input FAP parameter stream, the displacement of each vertex of the 3D face model at a given moment can be calculated according to the MPEG-4 animation driving method, and the facial animation is finally drawn and rendered.
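The per-vertex displacement step can be sketched as below; the data layout of the influence regions and deformation weights (which the real system obtains from the Xface-based FDP setup) is assumed for illustration and is not an actual Xface API:

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// One FAP's influence region: affected mesh vertices, per-vertex
// deformation weights (samples of the deformation function), and a
// displacement direction. These structures stand in for FDP-derived data.
struct FapInfluence {
    std::vector<std::size_t> vertices;
    std::vector<float> weights;
    Vec3 direction;
};

// Displace each affected vertex by the FAP value (in FAPU) scaled by its
// deformation weight, along the FAP's direction, per the MPEG-4-style
// animation driving described above.
void applyFap(std::vector<Vec3>& mesh, const FapInfluence& inf, float fapValue) {
    for (std::size_t k = 0; k < inf.vertices.size(); ++k) {
        Vec3& v = mesh[inf.vertices[k]];
        float s = fapValue * inf.weights[k];
        v.x += s * inf.direction.x;
        v.y += s * inf.direction.y;
        v.z += s * inf.direction.z;
    }
}
```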
[0048] The generation of facial expressions also includes the extraction of facial animation parameters (FAP). In order to drive the 3D virtual human with natural expressions, it is necessary to obtain the FAP parameters of the basic expressions, such as happiness, sadness, anger, fear, disgust and surprise; in theory, all facial expressions can be synthesized from these basic expressions.
[0049] Through the setting of the facial definition parameters and facial animation parameters, combined with the sign language data, an expression suited to the current gesture can be selected, which further enhances the accuracy of the expressed meaning.
[0050] In addition, in the video overlay part, a video overlay algorithm is implemented based on the RGB values of the pixels. The overlay process can be described as follows: scan the main video image and position the pointer at the location to be overlaid; scan the pixels of the overlay image one by one; if a pixel is a background pixel (black is used as the background), skip it, otherwise replace the corresponding pixel at the preset position in the main video with this pixel value; continue until the entire image has been scanned. Repeating this overlay process for each image in the video achieves real-time video overlay.
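A direct rendering of this procedure in C++ follows; the packed RGB frame layout and the absence of bounds checking are simplifications of the sketch:

```cpp
#include <cstdint>
#include <vector>

struct Rgb { std::uint8_t r, g, b; };

// Copy every non-background pixel of the sign language frame onto the
// main video frame at offset (dstX, dstY). Pure black (0,0,0) is the
// background key, as in the description above. The caller must ensure
// the overlay fits inside the main frame.
void overlayFrame(std::vector<Rgb>& mainFrame, int mainW,
                  const std::vector<Rgb>& signFrame, int signW, int signH,
                  int dstX, int dstY) {
    for (int y = 0; y < signH; ++y) {
        for (int x = 0; x < signW; ++x) {
            const Rgb& p = signFrame[y * signW + x];
            if (p.r == 0 && p.g == 0 && p.b == 0) continue;  // background
            mainFrame[(dstY + y) * mainW + (dstX + x)] = p;  // replace
        }
    }
}
```

Repeating overlayFrame for each decoded video frame yields the real-time overlay described above.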
[0051] The present invention modularizes the sign language system and packages it as middleware, which facilitates porting and allows it to run on different system platforms. Considering the rendering performance of different hardware platforms, the present invention also adapts to the hardware: when hardware performance is low, the number of triangular facets representing the virtual human is reduced appropriately, sacrificing image quality for speed; conversely, when the platform hardware allows, the number of triangular facets can be increased to obtain higher image quality.
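One possible form of this adaptation is a simple level-of-detail policy; the frame-rate thresholds and triangle budgets below are invented for the sketch, not values given by the description:

```cpp
// Choose a triangle budget for the virtual human mesh from a measured
// rendering frame rate: trade image quality for speed on slow hardware,
// and spend the headroom on quality when the platform allows.
int chooseTriangleBudget(double measuredFps) {
    if (measuredFps < 15.0) return 2000;   // low-end: favor speed
    if (measuredFps < 30.0) return 8000;   // mid-range: balanced
    return 20000;                          // high-end: favor quality
}
```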
[0052] In summary, the present invention generates a sign language frame sequence from the subtitle text and superimposes it on the program video sequence. The generation of the sign language frame sequence considers not only gestures but also facial expressions, making the sign language expression more accurate and rich. Appropriate smoothing is applied to the frame sequence so that frames with large differences in action transition smoothly; at the same time, the correlation within the gesture vector is exploited for simplification, and the number of facets can be adjusted to improve running efficiency. The modular design and middleware packaging facilitate porting of the system.
[0053] The technical solution of the present invention brings the following beneficial effects:
[0054] 1) Compared with manual recording, the virtual human sign language system saves manpower and material resources and is accurate and standardized;
[0055] 2) Content-based smoothing makes the transitions between gestures natural, and the coordination of facial expressions with gestures makes the sign language expression more accurate and realistic;
[0056] 3) The number of triangular facets of the virtual human is adjusted intelligently according to platform performance, balancing image quality and running efficiency;
[0057] 4) The modular design and middleware packaging facilitate porting of the entire system.
[0058] The present invention adopts an MPEG-4-based facial animation method for generating facial expressions; other methods, such as interpolation, parameterization, free-form deformation, muscle models, elastic meshes and finite element methods, can also be tried, and each of these methods has its own advantages.
[0059] In addition, there are many ways to realize video overlay: besides overlay based on RGB values, overlay based on luminance values, alpha values, hue, etc. can also be used.
[0060] It should be noted that the information exchange and execution processes between the above-mentioned devices and units in the system are based on the same concept as the method embodiments of the present invention; for details, please refer to the descriptions in the method embodiments, which are not repeated here.
[0061] Those of ordinary skill in the art can understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium; the storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, etc.
[0062] A video virtual human sign language system for digital TV provided by the embodiments of the present invention has been described in detail above. Specific examples have been used to explain the principles and implementations of the present invention; the descriptions of the above embodiments are intended only to help understand the method of the present invention and its core idea. At the same time, those skilled in the art may, following the idea of the present invention, make changes to the specific implementation and scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.