Lip reading technology based lip language input method

An input method based on lip-reading technology, applied to user/computer interaction input/output, graphic reading, instruments, etc.; the effects achieved are strong practicability, good recognition accuracy, strong pertinence, and improved accuracy and speed.

Publication Date: 2013-05-08 (status: Inactive)
NANKAI UNIV
Cites: 4 · Cited by: 36

AI-Extracted Technical Summary

Problems solved by technology

[0009] Harbin Institute of Technology, the Institute of Acoustics of the Chinese Academy of Sciences, and other institutions in China are also committed to research on this topic, but they are still in the s...

Method used

For the normalized lip-color still picture obtained, OpenCV library functions are used to grayscale the picture and apply median filtering, and the Otsu method is then used to calculate the picture's binarization threshold. This threshold is used to binarize the smoothed grayscale image, so that the threshold is acquired adaptively. For the binarized image, each pixel is scanned to determine whether it is an isolated point; isolated points are removed during the scan, which denoises the binarized image well. The picture obtained through the above steps is the normalized binarized lip picture.
[0072] The feature vector obtained from the processed picture is matched with the template in t...

Abstract

The invention relates to a lip language input method based on lip-reading technology, aimed mainly at commonly used Chinese characters and Arabic numerals. It belongs to intelligent computer recognition technology, is a typical problem of image pattern analysis, understanding, and classification, and involves multiple disciplines such as pattern recognition, computer vision, intelligent human-computer interaction, and cognitive science. Key frames are extracted from the captured lip-movement video; the extracted images are normalized by grayscale processing, median filtering, dynamic-threshold binarization, and a scan that removes noise points; feature vectors are then extracted to obtain parameters characterizing the lips and matched against a lip model library so that the images are recognized as a Hanyu Pinyin letter sequence; finally, an input method module is combined to obtain the corresponding Chinese characters or Arabic numerals.

Application Domain

Technology Topic

Image mode, Mouth lips, +11 more

Image

  • Lip reading technology based lip language input method

Examples

  • Experimental program (1)

Example Embodiment

[0025] The lip language input system based on lip-reading technology and its implementation method are introduced below:
[0026] First, the system camera is used to locate the speaker's lips and to capture a lip-motion video containing only the speaker's lips; key-frame extraction is then used to obtain the key-frame images from the video stream.
[0027] For the normalized lip-color still image obtained, OpenCV library functions are used to grayscale the image and apply median filtering; the Otsu method is then used to calculate the binarization threshold of the image, and this threshold is used to binarize the smoothed grayscale image, so that the threshold is acquired adaptively. For the binarized image, each pixel is scanned to determine whether it is an isolated point, and isolated points are removed during the scan, which denoises the binarized image effectively. The image obtained through the above steps is the normalized binarized lip image.
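The isolated-point removal scan is not spelled out further in the text; the following is a minimal sketch of one way it could be implemented, assuming the binarized image is a 0/255 integer matrix and that an "isolated point" is a foreground pixel with no foreground pixel among its 8 neighbours (both assumptions, not stated in the patent).

```java
/**
 * Minimal sketch of the isolated-point removal scan described above.
 * Assumptions: the binarized image is a 0/255 int matrix, and a pixel is
 * "isolated" when none of its 8 neighbours also has the value 255.
 */
public final class IsolatedPointFilter {

    /** Returns a copy of the binary image with isolated foreground pixels cleared. */
    public static int[][] removeIsolatedPoints(int[][] binary) {
        int rows = binary.length, cols = binary[0].length;
        int[][] out = new int[rows][];
        for (int y = 0; y < rows; y++) {
            out[y] = binary[y].clone();
        }
        for (int y = 0; y < rows; y++) {
            for (int x = 0; x < cols; x++) {
                if (binary[y][x] != 255) continue;          // only foreground pixels can be isolated
                boolean hasNeighbour = false;
                for (int dy = -1; dy <= 1 && !hasNeighbour; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        if (dy == 0 && dx == 0) continue;
                        int ny = y + dy, nx = x + dx;
                        if (ny >= 0 && ny < rows && nx >= 0 && nx < cols
                                && binary[ny][nx] == 255) {
                            hasNeighbour = true;
                            break;
                        }
                    }
                }
                if (!hasNeighbour) out[y][x] = 0;           // remove the isolated point
            }
        }
        return out;
    }
}
```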
[0028] The Otsu method was proposed by the Japanese scholar Otsu in 1979. It is a method of adaptively determining the threshold, also known as the maximum inter-class variance method, or OTSU for short. It divides the image into two parts, the background and the target, according to the grayscale characteristics of the image. The greater the inter-class variance between the background and the target, the greater the difference between the two parts of the image. When part of the target is mistakenly classified as the background or part of the background is mistakenly classified as the target, the difference between the two parts will become smaller. Therefore, the segmentation that maximizes the variance between classes means that the probability of misclassification is minimized.
[0029] For an image Image (assuming the average gray level of the target is lower than that of the background), let T be the threshold (initialized to 1). Let the proportion of target pixels in the image be W0 with average gray level U0, and the proportion of background pixels be W1 with average gray level U1. The total average gray level of the image is then U = W0×U0 + W1×U1. Traverse T from the minimum gray value to the maximum gray value; the T that maximizes the between-class variance G = W0×(U0 − U)² + W1×(U1 − U)² is the optimal threshold.
[0030] Improvement to the calculation process: assume the image size is M×N (so the total number of pixels is M×N), the number of target pixels is N0, and the number of background pixels is N1. Then:
[0031] W0 = N0/(M×N) …(1)
[0032] W1 = N1/(M×N) …(2)
[0033] N0 + N1 = M×N …(3)
[0034] W0 + W1 = 1 …(4)
[0035] U = W0×U0 + W1×U1 …(5)
[0036] G = W0×(U0 − U)² + W1×(U1 − U)² …(6)
[0037] Substituting equation (5) into equation (6) gives the equivalent formula G = W0×W1×(U0 − U1)² …(7). Using formula (7) when writing the program eliminates a large number of calculation steps and improves program efficiency.
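As a minimal sketch of this procedure, the following method searches for the Otsu threshold over a 256-bin histogram using the simplified between-class variance of formula (7); the assumption that the input is an 8-bit grayscale image held in an int matrix is mine, not the patent's.

```java
/**
 * Minimal sketch of the Otsu threshold search using the simplified
 * between-class variance G = W0 * W1 * (U0 - U1)^2 from formula (7).
 * Assumes an 8-bit grayscale image given as an int matrix in [0, 255].
 */
public final class OtsuThreshold {

    public static int computeThreshold(int[][] gray) {
        // Build the 256-bin histogram.
        long[] hist = new long[256];
        long total = 0;
        for (int[] row : gray) {
            for (int v : row) {
                hist[v]++;
                total++;
            }
        }

        long sumAll = 0;                    // sum of (gray value * count) over the whole image
        for (int v = 0; v < 256; v++) sumAll += (long) v * hist[v];

        long n0 = 0;                        // pixels at or below the candidate threshold (target)
        long sum0 = 0;                      // gray-level sum of those pixels
        double bestG = -1.0;
        int bestT = 0;

        for (int t = 0; t < 256; t++) {
            n0 += hist[t];
            sum0 += (long) t * hist[t];
            long n1 = total - n0;           // background pixels
            if (n0 == 0 || n1 == 0) continue;

            double w0 = (double) n0 / total;            // formula (1)
            double w1 = (double) n1 / total;            // formula (2)
            double u0 = (double) sum0 / n0;             // mean gray level of the target
            double u1 = (double) (sumAll - sum0) / n1;  // mean gray level of the background
            double g = w0 * w1 * (u0 - u1) * (u0 - u1); // formula (7)

            if (g > bestG) {
                bestG = g;
                bestT = t;
            }
        }
        return bestT;
    }
}
```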
[0038] The template method is used to obtain the lip feature vector. This method abstracts the lip contour as a mathematical model and approximates the actual position of the lips with several curves or special point sets, so that the geometric shape features of the lips are expressed as a small set of parameters. Examples include the deformable template method (DTW), Active Shape Models (ASM), and Active Appearance Models (AAM); their advantage is that important features are represented as low-dimensional vectors and are generally invariant to translation, rotation, scaling, and illumination.
[0039] Since a static image is obtained after key-frame extraction, the double-lip mouth template is chosen to extract the lip features, as shown in Figure 2 of the appendix. The template is mainly composed of two kinds of curves, parabolas and quartic curves: the inner lip is described by two parabolas and the outer lip by quartic curves. Since a quartic curve can describe changes better than a quadratic curve, the description of the outer lip is more detailed than that of the inner lip and can accurately reflect its shape.
[0040] The double-lip mouth template consists of the two parabolas of the inner lip and the three quartic curves of the outer lip, with 12 parameters: (xc, yc), the center coordinates of the lips, which determine the position of the mouth coordinate system; θ, which determines the orientation of the mouth coordinate system; w1, the inner lip width; w0, the outer lip width; h1, the height of the upper edge of the outer lip; h2, the height of the upper edge of the inner lip; h3, the height of the lower edge of the inner lip; h4, the height of the lower edge of the outer lip; a_off, the offset of the center of the quartic curve from the coordinate origin; q0, the distance by which the quartic curve of the upper lip deviates from a parabola; and q1, the distance by which the quartic curve of the lower lip deviates from a parabola.
[0041] The mouth template is a parameterized physical model: a mouth shape described by multiple curves. The model parameters are adjusted by an optimization method that minimizes a cost function, so that the template gradually approaches the real position of the lip shape. The mouth-shape template is designed from empirical knowledge of people's everyday mouth shapes and is used to describe the shape of the lips, so it can be applied to most mouth shapes.
[0042] The template is mainly composed of two kinds of curves, parabolas and quartic curves: the inner lip is described by two parabolas and the outer lip by quartic curves. Model parameter description: the center coordinates (xc, yc) and the rotation angle θ determine the position and orientation of the mouth coordinate system, to which all lengths and dimensions are referenced, and the mouth is assumed to be symmetric about the ordinate. The outer edge of the upper lip is described by two quartic curves whose centers are offset from the coordinate origin by a_off and whose height is h1; they are forced to share the same parameters, which establishes the symmetry of the template. The parameter q0 of the quartic curve indicates how far the quartic curve deviates from a parabola, which allows the template to approximate the lip shape more precisely. The outer edge of the lower lip is also described by a quartic curve, with height h4; the variability provided by the auxiliary parameter q1 allows it to track the speaker's lower lip more accurately than a parabola. The upper and lower outer lips have the same width w0 and intersect at (−w0, 0) and (w0, 0). Two parabolas describe the inner edge of the mouth, with heights h2 and h3 and width w1, intersecting at (−w1, 0) and (w1, 0).
[0043] Thus, the mouth model can be determined by the parameters (xc, yc), θ, w0, w1, a_off, h1, h2, h3, h4, q0, q1, and the curve equations of its contour are as follows:
[0044] Y_ul = h1 × (1 − (x + a_off)²/(w0 − a_off)²) + 4q0 × ((x + a_off)⁴/(w0 − a_off)⁴ − (x + a_off)²/(w0 − a_off)²)
[0045] Y_ur = h1 × (1 − (x − a_off)²/(w0 − a_off)²) + 4q0 × ((x − a_off)⁴/(w0 − a_off)⁴ − (x − a_off)²/(w0 − a_off)²)
[0046] Y_ui = h2 × (1 − x²/w1²)
[0047] Y_li = −h3 × (1 − x²/w1²)
[0048] Y_l = −h4 × (1 − x²/w0²) − 4q1 × (x⁴/w0⁴ − x²/w0²)
[0049] Based on the lip template method, a 9-dimensional mouth-shape feature vector (w0, w1, a_off, h1, h2, h3, h4, q0, q1) is obtained. This shape feature is robust to translation and scaling of the image and can be used for the subsequent feature-vector recognition.
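The following is a minimal sketch that evaluates the contour curves above for one set of parameters; the class layout and the numeric parameter values in main() are illustrative assumptions, not values given in the patent.

```java
/**
 * Minimal sketch that evaluates the lip template contour curves for one
 * set of the 9 shape parameters (w0, w1, aOff, h1, h2, h3, h4, q0, q1).
 * The parameter values in main() are arbitrary illustrative numbers.
 */
public final class LipTemplate {

    final double w0, w1, aOff, h1, h2, h3, h4, q0, q1;

    LipTemplate(double w0, double w1, double aOff, double h1, double h2,
                double h3, double h4, double q0, double q1) {
        this.w0 = w0; this.w1 = w1; this.aOff = aOff;
        this.h1 = h1; this.h2 = h2; this.h3 = h3; this.h4 = h4;
        this.q0 = q0; this.q1 = q1;
    }

    /** Upper outer lip, left quartic curve (centre offset -aOff). */
    double yUpperLeft(double x) {
        double r = (x + aOff) / (w0 - aOff);
        return h1 * (1 - r * r) + 4 * q0 * (Math.pow(r, 4) - r * r);
    }

    /** Upper outer lip, right quartic curve (centre offset +aOff). */
    double yUpperRight(double x) {
        double r = (x - aOff) / (w0 - aOff);
        return h1 * (1 - r * r) + 4 * q0 * (Math.pow(r, 4) - r * r);
    }

    /** Upper inner lip parabola. */
    double yUpperInner(double x) { return h2 * (1 - x * x / (w1 * w1)); }

    /** Lower inner lip parabola. */
    double yLowerInner(double x) { return -h3 * (1 - x * x / (w1 * w1)); }

    /** Lower outer lip quartic curve. */
    double yLowerOuter(double x) {
        double r = x / w0;
        return -h4 * (1 - r * r) - 4 * q1 * (Math.pow(r, 4) - r * r);
    }

    public static void main(String[] args) {
        LipTemplate t = new LipTemplate(30, 22, 5, 12, 6, 7, 14, 2, 3);
        for (double x = -30; x <= 30; x += 10) {
            System.out.printf("x=%5.1f  outerLower=%7.2f  innerUpper=%7.2f%n",
                    x, t.yLowerOuter(x), t.yUpperInner(x));
        }
    }
}
```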
[0050] The mouth template library consists of multiple sets of standard Hanyu Pinyin pronunciation mouth shapes and the feature vectors extracted from them by the template method, covering all initials and finals in Hanyu Pinyin; they are recorded by specific experimenters to ensure the accuracy of later matching. In application, the user only needs to record one or more sets of standard mouth shapes, and the system automatically extracts the feature vectors to use as matching templates.
[0051] The construction of the template library is divided into the following steps:
[0052] A. For a specific person, obtain lip-movement video material of his or her pronunciation. In actual application, pictures are generally captured as stills while the lips are moving; therefore, when establishing the mouth template library, key frames are extracted from the video and the most suitable sets of pictures are selected as the preliminary template images of the mouth template library.
[0053] B. Normalize the preliminary template pictures. The mouth template library uses template images of a specific size, so the preliminary template images are cropped reasonably to obtain normalized images of a fixed pixel size.
[0054] C. With the first-stage task of the mouth-shape template library completed, the second-stage processing is performed on all template images in the library (see the feature vector extraction described above for the specific operations) to obtain a complete mouth-shape template library.
[0055] After the template library is built, cluster analysis must be performed on the images and their parameters to facilitate later matching. In the early stage we used fewer syllables, so we first divided them into two groups, open-mouth and closed-mouth sounds, then selected appropriate parameters from those extracted, and further classified the syllables according to a certain algorithm.
[0056] The clustering method chosen is the K-means algorithm. After the feature points are extracted, the data are processed according to the distances between them. K-means is a typical distance-based clustering algorithm that uses distance as the measure of similarity: the closer two objects are, the greater their similarity. The algorithm considers a cluster to be composed of objects that are close to one another, so compact and well-separated clusters are its ultimate goal.
[0057] The algorithm process is as follows:
[0058] (1) arbitrarily select k objects from n data objects as initial clustering centers;
[0059] (2) According to the mean (center object) of each cluster, calculate the distance from every object to these center objects;
[0060] and reassign each object according to the minimum distance;
[0061] (3) Recalculate the mean (center object) of each (changed) cluster;
[0062] (4) Loop (2) to (3) until each cluster no longer changes.
[0063] The details are as follows:
[0064] input: k, data[n];
[0065] (1) Select k initial center points, for example, c[0]=data[0], ... c[k-1]=data[k-1];
[0066] (2) For data[0] ... data[n-1], compare each with c[0] ... c[k-1]; if the difference from c[i] is the smallest, mark the point with label i;
[0067] (3) For all points marked i, recalculate c[i] = (sum of all data[j] marked i) / (number of points marked i);
[0068] (4) Repeat (2) and (3) until the change in every c[i] value is less than a given threshold.
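The following is a minimal sketch of the loop in steps (1)-(4), applied to mouth-shape feature vectors with k = 2; the use of Euclidean distance, the stopping threshold, and the tiny made-up vectors in main() are illustrative assumptions rather than values from the patent.

```java
import java.util.Arrays;

/**
 * Minimal sketch of the K-means loop in steps (1)-(4) above, applied to
 * mouth-shape feature vectors with k = 2 (open-mouth vs. closed-mouth
 * sounds). Euclidean distance is assumed as the distance measure.
 */
public final class KMeansSketch {

    public static int[] cluster(double[][] data, int k, double eps) {
        int n = data.length, dim = data[0].length;
        // (1) take the first k vectors as the initial centres: c[i] = data[i]
        double[][] centers = new double[k][];
        for (int i = 0; i < k; i++) centers[i] = data[i].clone();
        int[] label = new int[n];

        while (true) {
            // (2) assign every vector to the nearest centre
            for (int j = 0; j < n; j++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int i = 0; i < k; i++) {
                    double d = squaredDistance(data[j], centers[i]);
                    if (d < bestDist) { bestDist = d; best = i; }
                }
                label[j] = best;
            }
            // (3) recompute each centre as the mean of its members
            double[][] next = new double[k][dim];
            int[] count = new int[k];
            for (int j = 0; j < n; j++) {
                count[label[j]]++;
                for (int d = 0; d < dim; d++) next[label[j]][d] += data[j][d];
            }
            for (int i = 0; i < k; i++) {
                if (count[i] == 0) { next[i] = centers[i].clone(); continue; } // keep an empty cluster's centre
                for (int d = 0; d < dim; d++) next[i][d] /= count[i];
            }
            // (4) stop when every centre moves less than the threshold
            double maxShift = 0;
            for (int i = 0; i < k; i++) {
                maxShift = Math.max(maxShift, Math.sqrt(squaredDistance(centers[i], next[i])));
            }
            centers = next;
            if (maxShift < eps) return label;
        }
    }

    private static double squaredDistance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    public static void main(String[] args) {
        // Tiny made-up 9-dimensional feature vectors, ordered open, closed, open, closed
        // so that the two initial centres come from different mouth-shape groups.
        double[][] data = {
            {30, 22, 5, 12, 9, 10, 14, 2, 3},   // "open" shape
            {28, 20, 4,  8, 2,  2, 10, 1, 1},   // "closed" shape
            {31, 23, 5, 13, 9, 11, 15, 2, 3},   // "open" shape
            {27, 19, 4,  7, 2,  3,  9, 1, 1}    // "closed" shape
        };
        System.out.println(Arrays.toString(cluster(data, 2, 1e-6)));
    }
}
```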
[0069] It can be seen from the algorithm that the key to its success lies in the selection of the k initial centers and of the number of cluster centers expected in the end. Here the initial number of centers is k = 2, and the feature vectors of any one open-mouth sound and any one closed-mouth sound are taken as the two cluster centers, in the hope that the final result falls into two categories: one for open-mouth sounds (a and o) and one for closed-mouth sounds (e, i, u, ü).
[0070] When implementing the K-means algorithm, this project first used the off-the-shelf data mining software weka in combination with programming to obtain clustering results. Taking advantage of weka's convenient data handling and powerful functions, the corresponding lengths w0, w1, h0, h1, h2, h3, h4, ... were entered into weka for preliminary clustering; the value of K was varied to observe the clustering results, and a suitable k value was obtained initially.
[0071] According to the k value obtained with weka, a VC program is used for further clustering. The values of the initial k center points can be determined in VC instead of being selected randomly, which avoids choosing k parameter vectors that correspond to the same mouth shape. In the VC program it is also necessary to choose the best input arrangement, that is, how to feed the parameter vectors obtained from all template library pictures into the K-means algorithm most reasonably: the vectors are arranged in the correct order, and the next vector to be put into the calculation is taken according to an interval i. Specifically, 6 syllables with 20 pictures each are arranged in order; one picture from each of the two types is first selected as a cluster center for K = 2, then the interval is chosen and the pictures are input one at a time, and finally all 120 image parameter vectors have been input and the final clustering result is obtained.
[0072] The feature vector obtained from the processed image is matched against the templates in the Hanyu Pinyin lip-shape template library to obtain the Pinyin letter represented by the lip shape. When matching, the variance of the two sets of vectors and the correlation coefficient between them are calculated, and the two results are combined to obtain the optimal match; the pronunciation of the letter represented by that set of vectors is then taken as the output of the feature-vector matching part. Finally, the Pinyin letter sequence of the sentence is obtained and converted into Chinese by the input method. It should be noted that, in the feature extraction module, the present invention extracts the feature parameters of the inner lip in order to improve the accuracy of the subsequent process.
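The text does not specify how the variance and the correlation coefficient are combined, so the sketch below uses an illustrative rule: the "variance" is interpreted as the variance of the element-wise differences between the query and template vectors, the Pearson correlation coefficient is computed, and the two are combined with an assumed weight LAMBDA; the template vectors in main() are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch of the feature-vector matching step. The combination of
 * the difference variance and the correlation coefficient (weight LAMBDA)
 * is an assumption; the patent only states that both are computed and
 * combined.
 */
public final class TemplateMatcher {

    private static final double LAMBDA = 0.01;   // assumed weight for the variance term

    /** Variance of the element-wise differences between two vectors. */
    static double differenceVariance(double[] a, double[] b) {
        int n = a.length;
        double mean = 0;
        for (int i = 0; i < n; i++) mean += (a[i] - b[i]);
        mean /= n;
        double var = 0;
        for (int i = 0; i < n; i++) {
            double d = (a[i] - b[i]) - mean;
            var += d * d;
        }
        return var / n;
    }

    /** Pearson correlation coefficient between two vectors. */
    static double correlation(double[] a, double[] b) {
        int n = a.length;
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
        ma /= n; mb /= n;
        double cov = 0, va = 0, vb = 0;
        for (int i = 0; i < n; i++) {
            cov += (a[i] - ma) * (b[i] - mb);
            va  += (a[i] - ma) * (a[i] - ma);
            vb  += (b[i] - mb) * (b[i] - mb);
        }
        return cov / Math.sqrt(va * vb);
    }

    /** Returns the label of the template that best matches the query vector. */
    static String bestMatch(double[] query, Map<String, double[]> templates) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : templates.entrySet()) {
            // Higher correlation and lower difference variance both favour the template.
            double score = correlation(query, e.getValue())
                         - LAMBDA * differenceVariance(query, e.getValue());
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> lib = new LinkedHashMap<>();
        lib.put("a", new double[]{30, 22, 5, 12, 9, 10, 14, 2, 3});   // made-up template vectors
        lib.put("i", new double[]{28, 20, 4,  7, 2,  2,  9, 1, 1});
        System.out.println(bestMatch(new double[]{29, 21, 5, 11, 8, 9, 13, 2, 3}, lib));
    }
}
```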
[0073] The text output module obtains the Pinyin letter sequence and submits it to an existing input method for intelligent matching; with the user's assistance, the word, sentence, or Arabic numeral that best matches the Pinyin letter sequence is obtained and output.
[0074] In the early stage of our experimental system, an Android emulator running on a personal computer was used, with the Eclipse development tool (from IBM) loading the Android 2.2 system plug-in to simulate the environment. Later, an HTC G8 mobile phone was used as the experimental platform. The phone runs the Google Android 2.1 operating system with a built-in Qualcomm MSM7225 processor at 528 MHz, 512 MB of ROM, and 384 MB of RAM; it has a built-in 5-megapixel camera and a screen resolution of 240 × 320 pixels.
[0075] The lip language input system on the Android platform consists of the following parts: an image acquisition unit, an image preprocessing unit, a feature extraction unit, a mouth template library, and a lip language recognition unit. The image acquisition unit uses Android system calls to capture a still image containing the lips from the camera and submits it to the image preprocessing unit for dynamic threshold extraction, binarization, filtering, and denoising to obtain an image pixel matrix; the feature parameter extraction unit extracts the parameters from this pixel matrix; finally, the feature matching unit uses the variance method to match against the mouth template library and obtain the result.
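As a minimal structural sketch of how these units could be wired together, the interfaces and class below are illustrative assumptions; none of the names come from the patent or from the Android API.

```java
/**
 * Minimal structural sketch of the processing pipeline described above.
 * All interface and method names are illustrative assumptions.
 */
public final class LipInputPipeline {

    interface ImageAcquisitionUnit   { int[][] captureStillImage(); }               // grayscale pixel matrix
    interface ImagePreprocessingUnit { int[][] binarizeAndDenoise(int[][] gray); }   // thresholding + denoising
    interface FeatureExtractionUnit  { double[] extractShapeVector(int[][] binary); }// 9-dim template parameters
    interface LipRecognitionUnit     { String matchAgainstTemplateLibrary(double[] v); } // Pinyin letter

    private final ImageAcquisitionUnit camera;
    private final ImagePreprocessingUnit preprocessor;
    private final FeatureExtractionUnit extractor;
    private final LipRecognitionUnit recognizer;

    LipInputPipeline(ImageAcquisitionUnit c, ImagePreprocessingUnit p,
                     FeatureExtractionUnit e, LipRecognitionUnit r) {
        camera = c; preprocessor = p; extractor = e; recognizer = r;
    }

    /** Runs one capture-to-Pinyin pass through the pipeline. */
    String recognizeOnce() {
        int[][] still  = camera.captureStillImage();
        int[][] binary = preprocessor.binarizeAndDenoise(still);
        double[] shape = extractor.extractShapeVector(binary);
        return recognizer.matchAgainstTemplateLibrary(shape);
    }
}
```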
[0076] The specific implementation of the lip language input system on the Android platform is as described above; the image acquisition unit is described specifically here:
[0077] On the Android platform, the interface for obtaining photos through the camera is implemented by the Camera class. In order to display the camera preview on the screen, a SurfaceHolder control is needed, and the SurfaceHolder instance is passed as the parameter when the Camera instance calls the setPreviewDisplay method. When initializing an instance of the SurfaceHolder class, three interface functions must be implemented: surfaceChanged, surfaceCreated, and surfaceDestroyed.
[0078] The Camera class includes the takePicture method. By passing a jpegCallback function reference in its parameter list, the bitmap stream of the image captured by the camera can be read in the body of the jpegCallback; taking it as a parameter and calling the decodeByteArray method of the BitmapFactory class yields the captured image.
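A minimal sketch of the image acquisition unit using the legacy android.hardware.Camera API named in [0077]-[0078] follows; activity lifecycle wiring, permissions, and error handling are omitted, and the class name is an assumption.

```java
import android.graphics.Bitmap;
import android.graphics.BitmapFactory;
import android.hardware.Camera;
import android.view.SurfaceHolder;
import android.view.SurfaceView;

/**
 * Minimal sketch of the image acquisition unit using the legacy
 * android.hardware.Camera API (deprecated in later Android versions).
 */
public class LipCapture implements SurfaceHolder.Callback {

    private Camera camera;

    public LipCapture(SurfaceView preview) {
        // The SurfaceHolder delivers the three lifecycle callbacks below.
        preview.getHolder().addCallback(this);
    }

    @Override
    public void surfaceCreated(SurfaceHolder holder) {
        try {
            camera = Camera.open();
            camera.setPreviewDisplay(holder);   // show the preview on the SurfaceView
            camera.startPreview();
        } catch (java.io.IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void surfaceChanged(SurfaceHolder holder, int format, int width, int height) {
        // Preview size changes could be handled here; omitted in this sketch.
    }

    @Override
    public void surfaceDestroyed(SurfaceHolder holder) {
        if (camera != null) {
            camera.stopPreview();
            camera.release();
            camera = null;
        }
    }

    /** Takes a photo and decodes the JPEG byte stream into a Bitmap. */
    public void capture() {
        camera.takePicture(null, null, new Camera.PictureCallback() {
            @Override
            public void onPictureTaken(byte[] data, Camera cam) {
                // JPEG byte stream -> Bitmap, as described in [0078].
                Bitmap photo = BitmapFactory.decodeByteArray(data, 0, data.length);
                // ... pass `photo` to the image preprocessing unit ...
                cam.startPreview();             // resume the preview after the shot
            }
        });
    }
}
```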


