Model generation methods, facial landmark detection methods, devices, and electronic equipment
By training a visual encoder with a pre-trained 3D deformable face model, prior knowledge is provided for the facial landmark detection model, which addresses the problem of low accuracy in facial landmark detection. Higher detection accuracy and model interpretability are achieved, especially under conditions of large facial expressions and poor lighting.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGDONG OPPO MOBILE TELECOMMUNICATIONS CORP LTD
- Filing Date
- 2022-06-15
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, the accuracy of facial landmark detection is low, especially under conditions of large facial expressions and poor lighting, and the interpretability of convolutional neural networks is limited.
The initial visual encoder is trained using training data and a pre-trained 3D deformable face model to generate a visual encoder that acquires 2D and 3D information of face images. The initial face key point detection model is then trained based on the visual encoder to provide prior knowledge and improve detection accuracy.
It improves the accuracy of facial landmark detection, especially under conditions of large facial expressions and poor lighting, and enhances the interpretability of the model.
Figure CN117274765B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, and more specifically, to a model generation method, a facial landmark detection method, an apparatus, and an electronic device.
Background Technology
[0002] With the development of computer vision and artificial intelligence technologies, facial landmark detection has become a research hotspot. One approach involves using convolutional neural networks to extract key point information from two-dimensional facial images. However, the accuracy of key point detection using these methods still needs improvement.
Summary of the Invention
[0003] In view of the above problems, this application proposes a model generation method, a facial landmark detection method, an apparatus, and an electronic device to address them.
[0004] In a first aspect, this application provides a model generation method, the method comprising: acquiring training data, the training data including face images; training an initial visual encoder based on the face images and a pre-trained three-dimensional deformable face model to obtain a visual encoder, wherein the three-dimensional deformable face model is used to acquire three-dimensional information corresponding to the face image, and the visual encoder is used to acquire two-dimensional information of the face image; and training an initial facial landmark detection model based on the face images and the visual encoder to obtain a facial landmark detection model.
[0005] In a second aspect, this application provides a facial landmark detection method, the method comprising: acquiring a face image to be processed; and inputting the face image into a facial landmark detection model obtained based on the above method to obtain the coordinates of the landmarks corresponding to the face image.
[0006] In a third aspect, this application provides a model generation apparatus, the apparatus comprising: a training data acquisition unit for acquiring training data; a visual encoder generation unit for training an initial visual encoder based on the training data and a pre-trained three-dimensional deformable face model to obtain a visual encoder, wherein the three-dimensional deformable face model is used to acquire three-dimensional information corresponding to the face image, and the visual encoder is used to acquire two-dimensional information of the face image; and a model generation unit for training an initial face key point detection model based on the training data and the visual encoder to obtain a face key point detection model.
[0007] In a fourth aspect, this application provides a facial key point detection device, the device comprising: a facial image acquisition unit for acquiring a facial image to be processed; and a detection result generation unit for inputting the facial image into a facial key point detection model obtained based on the above method to obtain the key point coordinates corresponding to the facial image.
[0008] In a fifth aspect, this application provides an electronic device including one or more processors and a memory; one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the methods described above.
[0009] In a sixth aspect, this application provides a computer-readable storage medium storing program code, wherein the above-described method is executed when the program code is run.
[0010] This application provides a model generation method, a facial landmark detection method, an apparatus, an electronic device, and a storage medium. After acquiring training data including facial images, an initial visual encoder is trained based on the facial images and a pre-trained 3D deformable face model to obtain a visual encoder. The 3D deformable face model is used to acquire the 3D information corresponding to the facial image, and the visual encoder is used to acquire the 2D information of the facial image. An initial facial landmark detection model is then trained based on the facial image and the visual encoder to obtain a facial landmark detection model. This method allows the initial visual encoder to be trained first using training data and a pre-trained 3D deformable face model, enabling it to learn both 2D and 3D information from facial images. Then, the initial facial landmark detection model is trained using the visual encoder to obtain a facial landmark detection model. This allows the visual encoder to provide prior knowledge to the initial facial landmark detection model, enabling it to learn the 2D and 3D information corresponding to the facial image, thus improving the accuracy of landmark detection.
Attached Figure Description
[0011] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0012] Figure 1 shows a flowchart of a model generation method proposed in an embodiment of this application;
[0013] Figure 2 shows a flowchart of one embodiment of S120 in Figure 1 of this application;
[0014] Figure 3 shows a schematic diagram of an initial visual encoder proposed in this application;
[0015] Figure 4 shows a schematic diagram of an initial visual encoder training process proposed in this application;
[0016] Figure 5 shows a flowchart of a model generation method according to another embodiment of this application;
[0017] Figure 6 shows a schematic diagram of an initial face landmark detection model proposed in this application;
[0018] Figure 7 shows a schematic diagram of the initial face landmark detection model training process proposed in this application;
[0019] Figure 8 shows a flowchart of a facial landmark detection method proposed in this application;
[0020] Figure 9 shows a structural block diagram of a model generation apparatus according to an embodiment of this application;
[0021] Figure 10 shows a structural block diagram of a face key point detection device according to an embodiment of this application;
[0022] Figure 11 shows a structural block diagram of an electronic device proposed in this application;
[0023] Figure 12 shows a storage unit in an embodiment of this application for storing or carrying program code that implements the model generation method and the face key point detection method according to the embodiments of this application.
Detailed Implementation
[0024] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.
[0025] With the advancement of technology, facial landmark detection has become a research hotspot. Facial landmark detection refers to the process of locating pre-defined facial landmarks with semantic information, such as key points in areas like eyebrows and eyes, from a given facial image. Based on the detection results, face-related applications such as face recognition, expression analysis, and face editing can be implemented.
[0026] However, the inventors discovered in their research that facial landmark detection still suffers from low accuracy. For example, while convolutional neural networks (CNNs) can extract landmarks from 2D face images, their interpretability is limited, and they are prone to overfitting with a limited number of training samples. Furthermore, while three rotation angles representing different head orientations can be introduced to assist CNNs in extracting overall head pose information, this method cannot extract features such as facial expressions and lighting conditions. The accuracy of landmark detection for samples with large facial expressions or poor lighting still needs improvement.
[0027] Therefore, the inventors have proposed a model generation method, a facial landmark detection method, an apparatus, and an electronic device in this application. After acquiring training data including a facial image, an initial visual encoder is trained based on the facial image and a pre-trained 3D deformable facial model to obtain a visual encoder. The 3D deformable facial model is used to acquire the 3D information corresponding to the facial image, and the visual encoder is used to acquire the 2D information of the facial image. An initial facial landmark detection model is then trained based on the facial image and the visual encoder to obtain a facial landmark detection model. This approach allows the initial visual encoder to be trained first using training data and a pre-trained 3D deformable facial model, enabling it to learn both 2D and 3D information from the facial image. Then, the initial facial landmark detection model is trained using the visual encoder to obtain a facial landmark detection model. This allows the visual encoder to provide prior knowledge to the initial facial landmark detection model, enabling it to learn the 2D and 3D information corresponding to the facial image through the visual encoder, thus improving the accuracy of landmark detection.
[0028] To better understand the solutions of the embodiments of this application, the technical terms used in the embodiments of this application will be explained below.
[0029] 3D Morphable Models (3DMMs) are models that can reconstruct human faces in three dimensions. By training a set of face image samples, a 3DMM model can find a set of orthonormal bases and linearly project the face onto the vector space composed of these orthonormal bases. This allows each face image to be linearly represented by these orthonormal bases, so that each 3D face can be obtained by linearly combining many other faces. Thus, the face reconstruction process is transformed into using a linear combination of samples in a face database to represent the face that best matches the input face image.
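The linear representation described above can be illustrated with a minimal sketch. It assumes the common 3DMM formulation in which a face shape is a mean shape plus a weighted sum of basis vectors; the function name, the use of a mean shape, and the toy values are illustrative assumptions and are not taken from this application.

```python
from typing import List

Vector = List[float]


def combine_face_shape(mean_shape: Vector, bases: List[Vector], coeffs: List[float]) -> Vector:
    """Return mean_shape + sum_i coeffs[i] * bases[i] (hypothetical helper)."""
    if len(bases) != len(coeffs):
        raise ValueError("one coefficient is required per basis vector")
    shape = list(mean_shape)
    for basis, coeff in zip(bases, coeffs):
        if len(basis) != len(shape):
            raise ValueError("basis length must match the mean shape length")
        for i, value in enumerate(basis):
            shape[i] += coeff * value
    return shape


# Toy example: a "face" made of a single 3D vertex (x, y, z)
# and two orthonormal basis vectors.
mean_shape = [0.0, 0.0, 1.0]
bases = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
face = combine_face_shape(mean_shape, bases, [0.5, -0.25])
```

In a real 3DMM the vectors would hold the coordinates of thousands of mesh vertices, and the coefficients would play the role of the identity and expression parameters discussed later in this description.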
[0030] 3D face mesh: refers to a mesh diagram representing the 3D topological structure of a face. The 3D topological structure can be obtained by triangulating a face image.
[0031] Facial landmark detection: This refers to the detection of key points and their locations on a person's facial features and contours. Key points can include facial contours, eyes, eyebrows, lips, and nose contours, etc.
[0032] MobileNet-V2 is a lightweight neural network that uses depthwise separable convolution. This means that the convolution operator is replaced with a decomposed convolution operator. MobileNet-V2 can parse the convolution into two independent layers: first, a depthwise convolution, which can perform lightweight filtering by applying different individual convolutional filters to different input channels; and then, a second layer, which can be a pointwise convolution with a kernel size of 1x1, which is mainly responsible for weighted summation of the input channels (that is, calculating the linear combination of the input channels) to construct a new feature map.
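To make the saving from the decomposition concrete, the following sketch counts the weights of a standard convolution and of a depthwise separable convolution (bias terms are ignored), and shows the pointwise step as a weighted sum over the input channels at a single pixel. The function names and the example sizes are illustrative assumptions.

```python
from typing import List


def standard_conv_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels     # 1x1 kernel that mixes the channels
    return depthwise + pointwise


def pointwise_conv_at_pixel(channel_values: List[float], weights: List[List[float]]) -> List[float]:
    """Apply a 1x1 convolution at one pixel: each output is a linear combination of the input channels."""
    return [sum(w * x for w, x in zip(row, channel_values)) for row in weights]


standard = standard_conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
mixed = pointwise_conv_at_pixel([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]])
```

For a 3x3 kernel with 32 input and 64 output channels, the separable form needs 2336 weights instead of 18432.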
[0033] The embodiments of this application will now be described with reference to the accompanying drawings.
[0034] Referring to Figure 1, this application provides a model generation method, the method comprising:
[0035] S110: Obtain training data, which includes face images.
[0036] In one approach, publicly available facial landmark detection datasets, such as WFLW, 300W, and COFW, can be used as training data.
[0037] As another approach, training data can be collected using image acquisition devices such as cameras.
[0038] S120: The initial visual encoder is trained based on the face image and the pre-trained three-dimensional deformable face model to obtain the visual encoder. The three-dimensional deformable face model is used to obtain the three-dimensional information corresponding to the face image, and the visual encoder is used to obtain the two-dimensional information of the face image.
[0039] The training data also includes labels corresponding to the face images, which can represent the real coordinates of key points included in the corresponding face images. Representation vectors can be used to represent information corresponding to the face images, including face identity information, facial expression information, face texture information, image lighting parameters, and image camera parameters.
[0040] In one approach, as shown in Figure 2, training the initial visual encoder based on the face image and a pre-trained 3D deformable face model to obtain the visual encoder includes:
[0041] S121: Input the face image into the initial visual encoder to obtain a first representation vector corresponding to the face image, the first representation vector being used to represent the information corresponding to the face image.
[0042] The face image can be a two-dimensional color image with dimensions H×W×C, where H, W, and C represent the height, width, and number of channels of the input face image, respectively. Since the face image is a color image, the number of channels can be 3. The three channels can represent the three primary colors of light, namely the R (Red) channel, the G (Green) channel, and the B (Blue) channel.
[0043] In the embodiments of this application, as shown in Figure 3, a visual encoder can include a feature extraction network and fully connected layers. The feature extraction network can be used to extract features from a face image, and can be a network such as ResNet18 or MobileNet. The fully connected layers can be used to reduce the dimensionality of the extracted features to output a first representation vector.
[0044] One approach is to first input the face image into a feature extraction network to obtain the features corresponding to the face image, and then input the features into a fully connected layer to obtain the first representation vector.
[0045] S122: Input the first representation vector into the pre-trained three-dimensional deformable face model to obtain a face triangle mesh.
[0046] The face triangular mesh can be used to extract key points in face images and to reconstruct face images.
[0047] In one approach, as shown in Figure 4, the first representation vector can be input into a pre-trained 3D deformable face model to obtain a face triangular mesh.
[0048] In this embodiment, the initial visual encoder is trained using a pre-trained 3DMM model, which enables visualization of the intermediate training results (3D face mesh), thereby enhancing the interpretability of the model.
[0049] S123: Based on the face triangular mesh and the first representation vector, a first prediction label and a reconstructed face image are obtained. The first prediction label represents the predicted coordinates of key points included in the face image obtained based on the three-dimensional deformable face model.
[0050] The first representation vector can include camera model parameters, face identity parameters, expression parameters, texture parameters, and lighting model parameters. Camera model parameters can be used to map 3D information onto a 2D plane; face identity parameters can be used to identify the true identity of a face in a face image; expression parameters can be used to reconstruct facial expression information in a face image; texture parameters can be used to reconstruct texture information in a face image; and lighting model parameters can be used to reconstruct lighting information in a face image. The first representation vector can be represented by the following formula:
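[0051] $$z = E(I) = (\alpha, \beta, \gamma, \theta, \phi) \in \mathbb{R}^{d}$$
[0052] where $I$ represents the input face image; $E$ represents the initial visual encoder; $z$ represents the first representation vector; $\alpha$, $\beta$, and $\gamma$ represent the face identity parameters, expression parameters, and texture parameters, respectively; $\theta$ represents the camera model parameters; $\phi$ represents the lighting model parameters; and $\mathbb{R}$ represents the set of real numbers, with $d$ denoting the total length of the vector.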
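As an illustration of how a flat representation vector can carry the five parameter groups listed above, the following sketch slices a vector into named groups. The group sizes, the group order, and the helper name are hypothetical; this application does not state the actual dimensions.

```python
from typing import Dict, List

# Hypothetical group sizes, chosen only for illustration.
GROUP_SIZES: Dict[str, int] = {
    "identity": 3,
    "expression": 2,
    "texture": 2,
    "camera": 3,
    "lighting": 2,
}


def split_representation(vector: List[float], group_sizes: Dict[str, int]) -> Dict[str, List[float]]:
    """Slice a flat representation vector into consecutive named parameter groups."""
    expected = sum(group_sizes.values())
    if len(vector) != expected:
        raise ValueError(f"expected a vector of length {expected}, got {len(vector)}")
    groups: Dict[str, List[float]] = {}
    start = 0
    for name, size in group_sizes.items():
        groups[name] = vector[start:start + size]
        start += size
    return groups


vector = [float(i) for i in range(12)]
params = split_representation(vector, GROUP_SIZES)
```

Each group can then be passed to the part of the pipeline that needs it, for example the camera group to the projection step and the lighting group to the image reconstruction step.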
[0053] In one approach, as shown in Figure 4, the first predicted label can be obtained based on the face triangle mesh and the camera model parameters; the reconstructed face image can be obtained based on the face triangle mesh, camera model parameters, face identity parameters, expression parameters, texture parameters, and lighting model parameters.
[0054] Optionally, the key points in the face triangle mesh can be mapped from three-dimensional space to a two-dimensional plane using camera model parameters to obtain the first predicted label.
[0055] Optionally, the camera model parameters can be used to map the face triangular mesh from three-dimensional space to a two-dimensional plane to obtain a two-dimensional face contour. Then, based on the face identity parameters, expression parameters, and texture parameters, the two-dimensional face contour can be reconstructed using the R, G, and B channels to obtain a face reconstruction image.
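The mapping of mesh key points from three-dimensional space to a two-dimensional plane can be sketched as follows. This application does not specify the camera model, so the sketch assumes a simple weak-perspective camera (a uniform scale plus a 2D translation, with rotation omitted); the function name, parameter names, and values are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_keypoints(
    vertices: Sequence[Point3D],
    keypoint_indices: Sequence[int],
    scale: float,
    tx: float,
    ty: float,
) -> List[Point2D]:
    """Project selected mesh vertices to 2D with an assumed weak-perspective camera."""
    projected: List[Point2D] = []
    for index in keypoint_indices:
        x, y, _z = vertices[index]  # depth is dropped by the weak-perspective model
        projected.append((scale * x + tx, scale * y + ty))
    return projected


# Toy mesh with three vertices; vertices 1 and 2 are treated as key points.
mesh_vertices = [(0.0, 0.0, 0.0), (1.0, 2.0, 5.0), (-1.0, 0.5, 2.0)]
first_predicted_label = project_keypoints(mesh_vertices, [1, 2], scale=2.0, tx=10.0, ty=20.0)
```

The resulting 2D coordinates play the role of the first predicted label and can be compared directly with the labeled key point coordinates.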
[0056] S124: The initial visual encoder is trained based on the face image, the label, the first predicted label, and the reconstructed face image to obtain a visual encoder.
[0057] In one approach, a first loss can be obtained based on the pixel values at all locations in the face image and the corresponding pixel values in the reconstructed face image. The first loss is used to reduce the difference between the pixels in the face image and the pixels in the reconstructed face image. A second loss is obtained based on the label and the first predicted label. The second loss is used to reduce the difference between the true coordinates and the predicted coordinates. Then, the initial visual encoder is trained based on the first loss and the second loss to obtain the visual encoder.
[0058] Since the face image and the reconstructed face image can be color images, they can be represented by a pixel matrix with three channels. The values in the pixel matrix can represent the pixel values at corresponding positions in the image, and the pixel values at all positions can refer to all the values included in the pixel matrix corresponding to the three channels.
[0059] Optionally, multiple face images can be used, each corresponding to a reconstructed face image. The pixel matrix of each face image and its corresponding reconstructed face image can be obtained. The pixel values of all locations in each face image and the corresponding pixel values in the reconstructed face image are then obtained from the pixel matrix. Finally, the pixel values of all locations in the face image and the corresponding pixel values in the reconstructed face image are input into a loss function to obtain the first loss. The formula for calculating the first loss is as follows:
[0060] $$\mathcal{L}_{\text{pix}} = \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} \ell\big(I_n(h, w, c),\, \hat{I}_n(h, w, c)\big)$$
[0061] where $\mathcal{L}_{\text{pix}}$ represents the first loss; $N$ represents the number of face images; $H$, $W$, and $C$ represent the height, width, and number of channels of a face image, respectively; $h$, $w$, and $c$ identify a position in a channel of an image; $I_n(h, w, c)$ represents the pixel value at position $(h, w, c)$ of the $n$-th face image; $\hat{I}_n(h, w, c)$ represents the pixel value at position $(h, w, c)$ of the $n$-th reconstructed face image; and $\ell$ can be the L1 loss function, L2 loss function, cross-entropy loss function, Smooth L1 loss function, Huber loss function, etc.
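A minimal sketch of a pixel-level loss of this kind is shown below. It assumes the element-wise loss is the L1 (absolute difference) loss and that the per-pixel terms are simply summed; dividing by the number of terms would only rescale the result. The names and toy values are illustrative.

```python
from typing import Callable, List

Image = List[List[List[float]]]  # indexed as image[h][w][c]


def l1(a: float, b: float) -> float:
    """Element-wise L1 loss."""
    return abs(a - b)


def pixel_loss(
    images: List[Image],
    reconstructions: List[Image],
    elementwise: Callable[[float, float], float] = l1,
) -> float:
    """Sum an element-wise loss over all images, positions, and channels."""
    total = 0.0
    for image, recon in zip(images, reconstructions):
        for row, recon_row in zip(image, recon):
            for pixel, recon_pixel in zip(row, recon_row):
                for value, recon_value in zip(pixel, recon_pixel):
                    total += elementwise(value, recon_value)
    return total


# One 1x2 RGB image and its reconstruction.
image = [[[0.0, 0.5, 1.0], [1.0, 1.0, 1.0]]]
recon = [[[0.0, 0.25, 1.0], [0.5, 1.0, 0.75]]]
first_loss = pixel_loss([image], [recon])
```

Passing a different `elementwise` function (for example a squared difference) switches the loss type without changing the summation.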
[0062] Optionally, there can be multiple face images, each corresponding to a reconstructed face image. Each face image and the reconstructed face image can include multiple key points. A second loss can be obtained based on the labels of the multiple face images and the first predicted labels of the corresponding reconstructed face images. The formula for calculating the second loss is as follows:
[0063] $$\mathcal{L}_{\text{kp}} = \sum_{n=1}^{N} \sum_{k=1}^{K} \ell\big(y_{n,k},\, \hat{y}_{n,k}\big)$$
[0064] where $\mathcal{L}_{\text{kp}}$ represents the second loss; $N$ represents the number of face images; $K$ represents the total number of facial landmarks in an image; $y_{n,k}$ represents the true coordinates of the $k$-th key point in the $n$-th face image; $\hat{y}_{n,k}$ represents the predicted coordinates of the $k$-th key point in the $n$-th reconstructed face image; and $\ell$ can be the L2 loss function, L1 loss function, cross-entropy loss function, Smooth L1 loss function, Huber loss function, etc.
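A key point coordinate loss of this kind can be sketched as follows, assuming a squared Euclidean (L2) distance summed over all key points of all images. The function name and toy coordinates are illustrative assumptions.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]


def keypoint_loss(labels: List[List[Point2D]], predictions: List[List[Point2D]]) -> float:
    """Sum the squared distance between true and predicted key points over all images."""
    total = 0.0
    for true_points, predicted_points in zip(labels, predictions):
        for (x, y), (px, py) in zip(true_points, predicted_points):
            total += (x - px) ** 2 + (y - py) ** 2
    return total


# One image with two key points.
labels = [[(10.0, 20.0), (30.0, 40.0)]]
predictions = [[(11.0, 20.0), (30.0, 38.0)]]
second_loss = keypoint_loss(labels, predictions)
```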
[0065] Optionally, the first loss and the second loss can be weighted and summed to obtain a joint loss. The initial visual encoder can then be trained based on this joint loss to obtain the final visual encoder. The formula for calculating the joint loss is as follows:
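[0066] $$\mathcal{L}_{\text{joint}} = \lambda_1 \mathcal{L}_{\text{pix}} + \lambda_2 \mathcal{L}_{\text{kp}}$$
[0067] where $\mathcal{L}_{\text{pix}}$ and $\mathcal{L}_{\text{kp}}$ represent the first loss and the second loss, and $\lambda_1$ and $\lambda_2$ represent the weighting coefficients of the first loss and the second loss, respectively, with $\lambda_1 + \lambda_2 = 1$. The values of $\lambda_1$ and $\lambda_2$ can be preset or obtained based on training.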
[0068] In this embodiment of the application, the pixel-level loss (the first loss) and the key point coordinate loss (the second loss) are fused, and the initial visual encoder is trained based on the fused loss, which improves the reliability of the resulting visual encoder.
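The weighted fusion of the two losses can be sketched in a few lines. The sketch takes the weight of the first loss as a parameter and derives the second weight so that the two weights sum to 1, as stated above; the function name and example values are illustrative.

```python
def joint_loss(first_loss: float, second_loss: float, first_weight: float) -> float:
    """Weighted sum of the two losses, with weights constrained to sum to 1."""
    if not 0.0 <= first_weight <= 1.0:
        raise ValueError("first_weight must lie in [0, 1]")
    second_weight = 1.0 - first_weight
    return first_weight * first_loss + second_weight * second_loss


# Example values: pixel-level loss 1.0, key point loss 5.0, weights 0.25 and 0.75.
loss = joint_loss(1.0, 5.0, first_weight=0.25)
```

Setting `first_weight` closer to 1 emphasizes image reconstruction quality, while a value closer to 0 emphasizes key point accuracy.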
[0069] S130: The initial facial landmark detection model is trained based on the face image and the visual encoder to obtain the facial landmark detection model.
[0070] One approach is to input the face image into a visual encoder and an initial facial landmark detection model, respectively, to obtain the corresponding output results of the visual encoder and the initial facial landmark detection model. The output results are then compared, and the initial facial landmark detection model is trained based on the comparison results to obtain a facial landmark detection model.
[0071] This embodiment provides a model generation method. After acquiring training data including face images, an initial visual encoder is trained based on the face images and a pre-trained 3D deformable face model to obtain a visual encoder. The 3D deformable face model is used to acquire the 3D information corresponding to the face image, and the visual encoder is used to acquire the 2D information of the face image. An initial facial landmark detection model is then trained based on the face image and the visual encoder to obtain a facial landmark detection model. This method allows the initial visual encoder to be trained using training data and a pre-trained 3D deformable face model, enabling it to learn both 2D and 3D information from the face image. The initial facial landmark detection model is then trained using the visual encoder, providing prior knowledge to the model. This allows the facial landmark detection model to learn the 2D and 3D information corresponding to the face image through the visual encoder, improving the accuracy of landmark detection.
[0072] Referring to Figure 5, this application provides a model generation method applied to an electronic device, the method comprising:
[0073] S210: Obtain training data, which includes face images.
[0074] S220: The initial visual encoder is trained based on the face image and the pre-trained three-dimensional deformable face model to obtain a visual encoder. The three-dimensional deformable face model is used to obtain the three-dimensional information corresponding to the face image, and the visual encoder is used to obtain the two-dimensional information of the face image.
[0075] S230: Input the face image into the visual encoder to obtain a second representation vector corresponding to the face image, the second representation vector being used to represent the information corresponding to the face image.
[0076] In one approach, a face image can be input into a visual encoder obtained in step S220 to obtain a second representation vector corresponding to the face image.
[0077] S240: Input the face image into the initial face key point detection model to obtain the detection result.
[0078] The detection results may include a predicted representation vector and a second predicted label corresponding to the face image. The second predicted label may represent the predicted coordinates of the key points included in the face image obtained based on the initial face key point detection model.
[0079] In the embodiments of this application, as shown in Figure 6, the initial facial landmark detection model can include a backbone network and a detection head. The backbone network can be used to extract features from the face image, and can be MobileNet-V2, EfficientNet, etc. Since the backbone network can be a cascaded structure, the output of each node can represent different features of the face image. To improve the accuracy of feature extraction in the initial facial landmark detection model, a detection head can be connected after the backbone network. The detection head can be used to fuse the features output by multiple nodes of the backbone network to obtain new features. These new features can contain richer semantic information, thereby improving the accuracy of the facial landmark detection model. The number of nodes selected by the detection head and the position of the selected nodes in the backbone network can be limited.
[0080] In one approach, the face image is input into the backbone network so that multiple nodes of the backbone network extract features from the face image, and the extracted features are then input into the detection head to obtain the detection results.
[0081] Optionally, as shown in Figure 7, the face image can be input into the initial face key point detection model and into the visual encoder obtained in step S220, respectively, to simultaneously obtain the predicted representation vector, the second predicted label, and the second representation vector corresponding to the face image.
[0082] S250: The initial facial landmark detection model is trained based on the label, the second representation vector, and the detection result to obtain the facial landmark detection model.
[0083] There can be multiple face images, each of which may include multiple key points. In one approach, a third loss is first obtained based on the second representation vector, the predicted representation vector, the label, and the second predicted label. The third loss is used to bring the second predicted label in the detection result close to the label, and to bring the predicted representation vector in the detection result close to the second representation vector. Then, the initial face key point detection model is trained based on the third loss to obtain the final face key point detection model. The formula for calculating the third loss is as follows:
[0084] $$\mathcal{L}_{\text{det}} = \sum_{n=1}^{N} \Big[ \sum_{k=1}^{K} \ell\big(y_{n,k},\, \tilde{y}_{n,k}\big) + \ell\big(z_n,\, \tilde{z}_n\big) \Big]$$
[0085] where $\mathcal{L}_{\text{det}}$ represents the third loss; $N$ represents the number of face images; $K$ represents the total number of facial landmarks in an image; $y_{n,k}$ represents the true coordinates of the $k$-th key point in the $n$-th face image; $\tilde{y}_{n,k}$ represents the predicted coordinates of the $k$-th key point in the $n$-th face image; $z_n$ represents the second representation vector output by the visual encoder for the $n$-th face image; $\tilde{z}_n$ represents the predicted representation vector output by the initial face landmark detection model for the $n$-th face image; and $\ell$ can be the L1 loss function, L2 loss function, cross-entropy loss function, Smooth L1 loss function, Huber loss function, etc.
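The third loss combines a key point term with a representation vector term. The following sketch assumes both terms use a squared distance and are added without extra weighting; this application does not state how the two terms are weighted, so that choice, along with all names and values, is an illustrative assumption.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def third_loss(
    labels: List[List[Sequence[float]]],
    predicted_labels: List[List[Sequence[float]]],
    encoder_vectors: List[Sequence[float]],
    predicted_vectors: List[Sequence[float]],
) -> float:
    """Key point error plus representation vector error, summed over all images."""
    total = 0.0
    for points, predicted_points, vector, predicted_vector in zip(
        labels, predicted_labels, encoder_vectors, predicted_vectors
    ):
        for point, predicted_point in zip(points, predicted_points):
            total += squared_distance(point, predicted_point)
        # The visual encoder output acts as a fixed target (prior knowledge).
        total += squared_distance(vector, predicted_vector)
    return total


# One image with one key point and a 2-element representation vector.
loss_value = third_loss(
    labels=[[(1.0, 2.0)]],
    predicted_labels=[[(1.0, 4.0)]],
    encoder_vectors=[[0.5, 0.0]],
    predicted_vectors=[[0.0, 0.0]],
)
```

In this arrangement the visual encoder behaves like a teacher whose representation vector the detection model is encouraged to reproduce, alongside the usual supervision from the labeled key points.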
[0086] This embodiment provides a model generation method that, through the aforementioned approach, first trains an initial visual encoder using training data and a pre-trained 3D deformable face model to obtain a visual encoder capable of learning 2D and 3D information of a face image. Then, the visual encoder is used to train an initial facial landmark detection model to obtain a facial landmark detection model. This allows the visual encoder to provide prior knowledge to the initial facial landmark detection model, enabling it to learn the corresponding 2D and 3D information of the face image, thus improving the accuracy of landmark detection. Furthermore, in this embodiment, during the training process of the initial facial landmark detection model, a second representation vector and labels can be used as prior knowledge (e.g., expression parameters, lighting model parameters, etc.). A third loss is used to enable the initial facial landmark detection model to learn based on this prior knowledge, thereby improving the accuracy of landmark detection for some difficult samples (e.g., samples with exaggerated expressions or poor lighting conditions).
[0087] Referring to Figure 8, this application provides a facial landmark detection method, the method comprising:
[0088] S310: Obtain the face image to be processed.
[0089] One approach is to acquire the face image to be processed using an image acquisition device (such as a camera).
[0090] S320: Input the face image into the face key point detection model obtained based on the above method to obtain the key point coordinates corresponding to the face image.
[0091] As one approach, the acquired face image can be input into a face landmark detection model obtained based on the above method to obtain the coordinates of the landmarks corresponding to the face image.
[0092] Optionally, after obtaining the coordinates of the key points corresponding to the face image, operations such as face recognition and 3D face reconstruction can be performed based on the obtained coordinates.
[0093] Optionally, the coordinates of key points corresponding to the face image can be obtained based on key point coordinate regression or heatmap regression.
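For the heatmap regression option, one common way to turn a predicted heatmap into key point coordinates is to take the location of the maximum response. The sketch below shows that decoding step for a single key point; the function name and the toy heatmap are illustrative, and this application does not state which decoding rule is used.

```python
from typing import List, Tuple


def decode_heatmap(heatmap: List[List[float]]) -> Tuple[int, int]:
    """Return the (x, y) position of the highest score in a 2D heatmap."""
    best_row, best_col, best_score = 0, 0, heatmap[0][0]
    for row_index, row in enumerate(heatmap):
        for col_index, score in enumerate(row):
            if score > best_score:
                best_row, best_col, best_score = row_index, col_index, score
    return (best_col, best_row)  # x is the column index, y is the row index


# Toy 3x3 heatmap for one key point; the peak is in the bottom-left cell.
heatmap = [
    [0.0, 0.1, 0.0],
    [0.2, 0.3, 0.1],
    [0.9, 0.3, 0.0],
]
x, y = decode_heatmap(heatmap)
```

With coordinate regression, by contrast, the model outputs the (x, y) values directly and no decoding step is needed.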
[0094] This embodiment provides a method for detecting facial landmarks. By using the above method, the facial image to be processed can be input into a facial landmark detection model trained based on a pre-trained 3D deformable face model and a visual encoder. This allows the facial landmark detection model to obtain the 2D and 3D information corresponding to the facial image, thereby improving the accuracy of landmark detection.
[0095] Referring to Figure 9, this application provides a model generation apparatus 600, the apparatus 600 comprising:
[0096] Training data acquisition unit 610 is used to acquire training data;
[0097] The visual encoder generation unit 620 is used to train the initial visual encoder based on the training data and the pre-trained three-dimensional deformable face model to obtain the visual encoder, wherein the three-dimensional deformable face model is used to obtain the three-dimensional information corresponding to the face image, and the visual encoder is used to obtain the two-dimensional information of the face image.
[0098] The model generation unit 630 is used to train the initial facial landmark detection model based on the training data and the visual encoder to obtain the facial landmark detection model.
[0099] In one approach, the training data also includes labels corresponding to the face image, the labels representing the true coordinates of key points included in the corresponding face image. The visual encoder generation unit 620 is specifically used to input the face image into the initial visual encoder to obtain a first representation vector corresponding to the face image, the first representation vector being used to represent the information corresponding to the face image; input the first representation vector into the pre-trained three-dimensional deformable face model to obtain a face triangle mesh; based on the face triangle mesh and the first representation vector, obtain a first predicted label and a reconstructed face image, the first predicted label representing the predicted coordinates of key points included in the face image obtained based on the three-dimensional deformable face model; and train the initial visual encoder based on the face image, the labels, the first predicted label, and the reconstructed face image to obtain a visual encoder.
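As a non-limiting illustration of how a representation vector can be turned into a face triangle mesh, the sketch below assumes the commonly used linear form of a deformable face model: a mean shape plus weighted identity and expression bases, with a fixed triangle list. This application does not specify the internal structure of its pre-trained model; the names `morphable_vertices`, `mean_shape`, `id_basis`, `exp_basis`, and `TRIANGLES` are hypothetical.

```python
def morphable_vertices(mean_shape, id_basis, exp_basis, id_params, exp_params):
    """Compute mesh vertices as mean + sum(a_i * id_basis_i) + sum(b_j * exp_basis_j).

    mean_shape: flat list of 3*V floats (x, y, z per vertex).
    id_basis, exp_basis: lists of basis vectors, each a flat list of 3*V floats.
    id_params, exp_params: coefficients taken from the representation vector.
    """
    flat = list(mean_shape)
    for coeffs, bases in ((id_params, id_basis), (exp_params, exp_basis)):
        for coeff, basis in zip(coeffs, bases):
            for k, value in enumerate(basis):
                flat[k] += coeff * value
    # Regroup the flat list into (x, y, z) vertices.
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


# Assumed fixed topology: the triangle list does not depend on the parameters.
TRIANGLES = [(0, 1, 2)]

mean_shape = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
id_basis = [[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]]   # shifts every vertex along x
exp_basis = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]]  # raises the third vertex

vertices = morphable_vertices(mean_shape, id_basis, exp_basis, [0.5], [2.0])
print(vertices)
```

Because only the vertex positions change while the triangle list stays fixed, gradients from later losses can flow back through the coefficients to the encoder that produced them.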
[0100] Optionally, the first representation vector includes camera model parameters, face identity parameters, expression parameters, texture parameters, and illumination model parameters, and the visual encoder generation unit 620 is specifically used to obtain the first predicted label based on the face triangle mesh and the camera model parameters; and to obtain the reconstructed face image based on the face triangle mesh, the camera model parameters, the face identity parameters, the expression parameters, the texture parameters, and the illumination model parameters.
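As a non-limiting illustration of obtaining predicted key point coordinates from a face triangle mesh and camera model parameters, the sketch below selects the mesh vertices that correspond to key points and projects them onto the image plane. It assumes a weak-perspective camera (scale, rotation about the vertical axis, and 2D translation) and a hypothetical `landmark_indices` list; this application does not state which camera model or vertex indices are used.

```python
import math


def project_landmarks(vertices, landmark_indices, scale, yaw, tx, ty):
    """Project selected 3D mesh vertices to 2D image coordinates.

    vertices: list of (x, y, z) tuples from the face triangle mesh.
    landmark_indices: hypothetical indices of the vertices that act as key points.
    scale, yaw, tx, ty: assumed weak-perspective camera parameters.
    """
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    points = []
    for idx in landmark_indices:
        x, y, z = vertices[idx]
        # Rotate about the vertical (y) axis; y itself is unchanged.
        x_rot = cos_y * x + sin_y * z
        # Weak perspective: drop depth, then scale and translate.
        points.append((scale * x_rot + tx, scale * y + ty))
    return points


mesh_vertices = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (0.0, 1.0, 1.0)]
predicted = project_landmarks(mesh_vertices, [1, 2], scale=10.0, yaw=0.0, tx=50.0, ty=60.0)
print(predicted)  # [(60.0, 80.0), (50.0, 70.0)]
```

Rendering the reconstructed face image additionally requires the texture and illumination parameters and a rasterization step, which are outside the scope of this sketch.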
[0101] Optionally, the visual encoder generation unit 620 is specifically used to obtain a first loss based on the pixel values at all locations in the face image and the pixel values at corresponding locations in the reconstructed face image, wherein the first loss is used to reduce the difference between the pixels in the face image and the pixels in the reconstructed face image; to obtain a second loss based on the label and the first predicted label, wherein the second loss is used to reduce the difference between the true coordinates and the predicted coordinates; and to train the initial visual encoder based on the first loss and the second loss to obtain a visual encoder.
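As a non-limiting illustration, the first loss and the second loss described above can be sketched as follows, assuming a mean absolute pixel difference for the first loss and a mean squared Euclidean distance for the second loss. This application does not specify the exact distance functions or how the two losses are combined; the weights `w_pixel` and `w_landmark` and the single-channel image layout are hypothetical.

```python
def photometric_loss(image, reconstructed):
    """First loss: mean absolute difference over all pixel locations."""
    total, count = 0.0, 0
    for row_a, row_b in zip(image, reconstructed):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def landmark_loss(true_points, predicted_points):
    """Second loss: mean squared distance between true and predicted key points."""
    total = 0.0
    for (tx, ty), (px, py) in zip(true_points, predicted_points):
        total += (tx - px) ** 2 + (ty - py) ** 2
    return total / len(true_points)


def encoder_loss(image, reconstructed, labels, predicted, w_pixel=1.0, w_landmark=1.0):
    """Weighted sum used to train the initial visual encoder (weights are assumed)."""
    return (w_pixel * photometric_loss(image, reconstructed)
            + w_landmark * landmark_loss(labels, predicted))


face = [[0.0, 1.0], [0.5, 0.5]]             # toy 2x2 single-channel face image
rendered = [[0.0, 0.5], [0.5, 1.0]]         # toy reconstructed face image
labels = [(1.0, 2.0), (3.0, 4.0)]           # true key point coordinates
first_predicted = [(1.0, 3.0), (3.0, 4.0)]  # first predicted label

print(encoder_loss(face, rendered, labels, first_predicted))  # 0.75
```

Both terms reach zero only when the reconstructed face image matches the input pixel by pixel and the projected key points coincide with the labels.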
[0102] In one approach, the training data also includes labels corresponding to the face image, the labels representing the true coordinates of key points included in the face image. The model generation unit 630 is specifically used to input the face image into the visual encoder to obtain a second representation vector corresponding to the face image, the second representation vector being used to represent the information corresponding to the face image; input the face image into the initial face key point detection model to obtain a detection result; and train the initial face key point detection model based on the labels, the second representation vector, and the detection result to obtain a face key point detection model.
[0103] Optionally, the detection result includes a predicted representation vector and a second predicted label corresponding to the face image. The second predicted label represents the predicted coordinates of key points included in the face image obtained based on the initial face key point detection model. The model generation unit 630 is specifically used to obtain a third loss function based on the second representation vector, the predicted representation vector, the label, and the second predicted label. The third loss function is used to make the second predicted label included in the detection result the same as the label and the predicted representation vector included in the detection result the same as the second representation vector. The initial face key point detection model is trained based on the third loss function to obtain a face key point detection model.
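As a non-limiting illustration, the third loss function described above combines two terms: one that pulls the second predicted label toward the label, and one that pulls the predicted representation vector toward the second representation vector produced by the visual encoder. The sketch below assumes squared-error terms and a hypothetical balancing weight `w_repr`; neither is specified in this application.

```python
def third_loss(labels, predicted_labels, encoder_vector, predicted_vector, w_repr=1.0):
    """Combine a key point term and a representation-matching term.

    labels / predicted_labels: lists of (x, y) key point coordinates.
    encoder_vector: second representation vector from the visual encoder.
    predicted_vector: predicted representation vector from the detection model.
    w_repr: assumed weight balancing the two terms.
    """
    coord_term = sum(
        (tx - px) ** 2 + (ty - py) ** 2
        for (tx, ty), (px, py) in zip(labels, predicted_labels)
    ) / len(labels)
    repr_term = sum(
        (e - p) ** 2 for e, p in zip(encoder_vector, predicted_vector)
    ) / len(encoder_vector)
    return coord_term + w_repr * repr_term


true_labels = [(0.0, 0.0), (2.0, 2.0)]
second_predicted = [(0.0, 1.0), (2.0, 2.0)]
second_vector = [1.0, 0.0, 2.0, 1.0]       # from the visual encoder
predicted_vector = [1.0, 1.0, 2.0, 1.0]    # from the detection model

print(third_loss(true_labels, second_predicted, second_vector, predicted_vector))  # 0.75
```

The representation-matching term is how the visual encoder passes its prior knowledge to the detection model during training; at inference time only the detection model is needed.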
[0104] Referring to Figure 10, this application provides a facial landmark detection apparatus 800, which runs on an electronic device. The apparatus 800 includes:
[0105] The face image acquisition unit 810 is used to acquire the face image to be processed;
[0106] The detection result generation unit 820 is used to input the face image into the face key point detection model obtained based on the above method to obtain the key point coordinates corresponding to the face image.
[0107] An electronic device provided by this application is described below with reference to Figure 10.
[0108] Referring to Figure 10, based on the aforementioned model generation method, facial landmark detection method, and apparatus, an embodiment of this application also provides an electronic device 100 capable of executing the aforementioned model generation method and facial landmark detection method. The electronic device 100 includes one or more (only one is shown in the figure) processors 102 and a memory 104 coupled to each other. The memory 104 stores programs capable of executing the contents of the aforementioned embodiments, and the processor 102 can execute the programs stored in the memory 104.
[0109] The processor 102 may include one or more processing cores. The processor 102 connects to various parts within the electronic device 100 using various interfaces and lines, and performs various functions and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 104, and by calling data stored in the memory 104. Optionally, the processor 102 may be implemented using at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 102 may integrate one or a combination of several of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the displayed content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 102 and may be implemented separately using a communication chip.
[0110] The memory 104 may include random access memory (RAM) or read-only memory (ROM). The memory 104 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 104 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as touch functionality, sound playback functionality, face image playback functionality, etc.), and instructions for implementing the various method embodiments described above. The data storage area may store data created by the electronic device 100 during use (such as phonebook data, audio and video data, chat log data, etc.).
[0111] Referring to Figure 11, a structural block diagram of a computer-readable storage medium provided in an embodiment of this application is shown. The computer-readable storage medium 1000 stores program code that can be called by a processor to execute the methods described in the above method embodiments.
[0112] The computer-readable storage medium 1000 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM, hard disk, or ROM. Optionally, the computer-readable storage medium 1000 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 1000 has storage space for program code 1010 that performs any of the method steps described above. This program code can be read from or written to one or more computer program products. The program code 1010 may, for example, be compressed in a suitable form.
[0113] In summary, the model generation method, facial landmark detection method, apparatus, and electronic device provided in this application, after acquiring training data including facial images, train an initial visual encoder based on the facial images and a pre-trained 3D deformable face model to obtain a visual encoder. The 3D deformable face model is used to acquire the 3D information corresponding to the facial image, and the visual encoder is used to acquire the 2D information of the facial image. Based on the facial image and the visual encoder, an initial facial landmark detection model is trained to obtain a facial landmark detection model. This approach allows the initial visual encoder to be trained first using training data and a pre-trained 3D deformable face model, enabling it to learn both 2D and 3D information from facial images. Then, the initial facial landmark detection model is trained using the visual encoder to obtain a facial landmark detection model. This allows the visual encoder to provide prior knowledge to the initial facial landmark detection model, enabling the model to learn the 2D and 3D information corresponding to the facial image, thus improving the accuracy of landmark detection.
[0114] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A model generation method, characterized in that the method comprises: acquiring training data, which includes face images and labels corresponding to the face images, wherein the labels represent the true coordinates of key points included in the corresponding face images; inputting the face image into an initial visual encoder to obtain a first representation vector corresponding to the face image, wherein the first representation vector is used to represent the information corresponding to the face image; inputting the first representation vector into a pre-trained three-dimensional deformable face model to obtain a face triangle mesh; obtaining, based on the face triangle mesh and the first representation vector, a first predicted label and a reconstructed face image, wherein the first predicted label represents the predicted coordinates of key points included in the face image obtained based on the three-dimensional deformable face model; training the initial visual encoder based on the face image, the label corresponding to the face image, the first predicted label, and the reconstructed face image to obtain a visual encoder, wherein the three-dimensional deformable face model is used to obtain the three-dimensional information corresponding to the face image, and the visual encoder is used to obtain the two-dimensional information corresponding to the face image; and training an initial facial landmark detection model based on the face image and the visual encoder to obtain a facial landmark detection model.
2. The method according to claim 1, characterized in that the first representation vector includes camera model parameters, face identity parameters, expression parameters, texture parameters, and illumination model parameters, and the step of obtaining the first predicted label and the reconstructed face image based on the face triangle mesh and the first representation vector includes: obtaining the first predicted label based on the face triangle mesh and the camera model parameters; and obtaining the reconstructed face image based on the face triangle mesh, the camera model parameters, the face identity parameters, the expression parameters, the texture parameters, and the illumination model parameters.
3. The method according to claim 1, characterized in that the step of training the initial visual encoder based on the face image, the label corresponding to the face image, the first predicted label, and the reconstructed face image to obtain a visual encoder includes: obtaining a first loss based on the pixel values at all locations in the face image and the pixel values at corresponding locations in the reconstructed face image, wherein the first loss is used to reduce the difference between the pixels of the face image and the pixels of the reconstructed face image; obtaining a second loss based on the label corresponding to the face image and the first predicted label, wherein the second loss is used to reduce the difference between the true coordinates and the predicted coordinates; and training the initial visual encoder based on the first loss and the second loss to obtain the visual encoder.
4. The method according to claim 1, characterized in that the step of training an initial facial landmark detection model based on the face image and the visual encoder to obtain a facial landmark detection model includes: inputting the face image into the visual encoder to obtain a second representation vector corresponding to the face image, wherein the second representation vector is used to represent the information corresponding to the face image; inputting the face image into the initial facial landmark detection model to obtain a detection result; and training the initial facial landmark detection model based on the label corresponding to the face image, the second representation vector, and the detection result to obtain the facial landmark detection model.
5. The method according to claim 4, characterized in that the detection result includes a predicted representation vector and a second predicted label corresponding to the face image, wherein the second predicted label represents the predicted coordinates of key points included in the face image obtained based on the initial facial landmark detection model, and the step of training the initial facial landmark detection model based on the label corresponding to the face image, the second representation vector, and the detection result to obtain the facial landmark detection model includes: obtaining a third loss function based on the second representation vector, the predicted representation vector, the label corresponding to the face image, and the second predicted label, wherein the third loss function is used to make the second predicted label included in the detection result the same as the label and the predicted representation vector included in the detection result the same as the second representation vector; and training the initial facial landmark detection model based on the third loss function to obtain the facial landmark detection model.
6. A method for detecting facial landmarks, characterized in that the method comprises: obtaining a face image to be processed; and inputting the face image into a facial landmark detection model obtained by the method of any one of claims 1-5 to obtain the key point coordinates corresponding to the face image.
7. A model generation apparatus, characterized in that the apparatus comprises: a training data acquisition unit, used to acquire training data, which includes a face image and a label corresponding to the face image, wherein the label represents the true coordinates of key points included in the corresponding face image; a visual encoder generation unit, configured to input the face image into an initial visual encoder to obtain a first representation vector corresponding to the face image, wherein the first representation vector is used to represent the information corresponding to the face image; input the first representation vector into a pre-trained three-dimensional deformable face model to obtain a face triangle mesh; based on the face triangle mesh and the first representation vector, obtain a first predicted label and a reconstructed face image, wherein the first predicted label represents the predicted coordinates of key points included in the face image obtained based on the three-dimensional deformable face model; and train the initial visual encoder based on the face image, the label corresponding to the face image, the first predicted label, and the reconstructed face image to obtain a visual encoder, wherein the three-dimensional deformable face model is used to acquire the three-dimensional information corresponding to the face image, and the visual encoder is used to acquire the two-dimensional information of the face image; and a model generation unit, used to train an initial facial landmark detection model based on the face image and the visual encoder to obtain a facial landmark detection model.
8. A facial landmark detection device, characterized in that the device comprises: a face image acquisition unit, used to acquire a face image to be processed; and a detection result generation unit, used to input the face image into a facial landmark detection model obtained by the method of any one of claims 1-5 to obtain the key point coordinates corresponding to the face image.
9. An electronic device, characterized in that it comprises one or more processors and a memory, wherein one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method of any one of claims 1-6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code, wherein the method of any one of claims 1-6 is executed when the program code is run.