Expression driving method, device, equipment and storage medium
By extracting features from the unoccluded areas of an occluded image and determining facial deformation parameters, the problem of facial occlusion affecting the expression driving of virtual characters is solved, and accurate expression driving is achieved under occlusion conditions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING CENTURY TAL EDUCATION TECH CO LTD
- Filing Date
- 2022-11-18
- Publication Date
- 2026-06-19
AI Technical Summary
The issue of user face occlusion affects facial feature analysis, resulting in inaccurate expression-driven effects for virtual characters.
Facial features of the unoccluded areas are extracted from occluded images, facial deformation parameters are determined based on these features, and virtual character expressions are driven.
When the face is obscured, the system focuses on the features of the unobscured areas to accurately drive the virtual character's expressions, ensuring the effectiveness of the expression-driven function.
Figure CN115937933B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer vision technology, and in particular to an expression driving method, apparatus, device, and storage medium.
Background Technology
[0002] In various scenarios such as mobile games and virtual reality, it is often necessary to drive the virtual character's expression based on the user's facial expressions to achieve expression transfer, so that the user can understand the interaction effect in mobile games, virtual reality and other scenarios by watching the virtual character's expression.
[0003] Current methods for driving virtual character expressions typically involve analyzing facial features in a user's facial image and then using the results to drive the virtual character's expressions. However, users often experience facial occlusion when playing mobile games or engaging in virtual reality experiences. This occlusion affects the facial feature analysis process, leading to inaccurate results and ultimately impacting the effectiveness of driving virtual character expressions.
Summary of the Invention
[0004] To address the aforementioned technical problems, this disclosure provides an expression driving method, apparatus, device, and storage medium.
[0005] In a first aspect, this disclosure provides an expression driving method, which includes:
[0006] acquiring a face occlusion image;
[0007] extracting facial features corresponding to the unoccluded area from the face occlusion image, and determining the facial deformation parameters corresponding to the face occlusion image based on the facial features corresponding to the unoccluded area; and
[0008] driving the expression of the target virtual character based on the facial deformation parameters.
[0009] In a second aspect, this disclosure provides an expression driving device, the device comprising:
[0010] The acquisition module is used to acquire images of faces that are occluded.
[0011] The determination module is used to extract facial features corresponding to the unoccluded area from the facial occlusion image, and determine the facial deformation parameters corresponding to the facial occlusion image based on the facial features corresponding to the unoccluded area.
[0012] The driving module is used to drive the facial expressions of the target virtual character based on the facial deformation parameters.
[0013] In a third aspect, embodiments of this disclosure also provide an electronic device, the device comprising:
[0014] a processor; and
[0015] a memory for storing executable instructions;
[0016] wherein the processor is used to read the executable instructions from the memory and execute them to implement the method provided in the first aspect above.
[0017] In a fourth aspect, embodiments of this disclosure also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the method provided in the first aspect above.
[0018] The technical solution provided in this disclosure has the following advantages compared with the prior art:
[0019] This disclosure provides an expression driving method, apparatus, device, and storage medium that acquires a face occlusion image; extracts facial features corresponding to the unoccluded area from the face occlusion image, and determines facial deformation parameters corresponding to the face occlusion image based on those features; and drives the expression of a target virtual character based on the facial deformation parameters. With this method, when the face is occluded, only the facial features corresponding to the unoccluded area of the face occlusion image need to be considered, and the facial deformation parameters for driving the expression of the target virtual character are obtained from those features. This avoids the influence of occluders in the facial image on the facial feature analysis process, so that the expression of the target virtual character is still driven well when the face is occluded.
Attached Figure Description
[0020] The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure.
[0021] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0022] Figure 1 is a flowchart of an expression driving method provided in an embodiment of this disclosure;
[0023] Figure 2 is a flowchart of another expression driving method provided in an embodiment of this disclosure;
[0024] Figure 3 is a schematic diagram of the model structure of an image processing model provided in an embodiment of this disclosure;
[0025] Figure 4 is a flowchart of yet another expression driving method provided in an embodiment of this disclosure;
[0026] Figure 5 is a logical diagram of a model training process provided in an embodiment of this disclosure;
[0027] Figure 6 is a schematic diagram of the structure of an expression driving device provided in an embodiment of this disclosure;
[0028] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this disclosure.
Detailed Implementation
[0029] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.
[0030] It should be understood that the steps described in the method embodiments of this disclosure may be performed in different orders and / or in parallel. Furthermore, the method embodiments may include additional steps and / or omit the steps shown. The scope of this disclosure is not limited in this respect.
[0031] The term "comprising" and its variations as used herein are open-ended, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below. It should be noted that the concepts of "first", "second", etc., used in this disclosure are only used to distinguish different devices, modules, or units, and are not intended to limit the order of functions performed by these devices, modules, or units or their interdependencies.
[0032] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".
[0033] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
[0034] With the rise of the metaverse, virtual characters have become a very popular research area. Meanwhile, the face, as the most expressive part of the human body, carries a great deal of information about facial movements, expressions, and postures. Therefore, using users' facial expressions to drive the expressions of virtual characters has become a key research topic. However, users' faces are often obscured, which affects the effectiveness of driving virtual character expressions.
[0035] To solve the above problems, the expression driving method provided in this disclosure is described below with reference to Figures 1 to 5. In this disclosure, the expression driving method can be executed by an electronic device or a server. The electronic device may include a mobile phone, tablet computer, desktop computer, laptop computer, or other device with communication capabilities. The server may be a cloud server or server cluster, or other device with storage and computing capabilities. It should be noted that the following embodiments use an electronic device as the executing entity for illustration.
[0036] Figure 1 shows a flowchart of an expression driving method provided in an embodiment of this disclosure.
[0037] As shown in Figure 1, the expression driving method may include the following steps.
[0038] S110, Obtain the face occlusion image.
[0039] In this embodiment, in various scenarios such as mobile games and virtual reality, when it is necessary to use the user's facial expressions to drive virtual characters, and the user's face is obscured, the electronic device can capture facial occlusion images in real time through image acquisition devices such as cameras.
[0040] In this context, facial occlusion images refer to the images to be processed for facial expression analysis to drive the expressions of virtual characters. Facial occlusion images can be single facial images or consecutive video frames.
[0041] Optionally, the face-occluded image can be an image with the cheeks covered, an image with the facial features covered, or an image with the forehead covered.
[0042] Optionally, the facial expression in the face-occluded image can be any one of the following: smiling, laughing, fear, surprise, etc.; the facial movement can be any one of the following: pouting, grinning, closing eyes, sticking out tongue, frowning, raising eyebrows, etc.; and the facial posture can be any one of the following: facing forward, facing sideways, shaking head, etc.
[0043] S120. Extract facial features corresponding to the unoccluded areas from the occluded image, and determine the facial deformation parameters corresponding to the occluded image based on the facial features corresponding to the unoccluded areas.
[0044] In this embodiment, the unoccluded area can be extracted from the face occlusion image first, then the features of the unoccluded area can be extracted to obtain the facial features corresponding to the unoccluded area, and then the parameters of the facial features corresponding to the unoccluded area can be calculated to obtain the facial deformation parameters.
[0045] The unoccluded area refers to the part of the face that remains after the occluder is removed. Specifically, the facial features corresponding to this unoccluded area can be deep facial semantic features, or a combination of deep facial semantic features and shallow facial features such as color, texture, edges, and angles.
[0046] Facial deformation parameters refer to the expression-driving parameters that drive the expressions of virtual characters. These parameters can couple facial shape and facial expressions.
[0047] In some embodiments, the process of extracting facial features corresponding to unoccluded regions can utilize a pre-trained image processing model to process the occluded facial image and obtain the facial features corresponding to the unoccluded regions. Specifically, the region extraction network of the pre-trained image processing model is used to extract the unoccluded regions from the occluded facial image, and then a feature extraction network is used to extract the corresponding facial features from the unoccluded regions.
[0048] In other embodiments, a preset segmentation algorithm can be used to segment the occluded facial image into occluded and unoccluded regions. Then, a feature extraction algorithm is used to extract features from the unoccluded regions to obtain the facial features corresponding to the unoccluded regions. Optionally, the preset segmentation algorithm may include, but is not limited to, threshold segmentation algorithms, watershed segmentation algorithms, etc. The feature extraction algorithm may include, but is not limited to, local binary pattern algorithms, first-order edge extraction algorithms, etc.
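The non-learned variant of [0048] can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the segmentation is a single global threshold that assumes the occluder is darker than skin, and the feature is a normalized histogram of 8-neighbour local binary pattern codes taken over unoccluded pixels only.

```python
import numpy as np

def threshold_unoccluded_mask(gray, threshold):
    """Threshold segmentation: True where a pixel is treated as unoccluded.

    Assumption: the occluder is darker than skin, so brighter pixels are kept.
    """
    return gray > threshold

def lbp_codes(gray):
    """8-neighbour local binary pattern code for every interior pixel."""
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.uint8)
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.uint8) * np.uint8(1 << bit)
    return codes

def masked_lbp_histogram(gray, unoccluded_mask):
    """Normalized 256-bin LBP histogram computed over unoccluded pixels only."""
    codes = lbp_codes(gray)
    keep = unoccluded_mask[1:-1, 1:-1]
    hist = np.bincount(codes[keep], minlength=256).astype(np.float64)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

The histogram ignores every pixel the mask marks as occluded, which is the point of the embodiment: the occluder never contributes to the feature vector.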
[0049] In some embodiments, the process of determining facial deformation parameters can utilize a parameter calculation network of a pre-trained image processing model to process facial features corresponding to the unoccluded areas and calculate facial deformation parameters.
[0050] In other embodiments, facial features corresponding to the unoccluded area are used as input data for a preset facial parameter calculation function, and facial deformation parameters are calculated using the preset facial parameter calculation function.
[0051] Therefore, for a face occlusion image, only the unoccluded area needs to be considered to obtain the facial deformation parameters that drive the virtual character's expression. This keeps the occluder from affecting the determination of the facial deformation parameters and also reduces the amount of computation.
[0052] S130. Based on facial deformation parameters, drive the facial expressions of the target virtual character.
[0053] Understandably, facial deformation parameters can be used as expression-driving parameters for the target virtual character, and these parameters can be used to drive the virtual character's expressions, thereby achieving expression transfer. When the facial occlusion image is a video frame image, different times correspond to different facial occlusion images. Therefore, the facial deformation parameters corresponding to each time moment are obtained, and the virtual character's expressions are driven based on these parameters, thus achieving a dynamic expression transfer effect.
[0054] The target virtual character can be any virtual object whose facial expression needs to be transferred.
[0055] Optionally, in this embodiment, S130 may specifically include the following steps:
[0056] The facial deformation parameters are used as driving coefficients for the character model corresponding to the target virtual character, and the facial expressions of the target virtual character are driven based on the driving coefficients.
[0057] Here, the character model can be a virtual character model.
[0058] Specifically, the blendshape parameter in the facial deformation parameters can be used as the driving coefficient of the character model corresponding to the target virtual character, so that the facial deformation parameters can be used to drive the expression of the target virtual character and complete the expression transfer process.
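A minimal sketch of using blendshape parameters as driving coefficients is given below. It assumes the character model is a linear blendshape rig (a neutral mesh plus one vertex-offset array per blendshape), which the text does not specify; `drive_expression` and the array layouts are hypothetical.

```python
import numpy as np

def drive_expression(neutral_vertices, blendshape_deltas, gamma):
    """Deform a character mesh with blendshape coefficients.

    neutral_vertices:  (V, 3) neutral-pose vertex positions
    blendshape_deltas: (K, V, 3) per-blendshape vertex offsets from neutral
    gamma:             (K,) driving coefficients for one frame
    """
    gamma = np.asarray(gamma, dtype=np.float64)
    # Weighted sum of the K offset arrays, added to the neutral mesh.
    return neutral_vertices + np.tensordot(gamma, blendshape_deltas, axes=1)

def drive_sequence(neutral_vertices, blendshape_deltas, gamma_per_frame):
    """Apply the per-frame coefficients of a video to obtain one mesh per frame."""
    return [drive_expression(neutral_vertices, blendshape_deltas, g)
            for g in gamma_per_frame]
```

For video input, `drive_sequence` corresponds to the dynamic expression transfer described in [0053]: each frame's coefficients produce one deformed mesh.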
[0059] This disclosure provides an expression driving method that acquires a face occlusion image; extracts facial features corresponding to the unoccluded area from the face occlusion image, and determines facial deformation parameters corresponding to the face occlusion image based on these features; and drives the expression of a target virtual character based on these facial deformation parameters. With this method, when the face is occluded, only the facial features corresponding to the unoccluded area of the face occlusion image need to be considered, and the facial deformation parameters for driving the expression of the target virtual character are obtained from these features. This avoids the influence of occluders in the facial image on the facial feature analysis process and ensures that the expression of the target virtual character is driven effectively even when the face is occluded.
[0060] In other scenarios, when a complete facial image is needed, a facial image free of occlusions must be reconstructed so that further image analysis can be performed. For example, by analyzing a user's facial expressions in the reconstructed image, the user's experience in games, virtual reality, and similar scenarios can be determined. Furthermore, by analyzing a user's facial features in the reconstructed image, their age group and preferences can be estimated, and suitable game modes or virtual display models can be recommended to the user.
[0061] To meet the above requirements, after executing S120, the method further includes:
[0062] Facial structure is calculated based on facial deformation parameters to obtain a reconstructed facial image corresponding to the occluded facial image. The reconstructed facial image is an image that does not contain the occluder.
[0063] In some embodiments, a pre-trained image processing model can be used to calculate facial structure based on facial deformation parameters to obtain a facial reconstruction image corresponding to a facial occlusion image.
[0064] In other embodiments, facial deformation parameters can be used as input data for a 3D reconstruction algorithm, and the 3D reconstruction algorithm can be used to calculate the facial structure based on the facial deformation parameters to obtain a facial reconstruction image corresponding to the facial occlusion image.
[0065] Therefore, facial deformation parameters are used to perform 3D reconstruction to obtain facial reconstruction images that do not contain occlusions. The facial reconstruction images are then used to determine the user's experience in different scenarios and to recommend preferred modes to the user.
[0066] In another embodiment of this disclosure, a pre-trained image processing model is obtained, and facial features corresponding to the unoccluded region are extracted from the face occlusion image using the pre-trained image processing model. Facial deformation parameters corresponding to the face occlusion image are determined based on the facial features corresponding to the unoccluded region.
[0067] Figure 2 shows a flowchart of another expression transfer method provided in an embodiment of this disclosure.
[0068] As shown in Figure 2, the expression transfer method may include the following steps.
[0069] S210, Obtain the face occlusion image.
[0070] S210 is similar to S110, and will not be described in detail here.
[0071] S220. Use a pre-trained image processing model to extract facial features corresponding to the unoccluded areas from the occluded image, and determine the facial deformation parameters corresponding to the occluded image based on the facial features corresponding to the unoccluded areas.
[0072] In this embodiment, optionally, the image processing model includes a base network, and the base network includes a shallow encoder, a self-attention feature extraction layer, a deep encoder, and a fully connected layer; correspondingly, S220 specifically includes the following steps:
[0073] The shallow encoder extracts shallow features from the face occlusion image;
[0074] The self-attention feature extraction layer processes the shallow features to obtain the non-occluded region of the face occlusion image and to determine the shallow features corresponding to the non-occluded region.
[0075] The shallow features corresponding to the non-occluded area are processed by the deep encoder to obtain the deep features, and the deep features are used as the facial features corresponding to the non-occluded area.
[0076] The facial features corresponding to the non-occluded region are processed using a fully connected network to obtain the facial deformation parameters corresponding to the occluded image.
[0077] Here, shallow features are features that remain close to the content of the face occlusion image itself. Optionally, shallow features include color features, texture features, edge features, and angular features.
[0078] The self-attention feature extraction layer predicts the occluded and unoccluded regions of a face occlusion image based on shallow features. Occluded regions are marked as 1, and unoccluded regions are marked as 0. Further, the unoccluded regions are multiplied by the shallow features to obtain the corresponding shallow features for the unoccluded regions.
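The masking step in [0078] can be sketched as below. Because the text marks occluded positions as 1 and unoccluded positions as 0, this sketch multiplies the shallow features by the complement of that map so that occluded positions are zeroed; that reading, the function name, and the channel-first layout are assumptions.

```python
import numpy as np

def keep_unoccluded(shallow_features, occlusion_map):
    """Zero out shallow features at occluded positions.

    shallow_features: (C, H, W) feature maps from the shallow encoder
    occlusion_map:    (H, W) with 1 = occluded, 0 = unoccluded
    """
    unoccluded = 1.0 - occlusion_map.astype(np.float64)
    # Broadcast the (H, W) map across all C channels.
    return shallow_features * unoccluded[None, :, :]
```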
[0079] Deep features, by contrast, are further removed from the content of the face occlusion image; specifically, they carry semantic information. The deep features are used as the facial features corresponding to the unoccluded areas.
[0080] The fully connected network comprises multiple fully connected layers, which are used to process facial features corresponding to the unoccluded areas to obtain facial deformation parameters. Optionally, the facial deformation parameters may include shape parameter α, texture parameter β, blendshape parameter γ, illumination parameter δ, and camera parameter p.
[0081] To facilitate understanding of how the image processing model obtains the facial features corresponding to the unoccluded areas and the facial deformation parameters, Figure 3 shows a schematic diagram of the image processing model structure.
[0082] As shown in Figure 3, the image processing model includes a base network 10, which comprises a shallow encoder 11, a self-attention feature extraction layer 12, a deep encoder 13, and a fully connected layer 14. Specifically, first, the shallow encoder 11 extracts shallow features from the face occlusion image; then, the shallow features are processed by the self-attention feature extraction layer 12 to obtain the unoccluded region; next, the shallow features are multiplied by the unoccluded region to obtain the shallow features corresponding to the unoccluded region; further, the shallow features corresponding to the unoccluded region are processed by the deep encoder 13 to obtain deep features, which serve as the facial features corresponding to the unoccluded region; finally, the fully connected layer 14 splits into two branches, one of which processes the facial features corresponding to the unoccluded region to obtain facial deformation parameters including α, β, γ, δ, and p. Optionally, p includes the Euler angle r and the displacement vector t.
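How a flat output vector of the fully connected branch might be split into the named parameters can be sketched as follows. The sizes of α, β, and γ follow the component counts stated in [0094] (120, 200, 64); the sizes of δ (27), r (3), and t (3) are assumptions, as are the ordering and the names `PARAM_LAYOUT` and `split_parameters`.

```python
import numpy as np

# (name, size) pairs; delta, r and t sizes and the ordering are assumed.
PARAM_LAYOUT = (
    ("alpha", 120),  # shape parameters
    ("beta", 200),   # texture parameters
    ("gamma", 64),   # blendshape parameters
    ("delta", 27),   # illumination parameters (assumed size)
    ("r", 3),        # Euler angles, part of camera parameter p
    ("t", 3),        # displacement vector, part of camera parameter p
)

def split_parameters(fc_output):
    """Slice the last axis of the network output into named parameter groups."""
    fc_output = np.asarray(fc_output)
    expected = sum(size for _, size in PARAM_LAYOUT)
    if fc_output.shape[-1] != expected:
        raise ValueError(f"expected {expected} values, got {fc_output.shape[-1]}")
    params, start = {}, 0
    for name, size in PARAM_LAYOUT:
        params[name] = fc_output[..., start:start + size]
        start += size
    return params
```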
[0083] S230: Based on facial deformation parameters, drive the facial expressions of the target virtual character.
[0084] It is understandable that after obtaining facial deformation parameters using a pre-trained image processing model, the facial deformation parameters are output, and the facial deformation parameters are used to drive the facial expressions of the target virtual character.
[0085] Therefore, the shallow encoder, self-attention feature extraction layer, deep encoder and fully connected layer contained in the basic network of the pre-trained image processing model can be used to extract the shallow features corresponding to the non-occluded area and further determine the facial deformation parameters, thus ensuring the accuracy of the expression driving of the target virtual character.
[0086] In other scenarios, when a complete facial image is desired, it is necessary to reconstruct a facial image that does not contain occlusions, so that further image analysis can be performed on the facial reconstructed image.
[0087] To meet the above requirements, after executing S220, the method further includes:
[0088] A 3D reconstruction network calculates the facial structure from the facial deformation parameters to obtain the 3D facial structure corresponding to the face occlusion image.
[0089] The renderer fills the 3D facial structure with color to obtain the facial reconstruction image corresponding to the face occlusion image. The facial reconstruction image does not contain the occluder.
[0090] Here, the 3D facial structure can be the facial mesh structure corresponding to the face occlusion image.
[0091] Specifically, first, the 3D reconstruction network calculates the facial shape based on the facial deformation parameters α and γ, and calculates the facial texture based on β. Then, the facial shape is normalized to obtain the facial normal vectors, and the rotation matrix is computed from r. Next, a rigid transformation is applied to the facial shape using the rotation matrix and t to obtain the facial shape in the camera coordinate system, and the facial color is calculated from the facial texture, the facial normal vectors, and δ. Finally, the 3D facial structure is obtained from the facial shape in the camera coordinate system, the triangular facets corresponding to the 3D reconstruction network, and the facial color.
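The shape and rigid-transformation steps of [0091] can be sketched as below. The sketch assumes a linear model with a mean shape and two bases, Euler angles in radians composed as Rz·Ry·Rx, and a rigid transform of the form R·v + t; none of these conventions is stated in the text, and the function names are hypothetical. Texture, normals, and illumination are omitted.

```python
import numpy as np

def face_shape(mean_shape, shape_basis, expr_basis, alpha, gamma):
    """Linear shape model: mean plus shape and blendshape offsets.

    mean_shape: (3N,), shape_basis: (3N, Ka), expr_basis: (3N, Kg)
    Returns the vertices as an (N, 3) array.
    """
    flat = mean_shape + shape_basis @ alpha + expr_basis @ gamma
    return flat.reshape(-1, 3)

def rotation_matrix(r):
    """Rotation matrix from Euler angles r = (rx, ry, rz), assumed order Rz @ Ry @ Rx."""
    rx, ry, rz = r
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    rot_y = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    rot_z = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return rot_z @ rot_y @ rot_x

def to_camera(vertices, r, t):
    """Rigid transform of (N, 3) vertices into the camera coordinate system."""
    return vertices @ rotation_matrix(r).T + np.asarray(t)
```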
[0092] Optionally, the three-dimensional structure of the face can be determined in the following way:
[0093]
[0094] Here, $M_{base}$ represents the three-dimensional structure of the face, $S$ is the facial shape, and $T$ is the facial texture; $\bar{S}$ is the average facial shape obtained from clustering, and $\bar{T}$ is the average facial texture obtained from clustering; $s_i$ is the i-th shape principal component (120 in total), $e_i$ is the i-th principal component of facial blendshape deformation (64 in total), and $t_i$ is the i-th texture principal component (200 in total); $\alpha_i$ is a shape parameter, $\beta_i$ is a texture parameter, and $\gamma_i$ is a blendshape parameter.
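The formula of [0093] is not reproduced in this text. A linear form consistent with the symbol definitions in [0094] and with the computation described in [0091] (shape from α and γ, texture from β) would be the following, writing $\bar{S}$ and $\bar{T}$ for the average shape and average texture. It is a reconstruction under those assumptions, not the published formula, and treating $M_{base}$ as the pair of shape and texture is likewise assumed:

```latex
M_{base} = (S,\, T), \qquad
S = \bar{S} + \sum_{i=1}^{120} \alpha_i\, s_i + \sum_{i=1}^{64} \gamma_i\, e_i, \qquad
T = \bar{T} + \sum_{i=1}^{200} \beta_i\, t_i
```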
[0095] Specifically, the renderer can use a preset rendering function to fill the 3D structure of the face with color, thereby obtaining a reconstructed facial image corresponding to the occluded facial image.
[0096] To better understand the 3D reconstruction and rendering processes, refer to Figure 3: α, β, γ, δ and p are input into the 3D reconstruction network (3DMM) 20 to obtain the 3D facial structure, and the renderer 30 then fills the 3D facial structure with color to obtain the reconstructed face image.
[0097] Therefore, a 3D reconstruction network in a pre-trained image processing model can be used for 3D reconstruction and a renderer can be used for color rendering to obtain a facial reconstruction image that does not contain occlusions. This allows for further use of the facial reconstruction image to determine the user's experience in different scenarios and to recommend preferred modes to the user.
[0098] In other scenarios, after obtaining a reconstructed face image based on a face occlusion image, face recognition can also be performed on the reconstructed face image, thereby realizing face recognition based on a face occlusion image.
[0099] Specifically, the face recognition network 40 in the pre-trained image processing model can be used to perform facial feature recognition on the reconstructed face image to obtain the target facial features, which are then used for face recognition.
[0100] In yet another embodiment of this disclosure, the image processing model is trained in advance using training samples.
[0101] Figure 4 shows a flowchart of another expression transfer method provided in an embodiment of this disclosure.
[0102] As shown in Figure 4, the expression transfer method may include the following steps.
[0103] S410. Obtain training sample pairs; wherein, the training sample pairs include a first face image and a second face image, the first face image is an original face sample image without occlusions, and the second face image is a face reconstruction sample image corresponding to the first face image.
[0104] In this embodiment, before model training, a single image containing multiple expressions, facial movements, and facial poses can be collected as a first facial image, or a series of video frame images containing multiple expressions, facial movements, and facial poses can be collected as a first facial image. The facial reconstruction sample image obtained after facial occlusion and reconstruction of the first facial image is used as a second facial image, and the first facial image and the second facial image constitute a training sample pair.
[0105] Optionally, the expressions may include, but are not limited to, smiling, laughing, crying, being afraid, or being surprised; facial movements may include, but are not limited to, pouting, grinning, opening the mouth, closing the eyes, sticking out the tongue, frowning, or raising the eyebrows; facial postures may include, but are not limited to, facing forward, facing sideways, shaking the head from left to right, or shaking the head from right to left.
[0106] S420. Using the first and second facial images from the training sample pairs, determine the model loss function of the preset network model.
[0107] In this embodiment, before inputting multiple first facial images from the training samples into a preset network, occlusions of different shapes are randomly generated on different first facial images to obtain multiple facial occlusion sample images. For example, mask-shaped occlusions are generated at the nose and mouth positions in the first facial images, sunglasses-shaped occlusions are generated at the eye positions, or leaf-shaped occlusions are generated at the cheek positions. Then, the multiple facial occlusion sample images are input into the preset network model to obtain multiple facial reconstruction prediction images. Finally, a loss function is calculated based on the multiple facial reconstruction prediction images and the corresponding second sample images.
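The random occlusion step of [0107] can be sketched as follows. For brevity the occluder here is a single filled rectangle at a random position and size; the mask-, sunglasses-, and leaf-shaped occluders named in the text would need shape templates that are not described. The function name, size fractions, and fill value are assumptions.

```python
import numpy as np

def add_random_occlusion(image, rng, min_frac=0.2, max_frac=0.5, fill=0):
    """Paste one random rectangular occluder onto a copy of `image`.

    Returns the occluded image and an (H, W) occlusion map (1 = occluded).
    """
    h, w = image.shape[:2]

    def random_extent(length):
        low = max(1, int(length * min_frac))
        high = max(low, int(length * max_frac))
        return int(rng.integers(low, high + 1))  # upper bound is exclusive

    occ_h, occ_w = random_extent(h), random_extent(w)
    top = int(rng.integers(0, h - occ_h + 1))
    left = int(rng.integers(0, w - occ_w + 1))

    occluded = image.copy()
    occluded[top:top + occ_h, left:left + occ_w] = fill
    occlusion_map = np.zeros((h, w), dtype=np.uint8)
    occlusion_map[top:top + occ_h, left:left + occ_w] = 1
    return occluded, occlusion_map
```

Calling this with a fresh random state per first facial image yields a different occluder on each sample, and the returned map can double as supervision for an occlusion-prediction layer if one is trained.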
[0108] As shown in Figure 5, the preset network model includes a base network, a 3D reconstruction network, a renderer, and a face recognition network.
[0109] In some embodiments, the model loss function includes a pixel loss function; wherein the pixel loss function is calculated based on the facial occlusion sample image corresponding to the first facial image, the second facial image, and the facial mask sample image corresponding to the facial occlusion sample image, and the facial mask sample image is obtained by erasing occlusions from the facial occlusion sample image using the base network in the preset network model.
[0110] Optionally, the pixel loss function can be determined as follows:
[0111]
[0112] Here, L_photo is the pixel loss function, I is the facial occlusion sample image, V_I is the facial mask sample image, and I_RE is the second facial image.
[0113] In this embodiment, as shown in Figure 5, the calculation of L_photo can be roughly understood as being based on the first facial image and the second facial image.
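The formula for the pixel loss is not reproduced in the text above, so the following is only a sketch of one plausible form: a mean absolute difference between I and I_RE, weighted by the mask V_I so that occluded pixels do not contribute. The choice of the absolute difference and the normalization by the mask sum are assumptions, and `pixel_loss` is a hypothetical name.

```python
def pixel_loss(I, I_RE, V_I):
    """Mask-weighted mean absolute difference between two images (nested lists)."""
    weighted_error = 0.0
    mask_sum = 0.0
    for row_i, row_r, row_v in zip(I, I_RE, V_I):
        for a, b, v in zip(row_i, row_r, row_v):
            weighted_error += v * abs(a - b)  # v == 0 removes occluded pixels
            mask_sum += v
    # Guard against a fully occluded image.
    return weighted_error / mask_sum if mask_sum > 0 else 0.0
```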
[0114] In other embodiments, the model loss function includes a face recognition loss function; wherein the face recognition loss function is calculated based on a first face recognition feature corresponding to a first face image and a second face recognition feature corresponding to a second face image, the first face recognition feature is obtained by performing face recognition on the first face image using a face recognition network in a preset network model, and the second face recognition feature is obtained by performing face recognition on the second face image using a face recognition network in a preset network model.
[0115] Optionally, the face recognition loss function can be determined as follows:
[0116]
[0117] Here, L_face is the face recognition loss function, f is the first face recognition feature, and the other feature term in the formula is the second face recognition feature.
[0118] In this embodiment, as shown in Figure 5, the calculation of L_face can be roughly understood as follows: it uses the first face recognition feature obtained by processing the first facial image through the face recognition network and the second face recognition feature obtained by processing the second facial image through the face recognition network.
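As an illustration of how two face recognition features might be compared, the sketch below uses one minus the cosine similarity. The formula itself is not reproduced in the text, so the cosine form is an assumption, and `face_recognition_loss` is a hypothetical name.

```python
import math


def face_recognition_loss(f_first, f_second):
    """Return 1 - cosine similarity between two feature vectors (assumed form)."""
    dot = sum(a * b for a, b in zip(f_first, f_second))
    norm_first = math.sqrt(sum(a * a for a in f_first))
    norm_second = math.sqrt(sum(b * b for b in f_second))
    if norm_first == 0.0 or norm_second == 0.0:
        return 1.0  # no direction to compare, treat as maximally dissimilar
    return 1.0 - dot / (norm_first * norm_second)
```

Under this assumed form, features pointing in the same direction give a loss near 0 and orthogonal features give a loss of 1.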
[0119] In some other embodiments, the model loss function includes a keypoint loss function; wherein the keypoint loss function is determined based on a first keypoint loss function and a second keypoint loss function; the first keypoint loss function is calculated from the two-dimensional keypoint features corresponding to the first facial image and the two-dimensional keypoint features of the ground truth annotation of the second facial image, the two-dimensional keypoint features being obtained by extracting keypoints from the facial occlusion sample image corresponding to the first facial image using the base network of the preset network model; the second keypoint loss function is calculated from the three-dimensional keypoint features corresponding to the first facial image and the three-dimensional keypoint features of the ground truth annotation of the second facial image, the three-dimensional keypoint features being obtained by extracting keypoints from the facial occlusion sample image corresponding to the first facial image using the base network and the three-dimensional reconstruction network in the preset network model.
[0120] Specifically, the first keypoint loss function and the second keypoint loss function are weighted and summed to obtain the keypoint loss function.
[0121] Optionally, the keypoint loss function can be determined as follows:
[0122]
[0123] Here, when L_lmk is the first keypoint loss function, p_i is the two-dimensional keypoint feature corresponding to the first facial image, and the other keypoint term in the formula is the two-dimensional keypoint feature of the ground-truth annotation of the second facial image; when L_lmk is the second keypoint loss function, p_i is the three-dimensional keypoint feature corresponding to the first facial image, and the other keypoint term is the three-dimensional keypoint feature of the ground-truth annotation of the second facial image; i is the keypoint index.
[0124] In this embodiment, as shown in Figure 5, one part of L_lmk can be roughly understood as being calculated based on the first and second facial images, while the other part is calculated based on the two-dimensional keypoint features output by the base network and the two-dimensional keypoint features of the ground-truth annotation.
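The weighted sum of the first (2D) and second (3D) keypoint losses can be sketched as follows. The per-keypoint distance is assumed here to be the squared Euclidean distance averaged over keypoints, and the weights `w2d` and `w3d` are hypothetical parameters; the disclosure states only that the two terms are weighted and summed.

```python
def landmark_loss(pred, target):
    """Mean squared Euclidean distance between corresponding keypoints."""
    total = 0.0
    for p, q in zip(pred, target):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / len(pred)


def keypoint_loss(p2d, gt2d, p3d, gt3d, w2d=1.0, w3d=1.0):
    """Weighted sum of the 2D and 3D keypoint terms."""
    return w2d * landmark_loss(p2d, gt2d) + w3d * landmark_loss(p3d, gt3d)
```

The same `landmark_loss` helper serves both terms because it iterates over however many coordinates each keypoint has.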
[0125] In some other embodiments, the model loss function includes a consistency loss function; wherein the consistency loss function is calculated from the facial deformation sample parameters corresponding to different first facial images and the average value of the facial deformation sample parameters corresponding to multiple different first facial images, and the facial deformation sample parameters are obtained by processing the first facial images using the base network in the preset network model.
[0126] Optionally, the consistency loss function can be determined as follows:
[0127]
[0128] Here, L_con is the consistency loss function, α is the facial deformation sample parameter corresponding to each of the different first facial images, and the other term in the formula is the average value of the facial deformation sample parameters corresponding to the multiple different first facial images.
[0129] In this embodiment, as shown in Figure 5, L_con can be roughly understood as being calculated based on the first and second facial images.
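A consistency term of this kind can be illustrated by penalizing how far each parameter vector α deviates from the mean over the group of images. The squared-deviation form below is an assumption, since the formula is not reproduced in the text, and `consistency_loss` is a hypothetical name.

```python
def consistency_loss(alphas):
    """Mean squared deviation of each parameter vector from the group mean."""
    n = len(alphas)
    dim = len(alphas[0])
    mean = [sum(a[k] for a in alphas) / n for k in range(dim)]
    total = sum((a[k] - mean[k]) ** 2 for a in alphas for k in range(dim))
    return total / n
```

When all parameter vectors in the group agree, the loss is zero, so minimizing it pushes the network toward consistent predictions across the group.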
[0130] In some other embodiments, the model loss function includes a mask loss function; wherein the mask loss function is calculated from a predicted face mask image corresponding to the first face image and a reference face mask image corresponding to the first face image, the predicted face mask image is obtained by the base network in the preset network model to remove occlusions from the first face image, and the reference face mask image is obtained by covering the first face image with occlusions and then removing the occlusions.
[0131] Specifically, the reference facial mask image can be obtained by using a random algorithm to cover the first facial image with occluders and then erasing the occluders.
[0132] Optionally, the mask loss function can be determined as follows:
[0133]
[0134] Here, L_mask is the mask loss function, m is the predicted facial mask image corresponding to the first facial image, and the other mask term in the formula is the reference facial mask image corresponding to the first facial image.
[0135] In this embodiment, as shown in Figure 5, L_mask can be understood as being calculated based on the predicted facial mask image output by the base network and the reference facial mask image.
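One common way to compare a predicted mask with a reference mask is a per-pixel binary cross-entropy, sketched below. This form is an assumption, because the formula is not reproduced in the text. `mask_loss` and the clamping constant `eps` are hypothetical.

```python
import math


def mask_loss(m_pred, m_ref, eps=1e-7):
    """Mean binary cross-entropy between predicted and reference masks."""
    total = 0.0
    count = 0
    for row_p, row_r in zip(m_pred, m_ref):
        for p, r in zip(row_p, row_r):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= r * math.log(p) + (1 - r) * math.log(1.0 - p)
            count += 1
    return total / count
```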
[0136] S430. Adjust the sub-networks in the preset network model based on the model loss function, and take the preset network model corresponding to the model loss function being less than or equal to the preset loss function threshold as the image processing model.
[0137] In some embodiments, where the model loss function includes a pixel loss function, S430 specifically includes: adjusting the base network, the 3D reconstruction network, and the renderer in the preset network model based on the pixel loss function in the model loss function.
[0138] Therefore, by using a pixel loss function in the preset network model, the basic network of the preset neural network model can be made to have the function of erasing occluders from facial occluded images.
[0139] In some other embodiments, when the model loss function includes a face recognition loss function, S430 specifically includes: adjusting the face recognition network in the preset network model based on the face recognition loss function in the model loss function.
[0140] Therefore, by using a facial recognition loss function in the preset network model, the facial recognition network of the preset neural network model can be made to perform facial recognition on facial images.
[0141] In some other embodiments, when the model loss function includes a keypoint loss function, S430 specifically includes: adjusting the base network and the 3D reconstruction network in the preset network model based on the keypoint loss function in the model loss function.
[0142] Specifically, the keypoint loss function can be used to adjust the shallow and deep encoders in the base network, as well as the 3D reconstruction network.
[0143] Therefore, the key point loss function in the preset network model consists of two parts, which enables the encoder in the basic network of the preset network model to extract two-dimensional key points, and enables the three-dimensional reconstruction network in the preset network model to extract three-dimensional key points, thereby comprehensively extracting the key point features of the face.
[0144] In some other embodiments, when the model loss function includes a consistency loss function, S430 specifically includes: adjusting the base network in the preset network model based on the consistency loss function in the model loss function.
[0145] Therefore, by using a consistency loss function in the preset network model, the basic network of the preset neural network model can determine the facial deformation parameters, thereby enabling the virtual character's expression and image reconstruction based on the facial deformation parameters, and decoupling the face shape and facial expression.
[0146] In some other embodiments, where the model loss function includes a mask loss function, S430 specifically includes: adjusting the base network in the preset network model based on the mask loss function in the model loss function.
[0147] Specifically, the mask loss function can be used to adjust the self-attention feature extraction layer in the base network.
[0148] Therefore, by using a mask loss function in the preset network model, the self-attention feature extraction layer in the base network of the preset network model acquires the function of occlusion erasure, which prevents occluders in the facial image from reducing the accuracy of facial feature analysis.
[0149] In summary, the image processing model calculates multiple loss functions during the training process and trains the model based on these loss functions. This enables the basic network of the image processing model to perform functions such as occlusion removal in facial occluded images, facial recognition, comprehensive extraction of key facial features, and image reconstruction by calculating facial deformation parameters that drive virtual character expressions. As a result, the image processing model can be deployed and applied in various scenarios.
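The stopping rule in S430 (keep adjusting the sub-networks until the model loss is less than or equal to a preset threshold) can be sketched as a simple loop. Everything here is illustrative: `train_until_threshold`, the `step` callback that performs one update and returns the individual loss terms, the weights, and `max_steps` are hypothetical and are not part of this disclosure.

```python
def train_until_threshold(step, weights, threshold, max_steps=1000):
    """Call `step()` until the weighted total loss is <= threshold.

    `step` performs one parameter update and returns a dict of loss terms.
    Returns (number of steps taken, last total loss).
    """
    total = float("inf")
    for n in range(1, max_steps + 1):
        losses = step()
        total = sum(weights[name] * value for name, value in losses.items())
        if total <= threshold:
            return n, total  # this model state would be kept as the image processing model
    return max_steps, total
```

In this sketch, each loss term would update only the sub-networks listed for it above. The loop shows the threshold check only.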
[0150] S440. Obtain the facial occlusion image.
[0151] S450. Use an image processing model to extract facial features corresponding to the unoccluded areas from the occluded image, and determine the facial deformation parameters corresponding to the occluded image based on the facial features corresponding to the unoccluded areas.
[0152] S460. Based on the facial deformation parameters, drive the expression of the target virtual character.
[0153] S440 to S460 are similar to S210 to S230, and will not be described in detail here.
[0154] This disclosure also provides an expression driving device for implementing the above-described expression driving method, which is described below with reference to Figure 6. In this embodiment, the expression driving device can be an electronic device or a server. The electronic device can include devices with communication functions such as mobile phones, tablets, desktop computers, and laptops. The server can be a cloud server or server cluster, or another device with storage and computing functions.
[0155] Figure 6 shows a schematic diagram of the structure of an expression driving device provided in an embodiment of this disclosure.
[0156] As shown in Figure 6, the expression driving device 600 may include:
[0157] The acquisition module 610 is used to acquire facial occlusion images;
[0158] The determining module 620 is used to extract facial features corresponding to the non-occluded area from the facial occlusion image, and determine facial deformation parameters corresponding to the facial occlusion image based on the facial features corresponding to the non-occluded area.
[0159] The driving module 630 is used to drive the facial expressions of the target virtual character based on the facial deformation parameters.
[0160] This disclosure provides an expression-driving device capable of acquiring a facial occlusion image; extracting facial features corresponding to the unoccluded areas from the occluded image; determining facial deformation parameters corresponding to the occluded image based on the facial features corresponding to the unoccluded areas; and driving the expression of a target virtual character based on the facial deformation parameters. Through this method, when the face is occluded, only the facial features corresponding to the unoccluded areas of the occluded image need to be focused on, and the facial deformation parameters for driving the expression of the target virtual character are obtained based on the facial features corresponding to the unoccluded areas. This achieves expression-driving for the target virtual character, avoiding the influence of occlusions in the facial image on the facial feature analysis process. It ensures that the expression of the target virtual character can be effectively driven even when the face is occluded, guaranteeing the expression-driving effect of the target virtual character.
[0161] In some optional embodiments, the determining module 620 is specifically used to extract facial features corresponding to the unoccluded region from the face occlusion image using a pre-trained image processing model, and to determine the facial deformation parameters corresponding to the face occlusion image based on the facial features corresponding to the unoccluded region.
[0162] In some optional embodiments, the image processing model includes a base network, and the base network includes a shallow encoder, a self-attention feature extraction layer, a deep encoder, and a fully connected network; accordingly, the determining module 620 is specifically used to extract shallow features of the face occlusion image based on the shallow encoder.
[0163] Based on the self-attention feature extraction layer, feature extraction is performed on the shallow features to obtain the unoccluded area of the face occlusion image, and the shallow features corresponding to the unoccluded area are determined.
[0164] The deep encoder processes the shallow features corresponding to the unoccluded area to obtain the deep features, and uses the deep features as the facial features corresponding to the unoccluded area.
[0165] The facial features corresponding to the unoccluded region are processed based on the fully connected network to obtain the facial deformation parameters corresponding to the occluded facial image.
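The data flow through the base network (shallow encoder, self-attention feature extraction layer, deep encoder, fully connected network) can be sketched as below. The four stages are passed in as callables, and the toy stand-ins are purely hypothetical placeholders for trained layers. The element-wise multiplication of the unoccluded-area map with the shallow features follows the wording of claim 1.

```python
def base_network_forward(image, shallow_encoder, attention_layer, deep_encoder, fc_network):
    """Run the four stages in order and return the facial deformation parameters."""
    shallow = shallow_encoder(image)    # shallow feature map (rows of floats)
    visible = attention_layer(shallow)  # same shape, weights near 1 where unoccluded
    # Element-wise product keeps only features from the unoccluded area.
    masked = [[f * v for f, v in zip(f_row, v_row)]
              for f_row, v_row in zip(shallow, visible)]
    deep = deep_encoder(masked)         # deep features of the unoccluded area
    return fc_network(deep)             # facial deformation parameters


# Toy stand-ins (hypothetical): occluded pixels are encoded as negative values.
def toy_shallow(image):
    return [row[:] for row in image]

def toy_attention(features):
    return [[0.0 if f < 0 else 1.0 for f in row] for row in features]

def toy_deep(features):
    return [f for row in features for f in row]

def toy_fc(deep):
    return [sum(deep)]

params = base_network_forward([[1.0, -1.0], [2.0, 3.0]],
                              toy_shallow, toy_attention, toy_deep, toy_fc)
```

In the toy run, the negative "occluded" value is zeroed out before the deep stage, so it has no effect on the resulting parameters.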
[0166] In some alternative embodiments, the device further includes:
[0167] The reconstruction module is used to calculate the facial structure by using the three-dimensional reconstruction network to obtain the three-dimensional facial structure corresponding to the facial occlusion image.
[0168] The rendering module is used to fill the three-dimensional facial structure with color based on the renderer to obtain a facial reconstruction image corresponding to the facial occlusion image, wherein the facial reconstruction image is an image without occlusion.
[0169] In some optional embodiments, the driving module 630 is specifically used to use the facial deformation parameters as driving coefficients of the character model corresponding to the target virtual character, and to drive the expression of the target virtual character based on the driving coefficients.
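Using the facial deformation parameters as driving coefficients can be illustrated with a linear blendshape model, in which each coefficient scales a set of per-vertex offsets that are added to a neutral mesh. The blendshape formulation is an assumption made for illustration, since the disclosure does not specify the structure of the character model. `drive_expression` and its arguments are hypothetical names.

```python
def drive_expression(neutral, deltas, coefficients):
    """Return neutral vertices displaced by coefficient-weighted offsets.

    neutral:      list of (x, y, z) vertices of the character's neutral face
    deltas:       one list of per-vertex (dx, dy, dz) offsets per coefficient
    coefficients: facial deformation parameters used as driving coefficients
    """
    driven = [list(v) for v in neutral]
    for coeff, delta in zip(coefficients, deltas):
        for vertex, offset in zip(driven, delta):
            for axis in range(3):
                vertex[axis] += coeff * offset[axis]
    return driven
```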
[0170] In some alternative embodiments, the device further includes:
[0171] The sample acquisition module is used to acquire training sample pairs; wherein, the training sample pairs include a first face image and a second face image, the first face image is an original face sample image without occlusions, and the second face image is a face reconstruction sample image corresponding to the first face image;
[0172] The determination module is used to determine the model loss function of the preset network model using the first and second facial images in the training sample pair;
[0173] The training module is used to adjust the sub-networks in the preset network model based on the model loss function, and to use the preset network model corresponding to the model loss function being less than or equal to a preset loss function threshold as the image processing model.
[0174] In some optional embodiments, the model loss function includes a pixel loss function; wherein the pixel loss function is calculated based on the facial occlusion sample image corresponding to the first facial image, the second facial image, and the facial mask sample image corresponding to the facial occlusion sample image, and the facial mask sample image is obtained by erasing occlusions from the facial occlusion sample image using the base network in the preset network model;
[0175] Accordingly, the training module is specifically used to adjust the base network, 3D reconstruction network, and renderer in the preset network model based on the pixel loss function in the model loss function.
[0176] In some optional embodiments, the model loss function includes: a face recognition loss function; wherein the face recognition loss function is calculated based on a first face recognition feature corresponding to the first face image and a second face recognition feature corresponding to the second face image, the first face recognition feature is obtained by performing face recognition on the first face image using the face recognition network in the preset network model, and the second face recognition feature is obtained by performing face recognition on the second face image using the face recognition network in the preset network model;
[0177] Accordingly, the training module is specifically used to adjust the face recognition network in the preset network model based on the face recognition loss function in the model loss function.
[0178] In some optional embodiments, the model loss function includes: a keypoint loss function; wherein the keypoint loss function is determined based on a first keypoint loss function and a second keypoint loss function; the first keypoint loss function is calculated from the two-dimensional keypoint features corresponding to the first facial image and the two-dimensional keypoint features of the ground truth annotation of the second facial image, the two-dimensional keypoint features being obtained by extracting keypoints from the facial occlusion sample image corresponding to the first facial image using the base network of the preset network model; the second keypoint loss function is calculated from the three-dimensional keypoint features corresponding to the first facial image and the three-dimensional keypoint features of the ground truth annotation of the second facial image, the three-dimensional keypoint features being obtained by extracting keypoints from the facial occlusion sample image corresponding to the first facial image using the base network and the three-dimensional reconstruction network in the preset network model;
[0179] Accordingly, the training module is specifically used to adjust the base network and the 3D reconstruction network in the preset network model based on the key point loss function in the model loss function.
[0180] In some optional embodiments, the model loss function includes: a consistency loss function; wherein the consistency loss function is calculated from the facial deformation sample parameters corresponding to different first facial images and the average of the facial deformation sample parameters corresponding to multiple different first facial images, and the facial deformation sample parameters are obtained by processing the first facial images using the base network in the preset network model;
[0181] Accordingly, the training module is specifically used to adjust the base network in the preset network model based on the consistency loss function in the model loss function.
[0182] In some optional embodiments, the model loss function includes a mask loss function; wherein the mask loss function is calculated from a predicted face mask image corresponding to the first face image and a reference face mask image corresponding to the first face image, the predicted face mask image is obtained by the base network in the preset network model to remove occlusions from the first face image, and the reference face mask image is obtained by covering the first face image with occlusions and then removing the occlusions;
[0183] Accordingly, the training module is specifically used to adjust the base network in the preset network model based on the mask loss function in the model loss function.
[0184] It should be noted that the expression driving device 600 shown in Figure 6 can perform the steps of the method embodiments shown in Figures 1 to 5 and achieve the processes and effects of those method embodiments, which are not described in detail here.
[0185] Exemplary embodiments of this disclosure also provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor. The memory stores a computer program executable by the at least one processor, which, when executed by the at least one processor, causes the electronic device to perform a method according to an embodiment of this disclosure.
[0186] Exemplary embodiments of this disclosure also provide a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a computer's processor, is used to cause the computer to perform a method according to embodiments of this disclosure.
[0187] Exemplary embodiments of this disclosure also provide a computer program product, including a computer program, wherein the computer program, when executed by a computer's processor, is used to cause the computer to perform a method according to an embodiment of this disclosure.
[0188] Referring to Figure 7, a structural block diagram of an electronic device 700 that can serve as a server or client of this disclosure is now described. This is an example of a hardware device that can be applied to various aspects of this disclosure, and the electronic device 700 can be the aforementioned electronic device. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of this disclosure described and / or claimed herein.
[0189] As shown in Figure 7, the electronic device 700 includes a computing unit 701, which can perform various appropriate actions and processes based on a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 may also store various programs and data required for the operation of the electronic device 700. The computing unit 701, ROM 702, and RAM 703 are interconnected via a bus 704. An input / output (I / O) interface 705 is also connected to the bus 704.
[0190] Multiple components in the electronic device 700 are connected to the I / O interface 705, including an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 can be any type of device capable of inputting information to the electronic device 700. The input unit 706 can receive input digital or character information and generate key signal inputs related to user settings and / or function control of the electronic device 700. The output unit 707 can be any type of device capable of presenting information and may include, but is not limited to, a display, speaker, video / audio output terminal, vibrator, and / or printer. The storage unit 708 may include, but is not limited to, magnetic disks and optical disks. The communication unit 709 allows the electronic device 700 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and / or chipsets, such as Bluetooth™ devices, WiFi devices, WiMax devices, cellular communication devices, and / or the like.
[0191] The computing unit 701 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above. For example, in some embodiments, the facial expression transfer method for a virtual character can be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program can be loaded and / or installed on the electronic device 700 via ROM 702 and / or communication unit 709. In some embodiments, the computing unit 701 can be configured to perform the facial expression transfer method for a virtual character by any other suitable means (e.g., by means of firmware).
[0192] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0193] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0194] As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and / or apparatus (e.g., disk, optical disk, memory, programmable logic device (PLD)) for providing machine instructions and / or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal for providing machine instructions and / or data to a programmable processor.
[0195] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0196] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0197] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other.
[0198] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
[0199] The above are merely specific embodiments of this disclosure, enabling those skilled in the art to understand or implement this disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this disclosure. Therefore, this disclosure is not to be limited to these embodiments, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. An expression driving method, characterized in that it comprises:
obtaining an occluded face image;
extracting facial features corresponding to the unoccluded area from the occluded face image, and determining facial deformation parameters corresponding to the occluded face image based on the facial features corresponding to the unoccluded area; and
driving the facial expression of a target virtual character based on the facial deformation parameters;
wherein extracting the facial features corresponding to the unoccluded area from the occluded face image and determining the facial deformation parameters corresponding to the occluded face image based on the facial features corresponding to the unoccluded area comprises: using a pre-trained image processing model to extract the facial features corresponding to the unoccluded area from the occluded face image, and determining the facial deformation parameters corresponding to the occluded face image based on the facial features corresponding to the unoccluded area;
wherein the image processing model includes a base network, and the base network includes a shallow encoder, a self-attention feature extraction layer, a deep encoder, and a fully connected network;
and accordingly, using the pre-trained image processing model to extract the facial features corresponding to the unoccluded area from the occluded face image, and determining the facial deformation parameters corresponding to the occluded face image based on the facial features corresponding to the unoccluded area, comprises:
extracting shallow features of the occluded face image with the shallow encoder;
performing feature extraction on the shallow features with the self-attention feature extraction layer to obtain the unoccluded area of the occluded face image;
multiplying the unoccluded area with the shallow features to obtain the shallow features corresponding to the unoccluded area;
processing the shallow features corresponding to the unoccluded area with the deep encoder to obtain deep features, and using the deep features as the facial features corresponding to the unoccluded area; and
processing the facial features corresponding to the unoccluded area with the fully connected network to obtain the facial deformation parameters corresponding to the occluded face image.
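The data flow recited in claim 1 can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the layer sizes, the random weights, the names `make_params` and `predict_deformation`, the average pooling in the deep encoder, and the per-location sigmoid gate that stands in for a real self-attention layer are all hypothetical.

```python
import math
import random

def linear(x, weights, bias):
    """Affine map y = W x + b for one feature vector stored as a list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def make_layer(n_out, n_in, rng):
    """Hypothetical random weights; a trained model would load learned values."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_params(n_in=3, n_shallow=8, n_deep=16, n_coeffs=4, seed=0):
    """All sizes are assumptions chosen only to keep the example small."""
    rng = random.Random(seed)
    return {
        "shallow": make_layer(n_shallow, n_in, rng),
        "attn": make_layer(1, n_shallow, rng),
        "deep": make_layer(n_deep, n_shallow, rng),
        "fc": make_layer(n_coeffs, n_deep, rng),
    }

def predict_deformation(pixels, params):
    """pixels: list of per-location input vectors (for example RGB values)."""
    # 1. Shallow encoder: per-location shallow features.
    shallow = [relu(linear(p, *params["shallow"])) for p in pixels]
    # 2. Gate standing in for the self-attention layer: one score in (0, 1)
    #    per location, read as "how likely this location is unoccluded".
    mask = [sigmoid(linear(f, *params["attn"])[0]) for f in shallow]
    # 3. Multiply the mask with the shallow features to suppress occluded locations.
    masked = [[m * v for v in f] for m, f in zip(mask, shallow)]
    # 4. Deep encoder: encode each location, then average-pool into one vector.
    per_location = [relu(linear(f, *params["deep"])) for f in masked]
    deep = [sum(column) / len(per_location) for column in zip(*per_location)]
    # 5. Fully connected network: facial deformation parameters.
    return mask, linear(deep, *params["fc"])

params = make_params()
pixels = [[0.2, 0.4, 0.6], [0.9, 0.1, 0.3], [0.5, 0.5, 0.5], [0.0, 1.0, 0.0]]
mask, deformation = predict_deformation(pixels, params)
```

Here `mask` holds one soft visibility score per input location and `deformation` holds one value per deformation parameter. In a real model the gate would be learned so that occluded locations receive scores near zero.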
2. The method according to claim 1, characterized in that the image processing model further includes a three-dimensional reconstruction network and a renderer, and the method further comprises: computing the facial structure with the three-dimensional reconstruction network to obtain a three-dimensional facial structure corresponding to the occluded face image; and filling the three-dimensional facial structure with color using the renderer to obtain a reconstructed face image corresponding to the occluded face image, wherein the reconstructed face image does not contain the occluder.
3. The method according to claim 1 or 2, characterized in that driving the facial expression of the target virtual character based on the facial deformation parameters comprises: using the facial deformation parameters as driving coefficients of the character model corresponding to the target virtual character, and driving the facial expression of the target virtual character based on the driving coefficients.
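One common way to use coefficients to drive a character model is a linear blendshape rig, sketched below. The claim does not say which rig is used, so the blendshape formulation, the function name `drive_character`, and the clamping of coefficients to [0, 1] are assumptions made for illustration.

```python
def drive_character(neutral, blendshape_deltas, coeffs):
    """Deform a neutral mesh with a linear blendshape rig (assumed rig type).

    neutral: list of (x, y, z) vertices of the neutral face.
    blendshape_deltas: one list of per-vertex (dx, dy, dz) offsets per blendshape.
    coeffs: driving coefficients, one per blendshape.
    """
    if len(blendshape_deltas) != len(coeffs):
        raise ValueError("need exactly one coefficient per blendshape")
    # Clamping is an assumption; some rigs allow values outside [0, 1].
    weights = [min(1.0, max(0.0, c)) for c in coeffs]
    driven = []
    for v, base in enumerate(neutral):
        driven.append(tuple(
            base[axis] + sum(w * d[v][axis] for w, d in zip(weights, blendshape_deltas))
            for axis in range(3)
        ))
    return driven

# Two vertices, two hypothetical blendshapes.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
deltas = [
    [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0)],  # moves vertex 0 along y
    [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)],  # moves vertex 1 along z
]
driven = drive_character(neutral, deltas, [0.5, 1.5])
```

With these inputs the second coefficient is clamped from 1.5 to 1.0 before it is applied.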
4. The method according to claim 1, characterized in that the method further comprises: obtaining training sample pairs, wherein each training sample pair includes a first face image and a second face image, the first face image is an original face sample image without occlusion, and the second face image is a face reconstruction sample image corresponding to the first face image; determining a model loss function of a preset network model using the first face image and the second face image of the training sample pair; and adjusting the sub-networks of the preset network model based on the model loss function, and taking the preset network model obtained when the model loss function is less than or equal to a preset loss function threshold as the image processing model.
5. The method according to claim 4, characterized in that the model loss function includes a pixel loss function, wherein the pixel loss function is calculated from the occluded face sample image corresponding to the first face image, the second face image, and the face mask sample image corresponding to the occluded face sample image, and the face mask sample image is obtained by erasing the occluder from the occluded face sample image using the base network of the preset network model; accordingly, adjusting the sub-networks of the preset network model based on the model loss function comprises: adjusting the base network, the three-dimensional reconstruction network, and the renderer of the preset network model based on the pixel loss function in the model loss function.
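A pixel loss built from these three inputs might look like the following sketch. The claim names the inputs but not the formula, so the mask-weighted mean absolute difference, the flattened grayscale representation, and the name `masked_pixel_loss` are assumptions.

```python
def masked_pixel_loss(occluded, reconstructed, face_mask):
    """Mean absolute pixel difference over locations the mask marks as visible face.

    All three arguments are flattened images of equal length; face_mask values
    lie in [0, 1], with 0 meaning occluded or background.
    """
    total = 0.0
    weight = 0.0
    for o, r, m in zip(occluded, reconstructed, face_mask):
        total += m * abs(o - r)
        weight += m
    # With an empty mask there is nothing to compare, so return zero loss.
    return total / weight if weight > 0 else 0.0

occluded_img = [0.2, 0.8, 0.5, 1.0]
reconstructed_img = [0.1, 0.8, 0.9, 0.0]
mask_img = [1, 1, 0, 0]  # last two pixels are covered by the occluder
pixel_loss = masked_pixel_loss(occluded_img, reconstructed_img, mask_img)
```

Because the last two pixels are masked out, the large differences there do not contribute to the loss, which is the point of weighting by the mask.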
6. The method according to claim 4, characterized in that the model loss function includes a face recognition loss function, wherein the face recognition loss function is calculated from a first face recognition feature corresponding to the first face image and a second face recognition feature corresponding to the second face image, the first face recognition feature is obtained by performing face recognition on the first face image using the face recognition network of the preset network model, and the second face recognition feature is obtained by performing face recognition on the second face image using the face recognition network of the preset network model; accordingly, adjusting the sub-networks of the preset network model based on the model loss function comprises: adjusting the face recognition network of the preset network model based on the face recognition loss function in the model loss function.
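A loss that compares two recognition features is often written as one minus their cosine similarity, as in the sketch below. The claim does not fix the formula, so the cosine form and the name `recognition_loss` are assumptions.

```python
import math

def recognition_loss(feature_a, feature_b):
    """1 - cosine similarity between two face recognition feature vectors."""
    dot = sum(a * b for a, b in zip(feature_a, feature_b))
    norm = math.sqrt(sum(a * a for a in feature_a)) * math.sqrt(sum(b * b for b in feature_b))
    if norm == 0.0:
        return 1.0  # treat a zero vector as carrying no identity information
    return 1.0 - dot / norm

same_identity = recognition_loss([1.0, 2.0], [2.0, 4.0])   # parallel features
unrelated = recognition_loss([1.0, 0.0], [0.0, 1.0])       # orthogonal features
```

Parallel features give a loss near zero regardless of their magnitude, while orthogonal features give a loss of one.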
7. The method according to claim 4, characterized in that the model loss function includes a keypoint loss function, wherein the keypoint loss function is determined from a first keypoint loss function and a second keypoint loss function; the first keypoint loss function is calculated from the two-dimensional keypoint features corresponding to the first face image and the ground-truth two-dimensional keypoint features annotated for the second face image, the two-dimensional keypoint features being obtained by extracting keypoints from the occluded face sample image corresponding to the first face image using the base network of the preset network model; the second keypoint loss function is calculated from the three-dimensional keypoint features corresponding to the first face image and the ground-truth three-dimensional keypoint features annotated for the second face image, the three-dimensional keypoint features being obtained by extracting keypoints from the occluded face sample image corresponding to the first face image using the base network and the three-dimensional reconstruction network of the preset network model; accordingly, adjusting the sub-networks of the preset network model based on the model loss function comprises: adjusting the base network and the three-dimensional reconstruction network of the preset network model based on the keypoint loss function in the model loss function.
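The combination of a two-dimensional and a three-dimensional keypoint term can be sketched as a weighted sum of mean squared distances. The squared-distance form, the weights `w_2d` and `w_3d`, and the function names are assumptions; the claim only states that the keypoint loss is determined from the two terms.

```python
def mean_squared_distance(predicted, target):
    """Average squared Euclidean distance between matching keypoints."""
    if len(predicted) != len(target):
        raise ValueError("keypoint lists must have the same length")
    total = 0.0
    for p, t in zip(predicted, target):
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total / len(predicted)

def keypoint_loss(pred_2d, gt_2d, pred_3d, gt_3d, w_2d=1.0, w_3d=1.0):
    """Weighted sum of the 2D term and the 3D term (weights are hypothetical)."""
    return (w_2d * mean_squared_distance(pred_2d, gt_2d)
            + w_3d * mean_squared_distance(pred_3d, gt_3d))

kp_loss = keypoint_loss(
    pred_2d=[(0.0, 0.0), (1.0, 1.0)], gt_2d=[(0.0, 0.0), (1.0, 2.0)],
    pred_3d=[(0.0, 0.0, 0.0)], gt_3d=[(1.0, 2.0, 2.0)],
    w_2d=1.0, w_3d=0.5,
)
```

In this example the 2D term is 0.5 and the 3D term is 9.0, so the weighted total is 5.0.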
8. The method according to claim 4, characterized in that the model loss function includes a consistency loss function, wherein the consistency loss function is calculated from the facial deformation sample parameters corresponding to a plurality of different first face images and the average of those facial deformation sample parameters, and the facial deformation sample parameters are obtained by processing the first face images using the base network of the preset network model; accordingly, adjusting the sub-networks of the preset network model based on the model loss function comprises: adjusting the base network of the preset network model based on the consistency loss function in the model loss function.
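One plausible reading of a loss computed from per-image parameters and their average is the mean squared deviation from that average, sketched below. The exact formula and the choice of which images are grouped together are not given in the claim, so both are assumptions, as is the name `consistency_loss`.

```python
def consistency_loss(parameter_sets):
    """Mean squared deviation of each parameter vector from the group average.

    parameter_sets: list of deformation parameter vectors, one per face image
    in a group that is expected to yield similar parameters (assumed grouping).
    """
    count = len(parameter_sets)
    dim = len(parameter_sets[0])
    average = [sum(p[i] for p in parameter_sets) / count for i in range(dim)]
    total = sum((p[i] - average[i]) ** 2 for p in parameter_sets for i in range(dim))
    return total / (count * dim)

spread = consistency_loss([[1.0, 2.0], [3.0, 4.0]])     # average is [2.0, 3.0]
identical = consistency_loss([[0.5, 0.5], [0.5, 0.5]])  # no deviation
```

The loss is zero only when every image in the group produces the same deformation parameters.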
9. The method according to claim 4, characterized in that the model loss function includes a mask loss function, wherein the mask loss function is calculated from the face mask prediction image corresponding to the first face image and the face mask reference image corresponding to the first face image, the face mask prediction image is obtained by erasing the occluder of the first face image using the base network of the preset network model, and the face mask reference image is obtained by covering the first face image with an occluder and then erasing the occluder; accordingly, adjusting the sub-networks of the preset network model based on the model loss function comprises: adjusting the base network of the preset network model based on the mask loss function in the model loss function.
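A loss between a predicted mask and a reference mask is commonly a per-pixel binary cross-entropy, as in this sketch. The claim does not name the formula, so the cross-entropy form, the clamping constant `eps`, and the name `mask_loss` are assumptions.

```python
import math

def mask_loss(predicted_mask, reference_mask, eps=1e-7):
    """Mean binary cross-entropy between a soft predicted mask and a 0/1 reference."""
    total = 0.0
    for p, r in zip(predicted_mask, reference_mask):
        p = min(1.0 - eps, max(eps, p))  # avoid log(0)
        total += -(r * math.log(p) + (1 - r) * math.log(1.0 - p))
    return total / len(predicted_mask)

good_prediction = mask_loss([0.9, 0.1], [1, 0])
bad_prediction = mask_loss([0.1, 0.9], [1, 0])
```

A prediction that agrees with the reference mask yields a small loss, and the loss grows as the predicted visibility scores move toward the wrong label.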
10. An expression driving device, characterized in that it comprises:
an acquisition module, configured to obtain an occluded face image;
a determination module, configured to extract facial features corresponding to the unoccluded area from the occluded face image, and to determine facial deformation parameters corresponding to the occluded face image based on the facial features corresponding to the unoccluded area; and
a driving module, configured to drive the facial expression of a target virtual character based on the facial deformation parameters;
wherein extracting the facial features corresponding to the unoccluded area from the occluded face image and determining the facial deformation parameters corresponding to the occluded face image based on the facial features corresponding to the unoccluded area comprises: using a pre-trained image processing model to extract the facial features corresponding to the unoccluded area from the occluded face image, and determining the facial deformation parameters corresponding to the occluded face image based on the facial features corresponding to the unoccluded area;
wherein the image processing model includes a base network, and the base network includes a shallow encoder, a self-attention feature extraction layer, a deep encoder, and a fully connected network;
and accordingly, using the pre-trained image processing model to extract the facial features corresponding to the unoccluded area from the occluded face image, and determining the facial deformation parameters corresponding to the occluded face image based on the facial features corresponding to the unoccluded area, comprises:
extracting shallow features of the occluded face image with the shallow encoder;
performing feature extraction on the shallow features with the self-attention feature extraction layer to obtain the unoccluded area of the occluded face image;
multiplying the unoccluded area with the shallow features to obtain the shallow features corresponding to the unoccluded area;
processing the shallow features corresponding to the unoccluded area with the deep encoder to obtain deep features, and using the deep features as the facial features corresponding to the unoccluded area; and
processing the facial features corresponding to the unoccluded area with the fully connected network to obtain the facial deformation parameters corresponding to the occluded face image.
11. An electronic device, characterized in that it comprises: a processor; and a memory for storing executable instructions; wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the method of any one of claims 1-9.
12. A computer-readable storage medium, characterized in that the storage medium stores a computer program that, when executed by a processor, causes the processor to implement the method of any one of claims 1-9.
Citation Information
Patent Citations
Three-dimensional face generation method and device and three-dimensional face replaying method and device
CN114898034A
Two-dimensional to three-dimensional facial expression migration method, electronic device and storage medium
CN114926581A
Single-image face three-dimensional reconstruction method based on self-alignment double regression
CN114972619A