A face normal estimation method based on global and local information fusion

By combining CNN and Transformer networks, a face normal estimation method that fuses global and local information is constructed, which solves the problems of complex network architecture, high data requirements and insufficient information fusion in existing technologies. This method achieves more accurate and stable normal estimation, especially for efficient face image reconstruction under complex lighting conditions.

CN118692122B · Active · Publication Date: 2026-06-23 · TIANJIN POLYTECHNIC UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: TIANJIN POLYTECHNIC UNIV
Filing Date: 2024-04-03
Publication Date: 2026-06-23

Abstract

The application discloses a face normal estimation method based on global and local information fusion, which is realized by inputting a face image into a trained face normal estimation model. The face normal estimation model is composed of a face feature extraction and encoding module E_f, a global encoding module E_t and a decoding module D_n. The network structure is built by combining a CNN with a Transformer: the CNN, which is good at extracting local image features, captures the fine structure and texture information of the face, while the Transformer, which can capture long-distance dependencies in the image, contributes its global information processing capability. Global and local information are thereby effectively extracted and fused, so that the model also responds accurately to changes in the fine structure and texture of the face, which improves the accuracy and stability of face image normal estimation.

Description

Technical Field

[0001] This invention relates to the field of face modeling technology, and in particular to a face normal estimation method based on the fusion of global and local information.

Background Technology

[0002] Face normal estimation aims to reconstruct a 3D face from a given face image. In recent years, it has attracted considerable attention because of its broad application potential in downstream tasks such as face relighting and face editing. In the field of face modeling, face normal estimation is a crucial step that significantly affects subsequent tasks such as rendering and recognition. However, because face structures are complex and diverse, traditional normal estimation methods often fail to achieve satisfactory results.

[0003] Currently, Convolutional Neural Networks (CNNs) dominate the field of face normal estimation. Standard CNN models typically employ an encoder-decoder architecture, in which the encoder learns feature representations and the decoder makes pixel-level predictions from these features. The feature representation learning of the encoder is the most crucial aspect, and different network structures have been designed to enhance it. George Trigeorgis et al. and Tero Karras et al. focused on using 3D morphable face models (3DMMs) to construct synthetic paired data to optimize the performance of CNNs on real-world face shape estimation tasks. However, the strategies employed in these studies did not adequately consider the inherent differences between synthetic and real-world data, which may weaken model generalization and performance. To bridge this gap between real-world and synthetic data, Sengupta et al. combined real-world datasets with synthetic datasets to train their models, aiming to improve robustness across different data sources. Meanwhile, Abrevaya et al. designed a network architecture with deactivable skip connections, which adapts flexibly to paired and unpaired training data and thereby enhances the model's adaptability under diverse input conditions. Wang et al. further proposed a two-stage learning paradigm that generates high-quality face normals through an instance-based learning mechanism. While the above methods improve the overall accuracy of face normal estimation to some extent, their ability to capture and reconstruct facial micro-details still needs improvement.

[0004] It is evident that although significant progress has been made in face normal estimation based on CNN models, it still faces the following limitations: 1) Complex network architecture, meaning that improving the model's feature capacity and overall performance often relies on a carefully designed and complex network structure; 2) High data requirements and generalization ability requirements, meaning that enhancing the model's generalization ability requires a large amount of training data, highlighting the importance of large-scale datasets; 3) Local region focus problem, meaning that due to the convolutional characteristics of CNNs, they mainly focus on local regions when processing images, limiting the effective modeling of long-distance dependencies, which can lead to inconsistent or inaccurate face normal estimation results from a global perspective.

[0005] Transformer network models are currently widely used in semantic segmentation, image denoising, and image restoration. They excel at capturing broad global contextual information, though they may lose local associations. However, using a Transformer network to encode image patch tokens and directly upsampling the hidden feature representations to a full-resolution dense output does not yield satisfactory results. The main reason is inherent to the Transformer, which treats its input as a one-dimensional sequence and focuses on capturing global contextual information. Consequently, there have been no reports of using Transformer networks for face normal estimation, because the low-resolution features they generate lack the fine-grained localization information required for accurate face normal estimation.

Summary of the Invention

[0006] The purpose of this invention is to provide a face normal estimation method based on the fusion of global and local information to solve the above-mentioned problems in the prior art.

[0007] Therefore, the technical solution of the present invention is as follows:

[0008] A face normal estimation method based on the fusion of global and local information, which is implemented by inputting a face image into a trained face normal estimation model; the face normal estimation model consists of a face feature extraction and encoding module E_f, a global encoding module E_t and a decoding module D_n; wherein,

[0009] The face feature extraction and encoding module E_f uses a pre-trained ResNet18 network model, which consists of five sequentially connected modules;

[0010] The global encoding module E_t consists of an overlapping image patch embedding module and three Transformer modules connected in sequence; the overlapping image patch embedding module is a 3×3 convolutional module; each of the three Transformer modules is composed of three TransBlocks modules connected in sequence;

[0011] The decoding module D_n consists of six modules connected in sequence; five of these modules each consist of a 3×3 deconvolution layer, a BN layer and a ReLU activation function connected in sequence; the remaining module consists of a 3×3 convolutional layer;

[0012] The output of module […] is also connected to the input of module […]; the output of module […] is also connected to the input of module […]; the output of module […] is also connected to the input of module […]; the outputs of modules […], […] and […] are connected to the input of a Concat module, and the output of the Concat module is connected to the input of module […].

[0013] Furthermore, in training the face normal estimation model, the loss function L_total for training the network model is set to:

[0014] $$L_{total} = L_{recon} + \lambda_{adv} L_{adv} + \lambda_{tv} L_{TV}$$

[0015] In the formula, L_recon is the reconstruction loss function for the normal map, in which N is the face normal predicted by the network model and N_gt is the labeled face normal map in the training dataset; L_adv is the adversarial loss function, in which D_adv is the discriminator used during model training; L_TV is the TV loss function, which is computed on the estimated face normal, with p and q being the pixel coordinates in the image and W and H being the width and height of the image, respectively; λ_adv and λ_tv are the balance coefficients of the loss function.

[0016] Furthermore, λ_adv is set to 0.0001 and λ_tv is set to 0.01.

[0017] Furthermore, the learning rate is set to 0.0001 during the training of the face normal estimation model.

[0018] Compared with existing technologies, this face normal estimation method based on the fusion of global and local information combines a Transformer and a CNN to build a face normal estimation network model. It uses the CNN's ability to extract local image features to capture the fine structure and texture information of the face, and the Transformer's global information processing capability to capture long-distance dependencies in the image, which provides global contextual information for normal estimation. Global and local information are thereby effectively extracted and fused, improving the accuracy and stability of face image normal estimation.

Description of the Drawings

[0019] Figure 1 is a flowchart of the face normal estimation method based on the fusion of global and local information of the present invention;

[0020] Figure 2 shows a comparison of the face normal estimation results obtained by the method of this invention and the current state-of-the-art method HFFNE, together with the shadows generated under different lighting conditions;

[0021] Figure 3 shows a comparison of the angle errors of the face normals generated by the method of this invention and the current state-of-the-art methods HFFNE and CM;

[0022] Figure 4 shows a comparison, in terms of fine structure, of the face normal estimation results obtained on the FFHQ dataset by the method of this invention and the current state-of-the-art methods HFFNE and CM;

[0023] Figure 5 shows a comparison of the normal estimation results, the shadows generated under different lighting conditions, and the re-rendered face images obtained on the ICT-3DRFE dataset by the method of this invention and the current state-of-the-art method HFFNE;

[0024] Figure 6 shows a comparison of the face normal estimation results, and of the shadows under different lighting conditions, obtained on the CelebA dataset by the method of this invention and the current state-of-the-art method HFFNE.

Detailed Implementation

[0025] The present invention will be further described below with reference to the accompanying drawings and specific embodiments, but the following embodiments are by no means intended to limit the present invention.

[0026] Referring to Figure 1, the implementation steps of this face normal estimation method based on the fusion of global and local information are as follows:

[0027] S1. Construct a face normal estimation model, which consists of the face feature extraction and encoding module E_f, the global encoding module E_t and the decoding module D_n;

[0028] The specific implementation steps of step S1 are as follows:

[0029] S101. Construct a CNN-based face feature extraction and encoding module E_f;

[0030] The face feature extraction and encoding module E_f uses a pre-trained ResNet18 network model, the specific structure of which can be found in: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020. Accordingly, this module consists of five sequentially connected convolutional modules, which extract multi-scale feature maps layer by layer to effectively represent information about the face at different scales.

[0031] Specifically, a given face image I ∈ R^{H×W×C} is input to the face feature extraction and encoding module E_f, whose five-layer convolutional module structure extracts multi-scale feature maps layer by layer, so that the feature maps effectively represent the information of the face at different scales. The feature map sizes are as follows: the size of the first-layer feature map is […], the size of the second-layer feature map is […], and so on, up to the fifth-layer feature map, whose size is […]. Using a pre-trained network takes advantage of transfer learning and significantly reduces the time and computational resources required compared with training from scratch. The information-rich multi-scale features obtained by E_f provide a solid foundation for the subsequent face normal estimation task.

[0032] S102. Construct a Transformer-based global encoding module E_t to effectively capture key long-range dependency features in face images;

[0033] CNNs are designed for local feature extraction and focus on capturing fine-grained details within limited regions of the input image. Such local methods easily neglect aspects that are fundamental to high-quality face normal recovery, such as the global facial structure and the overall distribution of normals on the face. It is therefore necessary to construct a global encoding module E_t that jointly captures long-range image relationships, which enhances the model's understanding of the global context.

[0034] The global encoding module E_t consists of an overlapping image patch embedding module (OverlapPatchEmbed) and three Transformer modules connected in sequence. The OverlapPatchEmbed module is a 3×3 convolutional module. The three Transformer modules have the same structure, and each is composed of three TransBlocks modules connected in sequence. For the specific structure of each TransBlocks module, please refer to: Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in CVPR, 2022, pp. 5728-5739.

[0035] Specifically, a given face image I ∈ R^{H×W×C} is input to the global encoding module E_t, in which the image is downsampled three times consecutively, so that the size of the features extracted by the module is reduced to 1/8 of the original image size. This ensures an appropriate balance between computational efficiency and preservation of the critical global facial context.
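The 1/8 figure follows from three consecutive downsampling steps if each step halves the height and width. The sketch below illustrates that arithmetic only; the factor of 2 per step and the 256×256 input size are assumptions for illustration and are not stated in the text.

```python
def downsampled_size(height, width, num_downsamples=3, factor=2):
    """Spatial size after repeatedly downsampling by an integer factor.

    Assumption: every downsampling step divides each side by `factor`.
    """
    for _ in range(num_downsamples):
        height //= factor
        width //= factor
    return height, width


# Hypothetical 256x256 input: three halvings give 32x32, i.e. 1/8 per side.
print(downsampled_size(256, 256))  # (32, 32)
```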

[0036] S103. Construct the decoding module D_n for face normal estimation;

[0037] During face normal estimation, the decoding module D_n must fuse the global features obtained by the global encoding module E_t with the local features extracted by the face feature extraction and encoding module E_f; only then can accurate face normal estimation results be obtained.

[0038] The decoding module D_n consists of six modules connected in sequence; five of these modules have the same structure, each consisting of a 3×3 deconvolution layer, a BN layer and a ReLU activation function connected in sequence; the remaining module consists of a 3×3 convolutional layer.
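How much a 3×3 deconvolution (transposed convolution) layer enlarges a feature map depends on its stride and padding, which the text does not give. The sketch below uses the standard output-size relation for a transposed convolution with dilation 1 and assumes stride 2, padding 1 and output padding 1, a common choice that exactly doubles each side. These three values are assumptions, not parameters taken from the patent.

```python
def deconv_output_size(size_in, kernel=3, stride=2, padding=1, output_padding=1):
    """Output side length of a transposed convolution with dilation 1.

    The stride, padding and output_padding defaults are illustrative assumptions.
    """
    return (size_in - 1) * stride - 2 * padding + kernel + output_padding


# Hypothetical 32-pixel side (1/8 of a 256-pixel input) passed through
# three such layers returns to the full 256-pixel side.
size = 32
for _ in range(3):
    size = deconv_output_size(size)
print(size)  # 256
```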

[0039] S104. Construct a face normal estimation model;

[0040] The face normal estimation model includes the face feature extraction and encoding module E_f, the global encoding module E_t, the decoding module D_n and a Concat module; wherein the output of module […] is also connected to the input of module […], so that the output information of module […] is input to module […]; the output of module […] is also connected to the input of module […], so that the output information of module […] is input to module […]; the output of module […] is also connected to the input of module […], so that the output information of module […] is input to module […]; the outputs of modules […], […] and […] are connected to the input of the Concat module, so that their output information is input to the Concat module for the concatenation operation; and the output of the Concat module is connected to the input of module […], so that the concatenation result output by the Concat module is input to module […].

[0041] Specifically, the face normal estimation model uses the face feature extraction and encoding module E_f to capture local features of the face image. As the network deepens, the size of the feature map gradually decreases until a key node is reached, where the feature map size is one-eighth of the original size. The features at this key node are then fused with the global features through a concatenation operation, represented as follows: […] By combining global context information with local detail features, the model gains a broader and more comprehensive view, which enables it to estimate the face normals more accurately.
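Channel-wise concatenation requires the fused feature maps to share the same spatial size, which is why the fusion happens at the node where the local features are also at one-eighth resolution. The sketch below tracks only the shapes; the channel counts and the 32×32 spatial size are hypothetical, and the three inputs mirror the three outputs that the text says feed the Concat module.

```python
def concat_channels(*shapes):
    """Shape of the channel-wise concatenation of (C, H, W) feature maps."""
    _, height, width = shapes[0]
    for _, h, w in shapes:
        if (h, w) != (height, width):
            raise ValueError("spatial sizes must match for channel concatenation")
    # Channels add up; height and width are unchanged.
    return (sum(c for c, _, _ in shapes), height, width)


# Hypothetical shapes at 1/8 resolution of a 256x256 input.
feat_a = (128, 32, 32)
feat_b = (48, 32, 32)
feat_c = (48, 32, 32)
print(concat_channels(feat_a, feat_b, feat_c))  # (224, 32, 32)
```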

[0042] S2. Train the face normal estimation model so that it can accurately predict the face normal results;

[0043] The specific implementation steps of step S2 are as follows:

[0044] The face normal estimation model is developed based on the Ubuntu 22.04 operating system environment using the PyTorch deep learning framework and trained using an NVIDIA GeForce RTX 2080Ti GPU. During the training phase, the Photoface dataset, which contains a rich collection of face images and their corresponding normals, is used as the training samples to train the face normal estimation model.

[0045] During training, the learning rate is set to 0.0001. The learning rate is an important hyperparameter that controls the magnitude of weight updates in the optimization algorithm, and a smaller learning rate helps the model converge stably to a global or local optimum. The loss function L_total required for training the network model is set to:

[0046] $$L_{total} = L_{recon} + \lambda_{adv} L_{adv} + \lambda_{tv} L_{TV}$$

[0047] In the formula, L_recon is the reconstruction loss function for the normal map, which is used to constrain the training of the network model; L_adv is the adversarial loss function, which is used to capture high-frequency details of the normals during face normal estimation and avoids the low-frequency normals that would result from using only a cosine loss; L_TV is the TV loss function, which maintains the integrity of the facial structure and the clarity of the geometry by reducing noise, and thereby helps preserve the realism of facial details while minimizing artifacts; λ_adv and λ_tv are the balancing coefficients of the loss function, used to adjust the influence of each loss term on the overall loss function to ensure model performance. The individual loss terms are as follows:

[0048] The reconstruction loss function L_recon is expressed as:

[0049]

[0050] In the formula, N is the face normal predicted by the network model, and N_gt is the labeled face normal map in the training dataset;
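The expression for L_recon is not reproduced in this text. Because paragraph [0047] refers to a cosine loss, one plausible form is one minus the cosine similarity between predicted and labeled normals, averaged over pixels. The sketch below implements that assumed form on plain lists of (x, y, z) vectors; it is an illustration under that assumption, not the patent's exact formula.

```python
import math


def cosine_recon_loss(pred, gt, eps=1e-8):
    """Mean of (1 - cosine similarity) over pixels.

    pred, gt: equal-length lists of (x, y, z) normal vectors.
    Assumed form of a cosine reconstruction loss; eps avoids division by zero.
    """
    total = 0.0
    for (px, py, pz), (gx, gy, gz) in zip(pred, gt):
        dot = px * gx + py * gy + pz * gz
        norm = math.sqrt(px * px + py * py + pz * pz) * math.sqrt(gx * gx + gy * gy + gz * gz)
        total += 1.0 - dot / (norm + eps)
    return total / len(pred)


# Identical normals give a loss close to 0; orthogonal normals give 1.
same = cosine_recon_loss([(0.0, 0.0, 1.0)], [(0.0, 0.0, 1.0)])
orthogonal = cosine_recon_loss([(1.0, 0.0, 0.0)], [(0.0, 0.0, 1.0)])
```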

[0051] The adversarial loss function L_adv is expressed as:

[0052]

[0053] In the formula, D_adv is the discriminator used during model training.

[0054] The TV loss function L_TV is expressed as:

[0055]

[0056] In the formula, the loss is computed on the estimated face normal, p and q are the pixel coordinates in the image, and W and H are the width and height of the image, respectively.
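The expression for L_TV is likewise missing from this text. A common total variation form sums the absolute differences between each pixel's normal and its right and lower neighbors, then normalizes by W·H. The sketch below implements that common form as an assumption; the patent's exact norm and normalization may differ.

```python
def tv_loss(normal_map):
    """Anisotropic total variation of a normal map, normalized by W * H.

    normal_map[q][p] is the (x, y, z) normal at column p and row q.
    Assumed form: L1 differences to the right and lower neighbors.
    """
    height = len(normal_map)
    width = len(normal_map[0])
    total = 0.0
    for q in range(height):
        for p in range(width):
            current = normal_map[q][p]
            if p + 1 < width:
                total += sum(abs(a - b) for a, b in zip(normal_map[q][p + 1], current))
            if q + 1 < height:
                total += sum(abs(a - b) for a, b in zip(normal_map[q + 1][p], current))
    return total / (width * height)


# A constant map has zero variation; one abrupt change is penalized.
print(tv_loss([[(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]]))  # 1.0
```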

[0057] In this embodiment, the balance coefficients of the loss function are λ_adv = 0.0001 and λ_tv = 0.01, and training continues until the value of the loss function L_total decreases to its minimum and remains unchanged. Specifically, the face normal estimation model is trained for 30,000 iterations. During this process, the model gradually learns and improves its ability to predict normal information in face images. When the loss function reaches the preset convergence criterion or no longer decreases significantly, the training phase can be considered complete, at which point the model's performance has reached its optimal state under the existing training conditions.
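With these coefficients, the total loss is a weighted sum in which the adversarial and TV terms are scaled down heavily relative to the reconstruction term. The sketch below combines three already-computed scalar losses using the weights stated in this embodiment; the example loss values are made up for illustration.

```python
LAMBDA_ADV = 0.0001  # balance coefficient for the adversarial loss (from the text)
LAMBDA_TV = 0.01     # balance coefficient for the TV loss (from the text)


def total_loss(l_recon, l_adv, l_tv, lambda_adv=LAMBDA_ADV, lambda_tv=LAMBDA_TV):
    """L_total = L_recon + lambda_adv * L_adv + lambda_tv * L_TV."""
    return l_recon + lambda_adv * l_adv + lambda_tv * l_tv


# Hypothetical per-batch values: 0.2 + 0.0001 * 1.5 + 0.01 * 3.0
example = total_loss(0.2, 1.5, 3.0)
```

The small weights mean that the reconstruction term dominates the gradient, while the other two terms act as regularizers on detail and smoothness.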

[0058] Furthermore, in order to verify the effectiveness of the face normal estimation method based on the fusion of global and local information in this application for face normal estimation in face images, a series of new face image data are used as input to a fully trained face normal estimation model. The face normal estimation model is used to automatically extract key features in the image and then output accurate estimation results of face normals.

[0059] Meanwhile, as comparative examples, this verification experiment used the published state-of-the-art HFFNE and CM-E methods to predict face normals on the same face images, and compared their estimation results with the prediction results obtained using the method of this application. For the specific structure of the HFFNE method, please refer to: Meng Wang, Chaoyue Wang, Xiaojie Guo, and Jiawan Zhang, “Towards high-fidelity face normal estimation,” in ACMMM, 2022, pp. 5172-5180; for the specific structure of the CM-E method, please refer to: Victoria Fernandez Abrevaya, Adnane Boukhayma, Philip HS Torr, and Edmond Boyer, “Cross-modal deep face normals with deactivable skip connections,” in CVPR, 2020, pp. 4979-4989.

[0060] Figure 2 shows a comparison of the face normal estimation results, and of the shadows under different lighting conditions, obtained with the method of this application and with the HFFNE method. Input denotes the input image, which is taken from high-resolution face images in the database and is omitted from the figure to avoid infringement (the same applies to Figures 1-6); Normal denotes the face normal estimation result; and S1 to S4 denote the shadow images generated under different lighting conditions from the face normal estimation result, which allow a direct comparison between the normal estimation results. Figure 2 shows that, compared with the current state-of-the-art normal estimation technique, the high-fidelity face normal estimation method (HFFNE), the method of this application (OURS) has a significant advantage in preserving surface details. This is especially evident in the shadow images S1 to S4: the method of this application reproduces facial details such as beards and wrinkles better, which shows that the present invention still provides finer normal estimation under complex lighting changes.
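The text does not give the formula used to render the shadow images S1 to S4; a later paragraph mentions spherical harmonics lighting. As a simpler stand-in, the sketch below shades a single normal under one directional light with the Lambertian rule (the clamped cosine of the angle between normal and light direction). It is an illustrative substitute, not the rendering used for the figures.

```python
import math


def lambert_shading(normal, light_dir):
    """Lambertian shading: max(0, cos of the angle between normal and light)."""
    normal_len = math.sqrt(sum(c * c for c in normal))
    light_len = math.sqrt(sum(c * c for c in light_dir))
    cosine = sum(a * b for a, b in zip(normal, light_dir)) / (normal_len * light_len)
    return max(0.0, cosine)


# A surface facing the light is fully lit; one facing away is dark.
print(lambert_shading((0.0, 0.0, 1.0), (0.0, 0.0, 1.0)))   # 1.0
print(lambert_shading((0.0, 0.0, 1.0), (0.0, 0.0, -1.0)))  # 0.0
```

Because the shading depends only on the normal and the light direction, errors in fine normal detail show up directly as missing or smeared shading detail, which is why such images are useful for visual comparison.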

[0061] Figure 3 shows a comparison of the angle errors of the face normals generated by the method of this application, the HFFNE method and the CM method. In an error map, color intensity is used to visualize the magnitude of the error. In Figure 3, the error is shown in blue, with darker blue indicating more accurate estimation. Compared with the CM and HFFNE methods, the error map of the method of this application is a deeper blue, which shows intuitively that its estimates have smaller errors and are closer to the ground truth.
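The angle error at a pixel is the angle between the predicted normal and the ground-truth normal. A minimal sketch of that per-pixel computation follows; the function name and the choice of degrees as the unit are assumptions for illustration.

```python
import math


def angular_error_deg(n_pred, n_gt):
    """Angle in degrees between a predicted and a ground-truth normal."""
    dot = sum(a * b for a, b in zip(n_pred, n_gt))
    norm = math.sqrt(sum(a * a for a in n_pred)) * math.sqrt(sum(b * b for b in n_gt))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


print(angular_error_deg((0.0, 0.0, 1.0), (0.0, 1.0, 0.0)))  # 90.0
```

Evaluating this at every pixel yields an error map of the kind described above, which can then be color-coded.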

[0062] Figure 4 shows a comparison of the normal estimation results on the FFHQ dataset obtained with the method of this application, the HFFNE method, and the CM method. For details of the FFHQ dataset, please refer to: Tero Karras, Samuli Laine, and Timo Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019, pp. 4401-4410. In the evaluation on the real-world face dataset FFHQ, the method of this application shows a significant advantage over CM and HFFNE, especially in preserving details in the eye region. As shown in the figure, the method of this application reproduces eye details with higher fidelity, which confirms its generalization ability and robustness in extracting and reconstructing complex and critical facial features. In other words, under varying lighting conditions, expression changes, and significant individual differences, the method of this application stably captures and reconstructs the fine structure of the eyes, demonstrating its efficiency and reliability in practical applications.

[0063] Figure 5 shows a comparison of normals, shadows, and re-rendered faces on the ICT-3DRFE dataset obtained with the method of this application and the HFFNE method. For details of the ICT-3DRFE dataset, please refer to: Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, Paul E Debevec, et al., “Rapid acquisition of specular and diffuse normal maps from polarized spherical gradient illumination,” Rendering Techniques, vol. 2007, no. 9, pp. 10, 2007. In evaluating face rendering performance on the ICT-3DRFE dataset with albedo (i.e., face images without illumination effects), the method of this application shows a significant improvement in normal estimation when face images are re-rendered under new lighting conditions. Specifically, compared with the existing HFFNE technique, the method of this application effectively suppresses artificial artifacts when handling illumination changes, and thereby achieves a more natural and realistic relit output for face images. The comparison in Figure 5, especially the region inside the rectangle, shows that the image rendered with the HFFNE method has obvious artificial artifacts, whereas the result of the method of this application significantly reduces such unnatural effects and preserves details better. This can be seen in the Relit column of the figure, which shows the face image after lighting rendering, and confirms the advantages of the method of this application in maintaining visual quality and detail continuity.

[0064] Figure 6 shows a comparison of normal maps and lighting / shading maps on the CelebA dataset obtained with the method of this application and the HFFNE method. For details of the CelebA dataset, please refer to: Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang, “Deep learning face attributes in the wild,” in ICCV, 2015, pp. 3730-3738. Evaluation experiments conducted on the real-world face dataset CelebA verify the effectiveness of the method of this invention. Specifically, the comparison shows that under the various lighting conditions (spherical harmonics) applied to the CelebA dataset, the normal estimation technique proposed in this invention accurately captures and reproduces changes in the geometric details of the face surface in the rendered shadow map. This means that, regardless of changes in the light source, the method of this application accurately estimates the normal direction at each point on the face surface, and thereby generates a more realistic and three-dimensional lighting effect, further improving the visual fidelity and recognition accuracy of face images in complex lighting environments.

[0065] In summary, as can be seen from Figure 1, the method of this application combines the strong ability of convolutional neural networks (CNNs) to capture local feature details with the advantage of Transformer networks in modeling global dependencies, forming a collaborative architecture. By introducing global information, the method of this application enables the face normal estimation model to capture the overall structure and shape features of the face, which provides macroscopic guidance for normal estimation. Meanwhile, the fine capture of local information ensures that the model also responds accurately to subtle structural and textural changes in the face, ultimately achieving accurate estimation of face normals.

[0066] The comparative results presented in Figures 2-6 demonstrate that this architecture optimizes the performance of the normal estimation task. The shadow image samples S1 to S4 generated under various lighting scenarios provide strong evidence, visually showing the superior performance of the proposed method in maintaining both surface geometric details and overall consistency. By integrating the core mechanisms of CNN and Transformer, this invention constructs a solution that captures subtle local differences while taking global contextual information into account, thereby improving the overall accuracy and robustness of the normal estimation task.

Claims

1. A face normal estimation method based on the fusion of global and local information, characterized in that it is implemented by inputting a face image into a trained face normal estimation model; the face normal estimation model consists of a face feature extraction and encoding module E_f, a global encoding module E_t and a decoding module D_n; wherein the face feature extraction and encoding module E_f uses a pre-trained ResNet18 network model, which consists of five sequentially connected modules; the global encoding module E_t consists of an overlapping image patch embedding module and three Transformer modules connected in sequence; the overlapping image patch embedding module is a 3×3 convolutional module; each of the three Transformer modules is composed of three TransBlocks modules connected in sequence; the decoding module D_n consists of six modules connected in sequence; five of these modules each consist of a 3×3 deconvolution layer, a BN layer and a ReLU activation function connected in sequence; the remaining module consists of a 3×3 convolutional layer; the output of module […] is also connected to the input of module […]; the output of module […] is also connected to the input of module […]; the output of module […] is also connected to the input of module […]; the outputs of modules […], […] and […] are connected to the input of a Concat module, and the output of the Concat module is connected to the input of module […].

2. The face normal estimation method based on global and local information fusion according to claim 1, characterized in that, in training the face normal estimation model, the loss function L_total for network model training is set to: $L_{total} = L_{recon} + \lambda_{adv} L_{adv} + \lambda_{tv} L_{TV}$, where L_recon is the reconstruction loss function for the normal map, in which N is the face normal predicted by the network model and N_gt is the labeled face normal map in the training dataset; L_adv is the adversarial loss function, in which D_adv is the discriminator used during model training; L_TV is the TV loss function, which is computed on the estimated face normal, with p and q being the pixel coordinates in the image and W and H being the width and height of the image, respectively; λ_adv and λ_tv are the balance coefficients of the loss function.

3. The face normal estimation method based on global and local information fusion according to claim 2, characterized in that λ_adv is set to 0.0001 and λ_tv is set to 0.01.

4. The face normal estimation method based on global and local information fusion according to claim 2, characterized in that the learning rate is set to 0.0001 during the training of the face normal estimation model.