A new-view hand image synthesis method based on a generative model

By combining a normal map estimation network, a diffusion model, and a generative adversarial network, the method addresses the computational cost and artifact problems in new-view human hand image synthesis and achieves realistic synthesis results.

CN117475019B · Active · Publication Date: 2026-06-23 · SOUTHEAST UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SOUTHEAST UNIV
Filing Date
2023-11-03
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies struggle to synthesize realistic new-perspective images of human hands while keeping computational costs acceptable, especially given the high risk of artifacts or distortions caused by self-occlusion and complex joints of the human hand.

Method used

A normal map estimation network combined with a diffusion model and a generative adversarial network (GAN) is used to synthesize realistic new perspective images of human hands through end-to-end training and cascading. The normal map estimation network is used to estimate the normal map of the target image, the diffusion model is used to synthesize low-resolution images, and the GAN is used to improve image resolution and quality.

Benefits of technology

It keeps the structure and texture of the human hand consistent across viewpoints, reduces computational overhead and time costs, and provides satisfactory synthesis results.

✦ Generated by Eureka AI based on patent content.

Figure CN117475019B_ABST

Abstract

The application provides a new-view hand image synthesis method based on a generative model, which synthesizes hand images from other viewpoints given a single-view input image. The method first prepares paired training data, then designs a normal map estimation network and uses it to estimate the normal map corresponding to the target image. Next, a diffusion-model-based network is pre-trained to synthesize low-resolution new-view hand images, and a super-resolution module based on a generative adversarial network is pre-trained to improve the resolution and quality of low-resolution images. Finally, the generative adversarial network is cascaded after the diffusion model and the two are jointly trained, so that both modules can be applied to the hand image synthesis task. The application requires only a single single-view hand image as input to produce a series of realistic new views, which helps promote multi-view-based three-dimensional reconstruction.

Description

Technical Field

[0001] This invention relates to the field of computer vision, and more specifically, to a new-view human hand image synthesis method based on a generative model.

Background Technology

[0002] Synthesizing dense new viewpoint images from monocular images is crucial for reducing the reconstruction cost of digital humans, especially in NeRF-based multi-view reconstruction tasks, where the need for synthesizing realistic new viewpoint images is particularly prominent. Most existing viewpoint synthesis methods focus on symmetrical or rigid objects, and these methods often perform poorly on multi-jointed objects such as the human body and hands. In particular, the indistinguishable texture of the human hand limits the expressive power of features extracted from the input image. Furthermore, the human hand is more flexible than the human body, with higher degrees of freedom in its joints. This high complexity inevitably leads to self-occlusion of the hand, increasing the risk of artifacts or distortions in the synthesized image.

[0003] Human hand image synthesis is essentially a generative task, typically implemented with one of two approaches: generative adversarial networks (GANs) or diffusion models. GANs complete the entire synthesis in a single forward inference step and therefore sample quickly, but this single-step approach is poorly suited to the complex joints of the human hand. The more recent diffusion models synthesize images through a forward noising process followed by a reverse process of iterative denoising. While the feasibility of diffusion models has been demonstrated in human image synthesis, their computational and time costs are prohibitive, especially for super-resolution images. Therefore, effectively combining GANs and diffusion models to synthesize realistic new-view images of human hands at an acceptable computational cost is a pressing problem in this field.

Summary of the Invention

[0004] The main problem that the invention aims to solve is to synthesize realistic new-view human hand images from single-view input images using a novel method, while ensuring that the human hand structure and appearance texture are consistent across different viewpoints.

[0005] To address the above technical problem, this invention proposes a new-view human hand image synthesis method based on a generative model. The technical solution includes the following steps:

[0006] Step 1: Prepare paired training data, which includes input images and target images;

[0007] Step 2: Design a normal map estimation network for estimating normal maps, and use the paired training data prepared in Step 1 to estimate the normal map corresponding to the target image using the designed normal map estimation network. The normal map estimation network is trained in an end-to-end manner and will be used as a normal map estimator to be applied offline to the entire process of view synthesis.

[0008] Step 3: Pre-train a network based on a diffusion model to synthesize low-resolution new perspective images of a human hand: Use the normal map corresponding to the target image estimated in Step 2 as a condition, and use the diffusion model to synthesize the target perspective based on the given input image. In order to ensure the efficiency of perspective synthesis, this network is used to synthesize low-resolution images.

[0009] Step 4: Pre-train a super-resolution module based on a generative adversarial network, enabling the module to improve the resolution and quality of low-resolution images.

[0010] Step 5: Cascade the generative adversarial network after the diffusion model and jointly train the two: Joint training of these two different modules allows them to be applied to the human hand image synthesis task. Specifically, the low-resolution image obtained in Step 3 is fed to the super-resolution module pre-trained in Step 4, which improves its resolution and quality and yields the desired target image.

[0011] Furthermore, the preparation of paired training data described in step 1 involves organizing and selecting paired data from the open-source multi-view hand datasets Interhand2.6M and Hand4K, including input images and target images.

[0012] Furthermore, the normal map estimation network described in step 2 consists of an encoder and a decoder, used to estimate the normal map corresponding to the target image. For feature maps with the same scale between the encoder and decoder, residual connections are used for concatenation. Both the encoder and decoder consist of 5 residual blocks, and LeakyReLU is used as the activation function after each layer. The designed normal map estimation network has low-resolution images as input and output, i.e., 64×64, and the training process is supervised by the following loss function:

[0013]

[0014] Here, $\mathcal{L}_{nor}$ denotes the loss function used to train the normal map estimation network. It is evaluated on the predicted normal map against the ground-truth normal map; the superscript $L$ indicates a low-resolution image.

[0015] Furthermore, the diffusion model used in step 3 has a generation process that includes two parts: a noise addition process and a noise removal process. The noise addition process refers to adding Gaussian noise to the image, while the noise removal process is to gradually remove noise in an iterative manner to synthesize the target image.
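As an illustration of the noise addition process described above, the following minimal sketch noises an image to an arbitrary step in closed form. The linear beta schedule, its endpoints, and the helper names `make_alpha_bars` and `add_noise` are assumptions made for this example; the text does not specify a noise schedule.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, t, alpha_bars, rng):
    """Noise a flat list of pixel values to step t; returns (noisy, noise)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(a) * p + math.sqrt(1.0 - a) * e
             for p, e in zip(pixels, noise)]
    return noisy, noise


if __name__ == "__main__":
    bars = make_alpha_bars(1000)
    noisy, _ = add_noise([0.5] * 4, 999, bars, random.Random(0))
    print(noisy)  # at the last step the signal is almost entirely noise
```

The noise removal process runs in the opposite direction: a trained noise predictor is applied step by step, starting from the noisiest step, to recover the image.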

[0016] Further, step 3 specifically involves using the normal map estimated by the normal map estimation network, which is consistent with the target image, as one of the conditions for the generation process; adding Gaussian noise to the target image to obtain a noise map; extracting the corresponding feature information from the input images in the paired training data using a feature encoder network, and using this as another condition in the generation process; then, using the two conditions obtained above, iteratively denoising the noise map through a UNet-structured neural network to synthesize a low-resolution new perspective human hand image; this process can be supervised by the following loss function:

[0017]

[0018] Here, the condition term collects the conditions required during synthesis: the predicted normal map, which controls the consistency of the hand structure across viewpoints, and the low-resolution input image $x^{L}$, which controls the consistency of the hand appearance across viewpoints. "Energy function" is a general term used here to indicate the training process of the diffusion model; $t$ denotes the time step; $w_t$ denotes the weight defined by the time step and is set to 1; $\epsilon$ denotes the noise; $\epsilon_{\phi}$ denotes the noise predictor, i.e., the UNet-structured network used in the denoising process; the remaining term denotes the noisy image; and the superscript $L$ denotes a low-resolution image, i.e., 64×64.
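The loss formula of paragraph [0017] is not reproduced in this text. One form consistent with the symbol definitions above is the standard weighted noise-prediction objective sketched below. The loss name $\mathcal{L}_{diff}$, the symbols $c$, $\hat{n}^{L}$ and $y^{L}_{t}$, and the squared L2 norm are assumptions made for illustration, and $\mathbb{E}$ is read here as the operator the text calls the energy function.

```latex
\mathcal{L}_{\mathrm{diff}}
  = \mathbb{E}_{t,\, y^{L},\, c,\, \epsilon}
    \left[ w_t \,\bigl\lVert \epsilon - \epsilon_{\phi}\!\left(y^{L}_{t},\, t,\, c\right) \bigr\rVert_2^{2} \right],
\qquad c = \{\hat{n}^{L},\, x^{L}\},\quad w_t = 1
```

Here $y^{L}_{t}$ stands for the low-resolution target image after noise has been added up to step $t$, and $\hat{n}^{L}$ for the predicted normal map.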

[0019] Specifically, the feature information extracted from the input image is applied to the self-attention modules at resolutions 16 and 8 in the UNet. Let $F$ be the output of the previous UNet layer and $F_m$ the feature information obtained after the input image passes through the feature encoder network. The output of the self-attention layer is given by the following formulas:

[0020] $Q = \mathrm{Con}(\mathrm{Nor}(F)),\quad K = \mathrm{Con}(F_m),\quad V = \mathrm{Con}(F_m)$

[0021] $F_o = \mathrm{Con}(\mathrm{softmax}(F_{attn})\,V) + F$

[0022] Here, $Q$ denotes the query in the self-attention layer, $K$ the key, and $V$ the value; $F_{attn}$ denotes the intermediate result of the self-attention layer and $F_o$ its output; $C$ denotes a constant and is set to 0; softmax(·) denotes the softmax function; Con(·) denotes a 1D convolutional layer; and Nor(·) denotes a normalization layer, for which GroupNorm is selected.
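The attention formulas above can be sketched in plain Python on small token matrices. Several points are assumptions rather than details from the text: the intermediate result is taken to be $F_{attn} = QK^{\top}/\sqrt{d} + C$ (its defining equation is not given), Con(·) is modelled as a per-token linear map, and Nor(·) is replaced by a simple per-token standardisation standing in for GroupNorm. The function names are hypothetical.

```python
import math


def matmul(a, b):
    """Matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def normalize_tokens(f, eps=1e-5):
    """Stand-in for Nor(.): standardise each token over its channels."""
    out = []
    for row in f:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def attend(f, f_m, w_q, w_k, w_v, w_o, c=0.0):
    """F_o = Con(softmax(F_attn) V) + F, with Q from F and K, V from F_m."""
    q = matmul(normalize_tokens(f), w_q)   # Q = Con(Nor(F))
    k = matmul(f_m, w_k)                   # K = Con(F_m)
    v = matmul(f_m, w_v)                   # V = Con(F_m)
    scale = 1.0 / math.sqrt(len(q[0]))
    # Assumed form of the intermediate result: scaled dot product plus constant C.
    f_attn = [[s * scale + c for s in row] for row in matmul(q, transpose(k))]
    mixed = matmul(matmul(softmax_rows(f_attn), v), w_o)
    # Residual connection back to the UNet features F.
    return [[m + x for m, x in zip(mr, fr)] for mr, fr in zip(mixed, f)]
```

Because the keys and values come from the input-image features $F_m$ while the queries come from the UNet features $F$, each spatial location of the synthesized view can pull texture information from the input view, and the residual term keeps the original UNet features intact.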

[0023] Further, step 4 specifically involves designing a super-resolution module based on a generative adversarial network (GAN). This module consists of a generator and a discriminator. The generator comprises convolutional layers with residual connections, using InstanceNorm2d and ReLU as normalization and activation functions, respectively. The discriminator also consists of convolutional layers, but uses LeakyReLU as the activation function. A sigmoid layer is appended as the last layer of the discriminator to predict a probability between 0 and 1, used to distinguish whether the input image is real or synthetic. Additionally, the designed super-resolution module is conditioned on the predicted normal map, which helps correct implausible images during the synthesis process and allows the module to be cascaded with the diffusion model through the predicted normal map. The designed GAN is supervised using the following loss function:

[0024]

[0025] Here, the first two symbols, the second being $y^{H}$, denote the normal map corresponding to the target image and the high-resolution target image, respectively. The remaining symbols denote, in turn, the predicted normal map, the low-resolution image synthesized by the diffusion model, the generator, the discriminator, and the loss function used to train the generator and discriminator. $\mathbb{E}$ is a general term for energy functions; the two $\mathbb{E}$ terms both indicate the training process of the discriminator, the former discriminating real results and the latter discriminating synthesized results.

[0026] In addition to the loss function above, a reconstruction loss function is also used to supervise this process, requiring the generator's synthesized result to remain consistent with the ground truth.
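The loss formulas of paragraphs [0024] and [0026] are not reproduced in this text. A conventional conditional GAN objective that matches the symbol descriptions above would read as follows. The symbols $n$, $\hat{n}$, $\hat{y}^{L}$, $G$, $D$, the loss names, and the choice of an L1 reconstruction norm are assumptions made for illustration.

```latex
\mathcal{L}_{\mathrm{adv}}
  = \mathbb{E}_{n,\, y^{H}}\bigl[\log D(n,\, y^{H})\bigr]
  + \mathbb{E}_{\hat{n},\, \hat{y}^{L}}\bigl[\log\bigl(1 - D(\hat{n},\, G(\hat{n},\, \hat{y}^{L}))\bigr)\bigr]

\mathcal{L}_{\mathrm{rec}}
  = \bigl\lVert G(\hat{n},\, \hat{y}^{L}) - y^{H} \bigr\rVert_{1}
```

In this reading, the first expectation scores real pairs of normal map and high-resolution target image, and the second scores pairs in which the high-resolution image was produced by the generator from the diffusion model's low-resolution output.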

[0027] Compared with the prior art, the present invention has the following advantages and beneficial effects:

[0028] 1) The normal map is used as a condition in the synthesis process and applied to the task of synthesizing human hand images from new perspectives. The generative network based on the diffusion model built in this invention can synthesize realistic human hand images from new perspectives under given conditions, and the structure and appearance texture of the human hand can remain consistent across different perspectives;

[0029] 2) This invention designs a super-resolution module based on generative adversarial networks to further improve the quality of images synthesized by the diffusion model. This module also uses normal maps as conditions, thus being cascaded with the diffusion model. This not only helps to improve the quality of the synthesized image, but also helps to save computational overhead and time costs.

[0030] 3) To the best of our knowledge, the new-view human hand image synthesis method based on a generative model proposed in this invention is the first framework for human hand image synthesis and achieves satisfactory synthesis results.

Description of the Drawings

[0031] Figure 1 is a flowchart of the new-view human hand image synthesis method in an embodiment of the present invention.

[0032] Figure 2 is a structural diagram of the normal map estimation network designed in this embodiment of the invention.

[0033] Figure 3 is a structural diagram of the low-resolution image synthesis network based on a diffusion model in an embodiment of the present invention.

[0034] Figure 4 is a structural diagram of the super-resolution module based on generative adversarial networks in an embodiment of the present invention.

[0035] Figure 5 is the final neural network structure diagram constructed in the embodiments of the present invention.

[0036] Figure 6 shows the new-view human hand image synthesis results achieved by embodiments of the present invention.

[0037] Figure 7 shows the super-resolution human hand image synthesis results achieved by embodiments of the present invention.

Detailed Implementation

[0038] The embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples, thereby summarizing the methods used and the effects achieved by the present invention, so that users can have a clearer understanding of the present invention. It is worth noting that, where there is no conflict, the features of the embodiments of the present invention can be combined with each other, and the resulting technical solutions are all within the protection scope of the present invention.

[0039] Furthermore, the flowchart shown in the accompanying drawings can be executed in a computer as a series of consecutive instructions, and the order of the flow can be modified appropriately in some cases.

[0040] Example

[0041] Figure 1 shows the flowchart of the new-view hand image synthesis method based on generative models in this embodiment of the invention. Each step is explained in detail below with reference to Figure 1.

[0042] Step 1: Prepare paired training data, which includes input and target images. Paired data, including input and target images, are selected from the open-source multi-view hand datasets Interhand2.6M and Hand4K. It's important to note that for the Interhand2.6M dataset, some of the multi-view images are grayscale. To ensure consistency between viewpoints, these grayscale images are discarded, retaining only the color images. For the Hand4K dataset, which contains not only single-hand images but also a small number of hand-object interaction images, these interaction images are retained during the preparation of the paired training data and used in the training process to verify the generative and generalization capabilities of this invention.
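The data preparation in Step 1 can be illustrated with a short sketch. The record layout (`frame`, `camera`, `is_color`) and the rule that an input image and a target image are two different views of the same frame are assumptions made for this example; the text only states that paired data are selected and that grayscale views are discarded.

```python
from collections import defaultdict
from itertools import permutations


def build_pairs(records):
    """Return (frame, input_camera, target_camera) triples from view records.

    Each record is a dict with hypothetical keys 'frame', 'camera', 'is_color'.
    """
    by_frame = defaultdict(list)
    for rec in records:
        if rec["is_color"]:                # discard grayscale views
            by_frame[rec["frame"]].append(rec["camera"])
    pairs = []
    for frame, cams in sorted(by_frame.items()):
        # Every ordered pair of distinct colour views of one frame is a sample.
        for src, dst in permutations(sorted(cams), 2):
            pairs.append((frame, src, dst))
    return pairs


if __name__ == "__main__":
    demo = [
        {"frame": 0, "camera": "a", "is_color": True},
        {"frame": 0, "camera": "b", "is_color": True},
        {"frame": 0, "camera": "c", "is_color": False},
    ]
    print(build_pairs(demo))
```

Frames that are left with a single colour view after filtering produce no pairs, since there is no second viewpoint to serve as the target.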

[0043] Step 2: Design a normal map estimation network and train it using the paired training data prepared in Step 1 to obtain a normal map consistent with the target viewpoint. The structure of the normal map estimation network is shown in Figure 2. The network consists of an encoder and a decoder and is trained end-to-end. Both the encoder and decoder are composed of 5 residual blocks, with LeakyReLU used as the activation function after each layer. The network uses low-resolution images (64×64) as both input and output. After training, its weights are fixed and it is used offline as a normal map estimator throughout the viewpoint synthesis process. To promote feature fusion, feature maps of the same scale in the encoder and decoder are concatenated through residual connections, which helps train a more robust normal map estimation network. The following loss function is used for supervision during training:

[0044]

[0045] Here, $\mathcal{L}_{nor}$ denotes the loss function used to train the normal map estimation network. It is evaluated on the predicted normal map against the ground-truth normal map; the superscript $L$ indicates a low-resolution image.
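Since the formula for $\mathcal{L}_{nor}$ is not reproduced in this text, the sketch below shows only one plausible way to compare a predicted normal map with its ground truth: a mean squared distance between unit-length per-pixel normals. The loss form, the normalisation step, and the function names are assumptions made for illustration.

```python
import math


def unit(v):
    """Scale a 3-vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v] if n > 0 else list(v)


def normal_loss(pred, target):
    """Mean squared L2 distance between per-pixel normals (assumed loss form).

    pred and target are equally long, non-empty lists of [x, y, z] normals,
    e.g. the 64*64 pixels of a low-resolution normal map flattened row by row.
    """
    total = 0.0
    for p, g in zip(pred, target):
        p, g = unit(p), unit(g)
        total += sum((a - b) ** 2 for a, b in zip(p, g))
    return total / len(pred)
```

With unit normals this quantity ranges from 0 for identical directions to 4 for exactly opposite directions.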

[0046] Step 3: Pre-train a diffusion-model-based network to synthesize low-resolution new-view hand images. The normal map estimated by the normal map estimation network in Step 2 is used as one condition for view synthesis to promote consistency of hand structure in the synthesized image. To keep the hand appearance texture consistent between views, the input image is used as an additional condition. For the target image in the paired training data, Gaussian noise is added to obtain a noise map; for the input image, the corresponding feature information is extracted by a feature encoder network and used as the other condition in the generation process. Next, the noise map and the estimated normal map are concatenated and fed into a UNet-structured neural network, which iteratively denoises the noise map to synthesize a low-resolution new-view hand image. During this process, the extracted feature information is injected into designated self-attention layers of the UNet so that the synthesized image and the input image share the same texture. In this invention, the extracted feature information is applied to the self-attention modules at resolutions 16 and 8. The process of synthesizing low-resolution hand images with the diffusion model is shown in Figure 3. Training of this model is supervised by the following loss function:

[0047]

[0048] Here, the condition term collects the conditions required during synthesis: the predicted normal map, which controls the consistency of the hand structure across viewpoints, and the low-resolution input image $x^{L}$, which controls the consistency of the hand appearance across viewpoints. "Energy function" is a general term used here to indicate the training process of the diffusion model; $t$ denotes the time step; $w_t$ denotes the weight defined by the time step and is set to 1; $\epsilon$ denotes the noise; $\epsilon_{\phi}$ denotes the noise predictor, i.e., the UNet-structured network used in the denoising process; the remaining term denotes the noisy image; and the superscript $L$ denotes a low-resolution image, i.e., 64×64.
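A single training step of the diffusion network can be sketched as follows, assuming the usual noise-prediction objective with a squared error and $w_t = 1$. The predictor is passed in as a plain callable standing in for the UNet $\epsilon_{\phi}$; the function name, the flat-list image representation, and the contents of `cond` are hypothetical.

```python
import math
import random


def noise_prediction_loss(target, cond, predictor, alpha_bars, rng, weight=1.0):
    """One sample of w_t * ||eps - eps_phi(noisy, t, cond)||^2, averaged over pixels.

    target     flat list of pixel values of the low-resolution target image
    cond       conditions (predicted normal map and input-image features)
    predictor  callable (noisy, t, cond) -> predicted noise, stand-in for the UNet
    alpha_bars cumulative signal-retention factors, one per time step
    """
    t = rng.randrange(len(alpha_bars))                 # random time step
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in target]      # eps
    noisy = [math.sqrt(a) * y + math.sqrt(1.0 - a) * e
             for y, e in zip(target, noise)]           # noise map
    predicted = predictor(noisy, t, cond)              # eps_phi
    return weight * sum((e - p) ** 2 for e, p in zip(noise, predicted)) / len(target)
```

In an actual training loop this value would be averaged over a batch and minimised with respect to the predictor's parameters; a predictor that recovers the injected noise exactly drives the loss to zero.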

[0049] Let $F$ be the output of the previous UNet layer and $F_m$ the feature information obtained after the input image passes through the feature encoder network. The output of the self-attention layer is given by the following formulas:

[0050] $Q = \mathrm{Con}(\mathrm{Nor}(F)),\quad K = \mathrm{Con}(F_m),\quad V = \mathrm{Con}(F_m)$

[0051] $F_o = \mathrm{Con}(\mathrm{softmax}(F_{attn})\,V) + F$

[0052] Here, $Q$ denotes the query in the self-attention layer, $K$ the key, and $V$ the value; $F_{attn}$ denotes the intermediate result of the self-attention layer and $F_o$ its output; $C$ denotes a constant and is set to 0; softmax(·) denotes the softmax function; Con(·) denotes a 1D convolutional layer; and Nor(·) denotes a normalization layer, for which GroupNorm is selected.

[0053] Step 4: Pre-train a super-resolution module based on a generative adversarial network (GAN), enabling the module to improve the resolution and quality of low-resolution images. Figure 4 shows the structure of the GAN. This module consists of a generator and a discriminator. The generator comprises convolutional layers with residual connections, using InstanceNorm2d and ReLU as normalization and activation functions, respectively. The discriminator also consists of convolutional layers, but uses LeakyReLU as the activation function. A sigmoid layer is appended as the last layer of the discriminator to predict a probability between 0 and 1, used to distinguish whether the input image is real or synthetic. Furthermore, the designed super-resolution module is also conditioned on the predicted normal map. This helps correct implausible images during the synthesis process and allows the super-resolution module to be cascaded with the diffusion model through the predicted normal map, saving computational overhead and time costs to some extent. The designed GAN is supervised by the following loss function:

[0054]

[0055] Here, the first two symbols, the second being $y^{H}$, denote the normal map corresponding to the target image and the high-resolution target image, respectively. The remaining symbols denote, in turn, the predicted normal map, the low-resolution image synthesized by the diffusion model, the generator, the discriminator, and the loss function used to train the generator and discriminator. $\mathbb{E}$ is a general term for energy functions; the two $\mathbb{E}$ terms both indicate the training process of the discriminator, the former identifying real results and the latter identifying synthesized results.

[0056] In addition to the loss function above, a reconstruction loss function is also used to supervise this process, requiring the generator's synthesized result to remain consistent with the ground truth.
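To make the supervision of the super-resolution module concrete, the sketch below evaluates an adversarial loss and a reconstruction loss from discriminator probabilities, which lie between 0 and 1 because of the sigmoid output. The binary cross-entropy form, the non-saturating generator term, the L1 reconstruction term, and `rec_weight` are assumptions made for this example, since the formulas are not reproduced in the text.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Cross-entropy loss pushing D towards 1 on real and 0 on synthesized images.

    d_real, d_fake: lists of discriminator probabilities in [0, 1].
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake, synthesized, target, rec_weight=1.0, eps=1e-12):
    """Adversarial term (non-saturating, assumed) plus L1 reconstruction term."""
    adv = -sum(math.log(p + eps) for p in d_fake) / len(d_fake)
    rec = sum(abs(s - t) for s, t in zip(synthesized, target)) / len(target)
    return adv + rec_weight * rec


if __name__ == "__main__":
    # An undecided discriminator outputs 0.5 everywhere.
    print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))
```

The reconstruction term ties the generator output to the high-resolution ground truth pixel by pixel, while the adversarial term rewards outputs that the discriminator scores as real.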

[0057] Step 5: Jointly train the diffusion model and the generative adversarial network, with the generative adversarial network cascaded after the diffusion model. Figure 5 shows the final network structure. Using the normal map predicted in Step 2 as a condition, the diffusion model and the generative adversarial network pre-trained in Steps 3 and 4 are cascaded together. After these two modules are further trained jointly, the low-resolution image obtained in Step 3 gains significantly in resolution and quality when passed through the super-resolution module, yielding the desired target image.
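The cascade of Step 5 can be summarised at inference time by the sketch below. The denoising step and the super-resolution generator are passed in as callables so the example stays self-contained; `upsample_nearest` is only a placeholder for the trained GAN generator, and all function names and signatures are hypothetical.

```python
def upsample_nearest(image, factor):
    """Nearest-neighbour upsampling of a 2D list; placeholder for the GAN generator."""
    return [[px for px in row for _ in range(factor)]
            for row in image for _ in range(factor)]


def synthesize_view(input_image, normal_map, init_noise,
                    denoise_step, super_resolve, num_steps):
    """Cascade: iterative low-resolution denoising, then super-resolution.

    Both stages are conditioned on the same normal map, which is estimated
    offline by the normal map estimation network before this function runs.
    """
    low_res = init_noise
    for t in reversed(range(num_steps)):            # t = num_steps-1, ..., 0
        low_res = denoise_step(low_res, t, normal_map, input_image)
    return super_resolve(low_res, normal_map)


if __name__ == "__main__":
    noise = [[1.0] * 64 for _ in range(64)]
    halve = lambda img, t, n, x: [[0.5 * v for v in row] for row in img]
    out = synthesize_view(None, None, noise, halve,
                          lambda img, n: upsample_nearest(img, 4), num_steps=3)
    print(len(out), len(out[0]))
```

Keeping the iterative diffusion stage at 64×64 and leaving the resolution increase to a single generator pass is what the text credits for the savings in computation and time.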

[0058] The experimental results of this embodiment are shown in Figure 6 and Figure 7. Figure 6 demonstrates the synthesis of new-view human hand images (256×256) from low-resolution input images (64×64): the first row shows the input single-view human hand images, the second row shows the normal maps used to guide the synthesis, the third row shows the images synthesized by the present invention, and the fourth row shows the ground truth of the target view. Figure 7 demonstrates the performance of the present invention on super-resolution human hand image synthesis: the first two rows show synthesized 1024×1024 human hand images, the third row shows synthesized 512×512 images, and the fourth row shows synthesized 256×256 images.

[0059] The technical means disclosed in this invention are not limited to those disclosed in the above embodiments, but also include technical solutions composed of any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of this invention, and these improvements and modifications are also considered within the scope of protection of this invention.

Claims

1. A new-view human hand image synthesis method based on a generative model, characterized in that it includes the following steps:
Step 1: Prepare paired training data, which includes input images and target images;
Step 2: Design a normal map estimation network for estimating normal maps, and use the paired training data prepared in Step 1 to estimate the normal map corresponding to the target image using the designed normal map estimation network. The normal map estimation network is trained in an end-to-end manner and will be used as a normal map estimator to be applied offline to the entire process of view synthesis.
Step 3: Pre-train a network based on a diffusion model to synthesize low-resolution new-view images of a human hand: Use the normal map corresponding to the target image estimated in Step 2 as a condition, and use the diffusion model to synthesize the target view based on the given input image. In order to ensure the efficiency of view synthesis, this network is used to synthesize low-resolution images.
Step 4: Pre-train a super-resolution module based on a generative adversarial network, enabling the module to improve the resolution and quality of low-resolution images.
Step 5: Jointly train the diffusion model and the generative adversarial network, and cascade the generative adversarial network after the diffusion model: Joint training of the diffusion model and the super-resolution module allows them to be applied to the human hand image synthesis task. Specifically, the low-resolution image obtained in Step 3 is fed to the super-resolution module pre-trained in Step 4 to improve its resolution and quality and finally obtain the desired target image.

2. The novel perspective human hand image synthesis method based on a generative model according to claim 1, characterized in that: Step 1 involves preparing paired training data by organizing and selecting paired data from the open-source multi-view hand datasets Interhand2.6M and Hand4K, including input images and target images.

3. The novel perspective human hand image synthesis method based on a generative model according to claim 1, characterized in that: The normal map estimation network in step 2 consists of an encoder and a decoder, used to estimate the normal map corresponding to the target image. For feature maps of the same scale between the encoder and decoder, residual connections are used for concatenation. Both the encoder and decoder consist of 5 residual blocks, and LeakyReLU is used as the activation function after each layer. The designed normal map estimation network has low-resolution images as input and output, i.e., 64×64, and the training process is supervised by a loss function $\mathcal{L}_{nor}$, which denotes the loss function used to train the normal map estimation network and is evaluated on the predicted normal map against the ground-truth normal map; the superscript $L$ indicates a low-resolution image.

4. The novel perspective human hand image synthesis method based on a generative model according to claim 1, characterized in that: The diffusion model used in step 3 has two parts in its generation process: a noise-adding process and a noise-removing process. The noise-adding process refers to adding Gaussian noise to the image, while the noise-removing process is to gradually remove noise in an iterative manner to synthesize the target image.

5. The novel perspective human hand image synthesis method based on a generative model according to claim 1, characterized in that: Step 3 involves the following steps: First, the normal map estimated by the normal map estimation network, which is consistent with the target image, is used as one of the conditions for the generation process. Second, Gaussian noise is added to the target image to obtain a noise map. Third, for the input images in the paired training data, corresponding feature information is extracted through a feature encoder network and used as another condition in the generation process. Finally, using the two conditions obtained above, a UNet-structured neural network iteratively denoises the noise map to synthesize a low-resolution new-view image of the human hand. This process can be supervised by a loss function in which the condition term collects the conditions required during synthesis: the predicted normal map, which controls the consistency of the hand structure across viewpoints, and the low-resolution input image $x^{L}$, which controls the consistency of the hand appearance across viewpoints; "energy function" is a general term used here to indicate the training process of the diffusion model; $t$ denotes the time step; $w_t$ denotes the weight defined by the time step and is set to 1; $\epsilon$ denotes the noise; $\epsilon_{\phi}$ denotes the noise predictor, i.e., the UNet-structured network used in the denoising process; the remaining term denotes the noisy image; and the superscript $L$ denotes a low-resolution image, i.e., 64×64. Specifically, the feature information extracted from the input image is applied to the self-attention modules at resolutions 16 and 8 in the UNet. Let $F$ be the output of the previous UNet layer and $F_m$ the feature information obtained after the input image passes through the feature encoder network. The output of the self-attention layer is given by the following formulas: $Q = \mathrm{Con}(\mathrm{Nor}(F))$, $K = \mathrm{Con}(F_m)$, $V = \mathrm{Con}(F_m)$, and $F_o = \mathrm{Con}(\mathrm{softmax}(F_{attn})\,V) + F$, where $Q$ denotes the query in the self-attention layer, $K$ the key, and $V$ the value; $F_{attn}$ denotes the intermediate result of the self-attention layer and $F_o$ its output; $C$ denotes a constant and is set to 0; softmax(·) denotes the softmax function; Con(·) denotes a 1D convolutional layer; and Nor(·) denotes a normalization layer, for which GroupNorm is selected.

6. The novel perspective human hand image synthesis method based on a generative model according to claim 1, characterized in that: Step 4 involves designing a super-resolution module based on a generative adversarial network (GAN). This module consists of a generator and a discriminator. The generator comprises convolutional layers with residual connections, using InstanceNorm2d and ReLU as normalization and activation functions, respectively. The discriminator also consists of convolutional layers, but uses LeakyReLU as the activation function. A sigmoid layer is appended as the last layer of the discriminator to predict a probability between 0 and 1, used to distinguish whether the input image is real or synthetic. Furthermore, the designed super-resolution module is conditioned on the predicted normal map, which helps correct implausible images during the synthesis process and allows the module to be cascaded with the diffusion model through the predicted normal map. The designed GAN is supervised using a loss function in which the first two symbols, the second being $y^{H}$, denote the normal map corresponding to the target image and the high-resolution target image, respectively, and the remaining symbols denote, in turn, the predicted normal map, the low-resolution image synthesized by the diffusion model, the generator, the discriminator, and the loss function used to train the generator and discriminator. $\mathbb{E}$ is a general term for energy functions; the two $\mathbb{E}$ terms both indicate the training process of the discriminator, the former identifying real results and the latter identifying synthesized results. In addition to the loss function above, a reconstruction loss function is also used to supervise this process, requiring the generator's synthesized result to remain consistent with the ground truth.