A surface normal estimation method, device, equipment and medium of a target object
By combining preprocessing of flash and non-flash images with network structure, the shape-reflectivity ambiguity problem of surface normal estimation under single image input is solved, achieving high-precision normal estimation and detail recovery under complex lighting conditions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2025-01-21
- Publication Date
- 2026-06-19
AI Technical Summary
Existing deep learning-based surface normal estimation methods struggle to handle objects with rich detail and blurred images. Single image inputs are easily affected by shadows and ambient lighting conditions, making it difficult to resolve shape-reflectivity ambiguities.
By combining flash and non-flash images, and through a network structure consisting of a preprocessing module, an encoder module, a diffusion denoising module, and a decoder module, noise is removed and surface details of the object are extracted using diffusion priors and Gaussian noise training, generating a high-quality normal map.
The method accurately estimates the surface normals of target objects under complex lighting conditions, with a significant advantage on detailed surfaces, improving the accuracy of normal maps and their ability to represent details.
Figure CN120147395B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer vision technology, and in particular to a method, apparatus, device and medium for estimating the surface normal of a target object.
Background Technology
[0002] Surface normal maps are a 2.5-dimensional representation of geometry, and high-quality surface normals are essential for various computer vision tasks, such as 3D shape reconstruction, augmented reality applications, and advanced rendering techniques. However, creating detailed surface normal maps typically requires either high-precision 3D scanning or collecting controlled-light images in a darkroom for reconstruction using photometric stereo methods—both of which are cumbersome and resource-intensive.
[0003] Recent advances in deep learning-based surface normal estimation largely rely on a single RGB image as input to estimate the corresponding surface normal map. While popular, these methods often struggle with richly detailed and blurry objects due to the inherent shape-reflectivity ambiguity of a single image input. Specifically, a single image may be affected by additional shadows that obscure object details, and ambient lighting conditions during capture may not fully reveal the scene's true geometry.
Summary of the Invention
[0004] To address the aforementioned technical problems, this application provides a method, apparatus, device, and medium for estimating the surface normal of a target object. By using flash and non-flash images and diffusion priors, the estimation quality of the surface normal is improved, and shape-reflectivity ambiguity is reduced, thereby enabling more accurate reconstruction of the detailed object surface.
[0005] The first aspect of this application provides a method for estimating the surface normal of a target object, the method comprising: acquiring a first image of the target object under flash lighting conditions and a second image under non-flash lighting conditions;
[0006] Preprocess the first image and the second image;
[0007] The preprocessed image is input into a trained preset network for normal estimation to obtain the normal map of the target object.
[0008] The preset network includes an encoder module, a diffusion denoising module, and a decoder module.
[0009] In some embodiments of this application, the preprocessing of the first image and the second image includes:
[0010] Obtain the mask images of the first image and the second image;
[0011] The target object is cropped based on the mask image to obtain an image corresponding to the main body of the target object;
[0012] Based on the image corresponding to the main body, a bounding box is calculated and generated to define the region of the main body. The image defined by the bounding box is used as the first image and the second image after preprocessing.
[0013] In some embodiments of this application, the step of inputting the preprocessed image into a trained preset network for normal estimation to obtain the normal map of the target object includes:
[0014] The preprocessed first image and the preprocessed second image are respectively input into the encoder module to extract latent features, thereby obtaining a first feature map corresponding to the first image and a second feature map corresponding to the second image;
[0015] The first feature map, the second feature map, and the Gaussian noise map are merged along the channel dimension to obtain a merged feature map;
[0016] The merged feature map is sequentially input into the diffusion denoising module and the decoder module to obtain the normal map of the target object.
[0017] In some embodiments of this application, the step of sequentially inputting the merged feature map into the diffusion denoising module and the decoder module to obtain the normal map of the target object includes:
[0018] The merged feature map is input into the diffusion denoising module, which gradually removes noise from the merged feature map to obtain a normal map feature with object surface details.
[0019] The normal map features are input into the decoder module, which then uses a deconvolution operation to restore the normal map features to the normal map of the target object.
[0020] In some embodiments of this application, the training method of the preset network includes:
[0021] Acquire synthetic and real datasets, wherein the synthetic dataset includes computer-generated images with virtual target objects, and the real dataset includes photographed images of real objects;
[0022] The synthetic dataset and the real dataset are preprocessed to obtain flash images and non-flash images suitable for input.
[0023] The preset network is trained based on the flash image and the non-flash image.
[0024] In some embodiments of this application, the training method for the preset network further includes:
[0025] When training the diffusion denoising module, Gaussian noise is added to the real normal map to simulate different levels of noise interference, so that the diffusion denoising module can learn to recover the real normal map from the noise.
[0026] In some embodiments of this application, after obtaining the normal map of the target object, the method further includes: performing three-dimensional reconstruction based on the normal map of the target object.
[0027] A second aspect of this application provides a surface normal estimation device for a target object, the device comprising:
[0028] The acquisition module is used to acquire a first image of the target object under flash lighting conditions and a second image under non-flash lighting conditions;
[0029] The processing module is used to preprocess the first image and the second image;
[0030] The estimation module is used to input the preprocessed image into a trained preset network to estimate the normals and obtain the normal map of the target object.
[0031] The preset network includes an encoder module, a diffusion denoising module, and a decoder module.
[0032] A third aspect of this application provides an electronic device, including a memory and a processor. The memory stores computer-readable instructions, which, when executed by the processor, cause the processor to perform the surface normal estimation method for a target object as described in various embodiments of this application.
[0033] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the surface normal estimation method for a target object as described in various embodiments of this application.
[0034] The technical solutions provided in this application embodiment have at least the following technical effects or advantages:
[0035] The surface normal estimation method for the target object described in various embodiments of this application acquires a first image of the target object under flash lighting conditions and a second image under non-flash lighting conditions. The first and second images are preprocessed, and the preprocessed images are input into a trained preset network for normal estimation to obtain the normal map of the target object. The preset network includes an encoder module, a diffusion denoising module, and a decoder module. Thus, by combining flash and non-flash lighting images and utilizing a diffusion prior denoising method, this method can accurately estimate the surface normals of the target object under complex lighting conditions, and it has a significant advantage, especially for the estimation of normals on detailed surfaces.
[0036] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.
Attached Figure Description
[0037] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit the scope of this application. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings:
[0038] Figure 1 is a schematic diagram illustrating the steps of a surface normal estimation method for a target object in an exemplary embodiment of this application;
[0039] Figure 2 is a schematic diagram of the structure of a preset network in an exemplary embodiment of this application;
[0040] Figure 3 is a schematic diagram of a shooting device in an exemplary embodiment of this application;
[0041] Figure 4 is a schematic diagram of a real dataset captured in an exemplary embodiment of this application;
[0042] Figure 5 is a schematic diagram of a normal estimation process using a preset network in an exemplary embodiment of this application;
[0043] Figure 6 is a schematic diagram of the structure of a surface normal estimation device for a target object according to an exemplary embodiment of this application;
[0044] Figure 7 is a schematic diagram of the structure of an electronic device provided in an exemplary embodiment of this application.
[0045] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.
Detailed Implementation
[0046] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the embodiments depicted herein are for illustrative purposes only and are not intended to limit the invention. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.
[0047] Existing technologies include numerous deep learning-based surface normal estimation methods; however, most of these methods use a single RGB image as input and estimate the corresponding surface normal map. Due to the inherent shape-reflectivity ambiguity of a single image input, they often struggle to handle objects with rich detail and blurred features. Specifically, a single image may be affected by additional shadows that obscure object details, and ambient lighting conditions during capture may not fully reveal the true geometry of the scene.
[0048] Therefore, some embodiments of this application provide a method for estimating the surface normal of a target object. As shown in Figure 1, the method includes steps S1-S3.
[0049] S1. Acquire the first image of the target object under flash lighting conditions and the second image under non-flash lighting conditions;
[0050] S2. Preprocess the first image and the second image;
[0051] S3. Input the preprocessed image into the trained preset network for normal estimation to obtain the normal map of the target object. As shown in Figure 2, the preset network includes an encoder module, a diffusion denoising module, and a decoder module.
[0052] In one possible implementation, the first image of the target object under flash conditions and the second image under non-flash conditions can be acquired using the imaging device illustrated in Figure 3, which includes a camera phone with a flash. To accurately acquire image information under different lighting conditions, the relative position between the real-world object (the target object) and the camera phone is first adjusted. Then, the first image is acquired with the flash on, and the second image is acquired with the flash off. The preprocessed first image (flash image) and second image (non-flash image) are each input into the encoder module, and their latent features are extracted and merged. This fuses the flash and non-flash image information and exploits their complementary characteristics for surface normal estimation. The merged features provide richer input for subsequent normal estimation, helping the algorithm capture object surface details more accurately under different lighting conditions. By merging the two sets of image features, the network can balance detail capture against noise removal, improving the quality of the final estimated normal map and avoiding the local distortion or loss of detail that may arise with a single image.
[0053] The setup in Figure 3 also includes a 3D scanner for scanning real-world objects and generating a 3D mesh, so as to obtain a precise mesh representation of the target object. The data collected by the 3D scanner is used for normal map alignment to obtain aligned normal maps of the real object's surface, which serve as a reference for the preset network on real data (they can be used as labels). The preset network needs to be trained in advance. To train it, a training dataset and a test dataset are first collected. This application uses synthetic data as the training dataset and a real dataset as the test dataset. In one possible implementation, the training method for the preset network includes: acquiring a synthetic dataset and a real dataset, wherein the synthetic dataset includes computer-generated images with virtual target objects, and the real dataset includes images of real objects taken by camera; preprocessing the synthetic dataset and the real dataset to obtain flash images and non-flash images suitable for input; and training the preset network based on the flash images and the non-flash images.
[0054] Specifically, the synthetic data was acquired with Blender. Blender was used to render the object under flash and non-flash conditions, together with its ground-truth surface normal maps, as synthetic training data. More specifically, the Cycles renderer within Blender was used to generate the synthetic data. Blender is a powerful open-source 3D computer graphics software, and its Cycles rendering engine supports real-time rendering. In this application, Blender was used to synthesize the training data. By placing the photographed object in 3D space, adding ambient light sources, and setting camera intrinsic and extrinsic parameters, the results of capturing images using an RGB camera in a real-world scene were simulated. At the same time, the surface normal information of the object was also obtained using this software. As shown in Figure 3, images are captured using a Canon camera as real test data, and a 3D scanner is used to scan real objects to obtain their three-dimensional mesh digital representation. After that, manual alignment can be performed to obtain a normal map of the real object's surface. Figure 4 is a schematic diagram of a real dataset captured in an exemplary embodiment of this application. As shown in Figure 4, this application photographed and aligned dozens of objects. This test set can be used as a basic test set for normal estimation algorithms, enabling a better evaluation of their quality.
[0055] In one possible implementation, the preprocessing of the first image and the second image includes: obtaining mask images of the first image and the second image; cropping the target object based on the mask images to obtain an image corresponding to the main body of the target object; calculating and generating bounding boxes based on the image corresponding to the main body to define the region of the main body, for example, calculating the bounding boxes of the object in four directions in the image using the non-zero pixel value coordinates of the object boundary position in the mask image, thereby providing the object region for further image processing; and using the bounding box-defined image as the preprocessed first image and the preprocessed second image.
[0056] It is important to note that during the training phase, for the synthetic data, the object's mask image can be obtained from the Blender renderer. After the mask image is obtained, dividing its pixel values by 255 yields a mask image with pixel values of 0 or 1. Multiplying this mask image by the RGB image yields an RGB image containing only the object's main body. Furthermore, the bounding box can be obtained from the coordinates of the non-zero positions of the object's boundaries in the mask image. Cropping the output image from the previous step to the bounding box yields the synthetic data with the object's main body enlarged, which is used as input during the training of the preset network. Since the real data lacks a mask, a mask image of the real data can instead be obtained by performing a cutout (matting) operation on the RGB image.
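The mask normalization, masking, and bounding-box crop described above can be sketched roughly as follows. This is an illustrative NumPy example rather than the applicant's implementation; the function name `crop_to_subject`, the array shapes, and the 0/255 mask encoding are assumptions made for the sketch.

```python
import numpy as np

def crop_to_subject(rgb, mask_u8):
    """Mask out the background and crop to the subject's bounding box.

    rgb:     H x W x 3 float array (assumed layout)
    mask_u8: H x W uint8 array, 0 for background and 255 for the subject (assumed)
    """
    mask = mask_u8.astype(np.float32) / 255.0   # pixel values become 0 or 1
    subject = rgb * mask[..., None]             # keep only the object's main body
    ys, xs = np.nonzero(mask)                   # coordinates of non-zero mask pixels
    if ys.size == 0:
        raise ValueError("mask contains no subject pixels")
    top, bottom = int(ys.min()), int(ys.max())
    left, right = int(xs.min()), int(xs.max())
    crop = subject[top:bottom + 1, left:right + 1]
    return crop, (top, bottom, left, right)

# Tiny synthetic example: a 3 x 4 subject inside a 6 x 8 image.
rgb = np.ones((6, 8, 3), dtype=np.float32)
mask = np.zeros((6, 8), dtype=np.uint8)
mask[2:5, 3:7] = 255
crop, bbox = crop_to_subject(rgb, mask)
```

In practice the same bounding box would presumably be applied to both the flash image and the non-flash image so that the two crops stay pixel-aligned.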
[0057] In one possible implementation, the preprocessed image is input into a trained preset network for normal estimation to obtain the normal map of the target object. This includes: inputting the preprocessed first image and the preprocessed second image into the encoder module to extract latent features, obtaining a first feature map corresponding to the first image and a second feature map corresponding to the second image; merging the first feature map, the second feature map, and the Gaussian noise map along the channel dimension to obtain a merged feature map; inputting the merged feature map into the diffusion denoising module, which progressively removes noise from the merged feature map to obtain normal map features with object surface details; and inputting the normal map features into the decoder module, which uses a deconvolution operation to restore the normal map features to the normal map of the target object.
[0058] It should be noted that merging the first feature map, the second feature map, and the Gaussian noise map along the channel dimension allows the first and second feature maps to guide the denoising process of the Gaussian noise map. This enables the model to simultaneously learn to identify and remove noise when performing normal map estimation tasks. Therefore, the use of the Gaussian noise map not only simulates noise situations that may be encountered in the real world, but more importantly, it provides the model with specific guidelines or paths to follow when dealing with such noise.
[0059] Preferably, referring to Figure 5, the encoder module is based on a variational autoencoder (VAE) structure, which can compress the input image into a low-dimensional latent space. As shown in Figure 5, the preprocessed first image (flash image, $C^{(f)}$ in Figure 5) and second image (non-flash image, $C^{(nf)}$ in Figure 5) are input into the variational autoencoder, and feature extraction is performed on each image to obtain the corresponding latent feature maps. The two feature maps are then merged along the channel dimension, which can be represented by formula (1):
[0060] $z^{(fnf)} = \mathrm{concat}\left(\varepsilon(C^{(f)}),\ \varepsilon(C^{(nf)})\right)$    Formula (1)
[0061] where concat represents merging along the channel dimension, $z^{(fnf)}$ is the resulting flash and non-flash feature, i.e., the merged feature map, and $\varepsilon(C^{(f)})$ and $\varepsilon(C^{(nf)})$ represent the first feature map and the second feature map. After $z^{(fnf)}$ is obtained, Gaussian noise is merged with it to form the input to the diffusion denoising module. Preferably, the diffusion denoising module is based on a U-shaped network structure, and the merged feature map is input into the denoising U-shaped network shown in Figure 5, which progressively removes noise from the merged feature map to obtain normal map features with object surface details. This denoising U-shaped network is pre-trained and uses the diffusion prior to recover these details while removing noise step by step. Through progressive denoising, the diffusion denoising module not only preserves the detailed information of the surface structure but also improves the accuracy of the normal map by removing redundant noise. This step enhances the detail representation, making the normal maps estimated from flash and non-flash images more refined and clear, and mitigating the impact of reflection differences caused by different light sources on normal estimation.
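As a rough illustration of the channel-wise merging in formula (1) and the subsequent merge with Gaussian noise, the following NumPy sketch uses random arrays in place of real encoder outputs. The latent shape (4 channels at 32 x 32, channels first) and all variable names are assumptions; the source does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shape: 4 channels at 32 x 32, channels-first layout.
z_f = rng.standard_normal((4, 32, 32))    # stands in for the encoded flash image
z_nf = rng.standard_normal((4, 32, 32))   # stands in for the encoded non-flash image

# Formula (1): merge the two feature maps along the channel dimension.
z_fnf = np.concatenate([z_f, z_nf], axis=0)

# Merge a Gaussian noise map with z_fnf to form the denoiser input.
noise = rng.standard_normal((4, 32, 32))
unet_input = np.concatenate([z_fnf, noise], axis=0)
```

Because the image features occupy fixed channels of the input, the denoiser sees them unchanged at every step, which is one way to read the statement that the feature maps guide the denoising of the noise map.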
[0062] It should be noted that during the training of the diffusion denoising module, Gaussian noise is added to the true normal map to simulate different levels of noise interference, so that the module learns to recover the true normal map from the noise. In this way, the diffusion denoising module acquires the ability to remove noise, allowing it to maintain high accuracy even when facing complex noise sources (such as noise or blur that may occur during image capture). This training method enhances the algorithm's noise resistance, thereby improving the accuracy of object surface normal estimation. In practical applications noise is often unavoidable, so this training strategy makes the algorithm more adaptable to real-world environments.
[0063] For example, when training the denoising U-shaped network, this application obtains a noisy ground-truth latent normal feature map $z_t^{(y)}$ by adding Gaussian noise to the normal map of a real object, where t represents the current time step and the superscript y indicates that the variable is a normal map. $z_t^{(y)}$ is merged with $z^{(fnf)}$ to give the variable $z_t$, which is then input into the denoising U-shaped network; see formula (2):
[0064]
[0065] Subsequently, the denoising U-shaped network learns its parameters through backpropagation, using the mean squared error between the predicted noise and the actual noise as the loss function. During training, only the parameters of the denoising U-shaped network are updated; the encoder module uses pre-trained weights, which are not updated. Time steps are sampled from 1 to T.
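A single training step of this kind might look like the sketch below. It is not the applicant's code: the linear beta schedule, T = 1000, the latent shapes, the channel-wise concatenation used to form the network input, and the placeholder `dummy_unet` are all assumptions, since the source does not reproduce formula (2) or name a noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                                   # assumed number of time steps
betas = np.linspace(1e-4, 0.02, T)         # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(z0, t, eps):
    """Forward noising: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    a = np.sqrt(alpha_bar[t - 1])          # time steps run from 1 to T
    b = np.sqrt(1.0 - alpha_bar[t - 1])
    return a * z0 + b * eps

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def dummy_unet(z_t, t):
    """Placeholder for the denoising network: always predicts zero noise."""
    return np.zeros((4,) + z_t.shape[1:])

z_y = rng.standard_normal((4, 32, 32))     # stand-in for the ground-truth normal latent
z_fnf = rng.standard_normal((8, 32, 32))   # stand-in for the frozen encoder features

t = int(rng.integers(1, T + 1))            # sample a time step from 1 to T
eps = rng.standard_normal(z_y.shape)       # the actual noise
z_y_t = add_noise(z_y, t, eps)             # noisy normal latent at step t
z_t = np.concatenate([z_fnf, z_y_t], axis=0)

loss = mse(dummy_unet(z_t, t), eps)        # MSE between predicted and actual noise
```

In a real training loop the loss would be backpropagated through the denoising network only, with the encoder weights left frozen as the text describes.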
[0066] During inference, $z^{(fnf)}$ is merged with Gaussian noise and input into the denoising U-shaped network, as shown in formulas (3) and (4):
[0067]
[0068] where $\vec{z}_T^{(y)}$ is the predicted normal map at time step T: the arrow above z indicates that the variable is predicted, and the superscript y indicates that the variable is a normal map. $z_t$ represents the variable currently input into the U-shaped network at time step t. The denoising process is shown in formula (5):
[0069]
[0070] where $v_\theta(z_t, t)$ represents the velocity (rate) value predicted by the denoising U-shaped network for the input $z_t$ at time step t, and a and b are different weight coefficients. Denoising is completed over multiple iterations, as shown in formula (6):
[0071]
[0072] After $z_t$ is input into the denoising U-shaped network, the predicted noise at time step t is obtained. First, the estimate whose subscript t->0 indicates that the time step changes from t to 0 is calculated; the calculated noise is then added back to it, which yields the latent variable at time step t-1. After multi-step denoising, i.e., repeating the denoising T times, a clean predicted feature map is obtained, whose subscript 1->0 indicates that the time step changes from 1 to 0. This feature map is then sent to the decoder module for decoding.
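The iterative procedure above (estimate the clean latent from step t, add the predicted noise back to reach step t-1, repeat T times) can be illustrated with a deterministic, DDIM-style loop under velocity prediction. This is a sketch under stated assumptions, not the patented formulas (3) to (6), which are not reproduced in the source: the weights `a[t]` and `b[t]` are assumed to satisfy a² + b² = 1, the schedule is an assumed linear one, the conditioning on the flash and non-flash features is omitted for brevity, and `oracle_v` is a hypothetical stand-in for the trained network that returns the exact velocity for a known clean latent.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                                        # assumed number of denoising steps
betas = np.linspace(1e-4, 0.02, T)            # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
# Index 0 is the clean state (a = 1, b = 0); index t corresponds to time step t.
a = np.concatenate([[1.0], np.sqrt(alpha_bar)])
b = np.concatenate([[0.0], np.sqrt(1.0 - alpha_bar)])

z_clean = rng.standard_normal((4, 8, 8))      # hypothetical clean normal latent

def oracle_v(z_t, t):
    """Stand-in for the network: exact velocity v = a * eps - b * z0."""
    eps = (z_t - a[t] * z_clean) / b[t]
    return a[t] * eps - b[t] * z_clean

def denoise(z_T, predict_v):
    z = z_T
    for t in range(T, 0, -1):
        v = predict_v(z, t)
        z0_hat = a[t] * z - b[t] * v          # estimate of the clean latent (t -> 0)
        eps_hat = b[t] * z + a[t] * v         # estimate of the noise at step t
        z = a[t - 1] * z0_hat + b[t - 1] * eps_hat   # add noise back: step t - 1
    return z

z_T = rng.standard_normal(z_clean.shape)      # start from Gaussian noise
z_0 = denoise(z_T, oracle_v)
```

With an exact velocity predictor the loop returns the clean latent, which makes the sketch easy to check; a trained network would only approximate this behavior.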
[0073] It should be noted that the diffusion denoising module also includes a CLIP model. CLIP (Contrastive Language–Image Pretraining) is a multimodal model proposed by OpenAI that maps text and images to the same embedding space. The core idea of CLIP is to train the model through contrastive learning, enabling it to understand the semantic relationships between text and images. Specifically, the CLIP model is integrated into the denoising U-shaped network and uses an input text prompt (such as "object geometry" in Figure 5) to assist the normal estimation task, providing high-level semantic information within the denoising U-shaped network to guide the denoising process. The text prompt is input into the CLIP model, generating a text embedding vector. This vector is combined with the input feature map of the denoising U-shaped network as additional contextual information, helping the network better understand the geometric characteristics of the target object. Shape-reflectivity ambiguity is a common problem in normal estimation tasks, where the network may struggle to distinguish between the object's surface geometry and material reflectivity. Through the text prompt, the CLIP model provides explicit geometric information, guiding the denoising network to focus on the object's geometric characteristics rather than on the influence of material or lighting, thus mitigating this problem.
[0074] In a further implementation, the normal map features are input to the decoder module, which then uses a deconvolution operation to reconstruct the normal map features into the normal map of the target object. The normal map output by the decoder module (the decoder of the variational autoencoder in Figure 5) is the final predicted normal map of the target object; it contains rich surface details, and diffusion denoising reduces its shape-reflectivity ambiguity. Furthermore, after the normal map of the target object is obtained, 3D reconstruction based on it restores the 3D object shape from the normal map. With high-quality normal maps, the object's surface shape can be reconstructed accurately, providing precise 3D information for applications such as 3D modeling, virtual reality, and augmented reality. In particular, when normals are estimated from flash and non-flash images, the high-precision normal maps supply more realistic and detailed surface information, improving the accuracy and effectiveness of the 3D reconstruction.
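The source does not say how 3D reconstruction is performed from the normal map. As one possible illustration, the sketch below uses the classical Frankot-Chellappa Fourier integration, a technique swapped in here purely as an example and not named in the application. It assumes the normal convention n ∝ (-∂z/∂x, -∂z/∂y, 1) and periodic image boundaries; the function name `integrate_normals` is hypothetical.

```python
import numpy as np

def integrate_normals(normals):
    """Frankot-Chellappa integration of an H x W x 3 unit normal map.

    Assumes n is proportional to (-dz/dx, -dz/dy, 1) and periodic boundaries.
    Returns a zero-mean height map.
    """
    nz = np.clip(normals[..., 2], 1e-6, None)      # avoid division by zero
    p = -normals[..., 0] / nz                      # dz/dx
    q = -normals[..., 1] / nz                      # dz/dy
    h, w = p.shape
    u = 2.0 * np.pi * np.fft.fftfreq(w)[None, :]   # angular frequency along x
    v = 2.0 * np.pi * np.fft.fftfreq(h)[:, None]   # angular frequency along y
    denom = u ** 2 + v ** 2
    denom[0, 0] = 1.0                              # the mean height is unobservable
    Z = (-1j * u * np.fft.fft2(p) - 1j * v * np.fft.fft2(q)) / denom
    Z[0, 0] = 0.0
    return np.real(np.fft.ifft2(Z))

# Synthetic check: build normals from a known periodic surface and recover it.
h = w = 32
yy, xx = np.mgrid[0:h, 0:w]
z_true = np.sin(2 * np.pi * xx / w) + 0.5 * np.cos(4 * np.pi * yy / h)
p = (2 * np.pi / w) * np.cos(2 * np.pi * xx / w)
q = -0.5 * (4 * np.pi / h) * np.sin(4 * np.pi * yy / h)
n = np.stack([-p, -q, np.ones_like(p)], axis=-1)
n /= np.linalg.norm(n, axis=-1, keepdims=True)
z_rec = integrate_normals(n)
```

Real objects are not periodic, so a practical pipeline would likely need padding or a different integration method; the example only shows that a normal map carries enough information to recover shape up to a constant offset.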
[0075] In other embodiments of this application, a surface normal estimation device for a target object is provided. As shown in Figure 6, the device includes:
[0076] The acquisition module 601 is used to acquire a first image of the target object under flash lighting conditions and a second image under non-flash lighting conditions;
[0077] Processing module 602 is used to preprocess the first image and the second image;
[0078] The estimation module 603 is used to input the preprocessed image into a trained preset network for normal estimation to obtain the normal map of the target object;
[0079] The preset network includes an encoder module, a diffusion denoising module, and a decoder module.
[0080] The surface normal estimation device for the target object also includes an evaluation module. This module calculates the mean angular error (MAE) between the estimated normal vectors in the normal map output by the preset network and the true normal vectors, and uses it as a quantitative index of the accuracy of the normal estimation. By combining flash and non-flash images and utilizing a diffusion prior denoising method, this method can accurately estimate the surface normal of the target object under complex lighting conditions.
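The mean angular error mentioned above can be computed as in the following sketch. The function name, the H x W x 3 layout, the optional foreground mask, and reporting the result in degrees are assumptions; the source only states that the mean angular error between estimated and true normals is used.

```python
import numpy as np

def mean_angular_error_deg(pred, gt, mask=None):
    """Mean angle (in degrees) between predicted and ground-truth normals.

    pred, gt: H x W x 3 arrays of non-zero normal vectors
    mask:     optional H x W boolean array selecting object pixels
    """
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)  # guard against rounding
    angles = np.degrees(np.arccos(cos))
    if mask is not None:
        angles = angles[mask]
    return float(angles.mean())

# Example: every predicted normal is tilted 45 degrees away from the true normal.
gt = np.zeros((4, 4, 3))
gt[..., 2] = 1.0
pred = np.zeros((4, 4, 3))
pred[..., 0] = 1.0
pred[..., 2] = 1.0
err = mean_angular_error_deg(pred, gt)
```

Restricting the average to the object mask keeps background pixels, which have no meaningful normal, from skewing the score.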
[0081] Please refer to Figure 7, which shows a schematic diagram of an electronic device provided by some embodiments of this application. As shown in Figure 7, the electronic device 2 includes: a processor 200, a memory 201, a bus 202, and a communication interface 203. The processor 200, the communication interface 203, and the memory 201 are connected via the bus 202. The memory 201 stores a computer program that can run on the processor 200. When the processor 200 runs the computer program, it executes the surface normal estimation method for a target object provided in any of the foregoing embodiments of this application. The method includes: acquiring a first image of the target object under flash lighting conditions and a second image under non-flash lighting conditions; preprocessing the first image and the second image; inputting the preprocessed image into a trained preset network for normal estimation to obtain a normal map of the target object; wherein the preset network includes an encoder module, a diffusion denoising module, and a decoder module.
[0082] The memory 201 may include high-speed random access memory (RAM) or non-volatile memory, such as at least one disk storage device. Communication between this system network element and at least one other network element is achieved through at least one communication interface 203 (which can be wired or wireless), such as the Internet, wide area network, local area network, or metropolitan area network.
[0083] Bus 202 can be an ISA bus, PCI bus, or EISA bus, etc. The bus can be divided into an address bus, a data bus, a control bus, etc. The memory 201 is used to store programs. After receiving an execution instruction, the processor 200 executes the program. The surface normal estimation method disclosed in any of the foregoing embodiments of this application can be applied to the processor 200, or implemented by the processor 200.
[0084] The processor 200 may be an integrated circuit chip with signal processing capabilities. In implementation, each step of the above method can be completed by the integrated logic circuitry in the hardware of the processor 200 or by instructions in software form. The processor 200 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules may reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. The storage medium is located in memory 201. The processor 200 reads the information in memory 201 and, in conjunction with its hardware, completes the steps of the surface normal estimation method for the target object.
[0085] This application also provides a computer-readable storage medium corresponding to the surface normal estimation method for a target object provided in the foregoing embodiments, wherein a computer program is stored thereon, and the computer program, when run by a processor, executes the surface normal estimation method for a target object provided in any of the foregoing embodiments.
[0086] In addition, examples of the computer-readable storage medium may include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or other optical and magnetic storage media, which will not be described in detail here.
[0087] In addition, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the surface normal estimation method for a target object provided in any of the foregoing embodiments. The method includes: acquiring a first image of the target object under flash lighting conditions and a second image under non-flash lighting conditions; preprocessing the first image and the second image; inputting the preprocessed image into a trained preset network for normal estimation to obtain a normal map of the target object; wherein the preset network includes an encoder module, a diffusion denoising module, and a decoder module.
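The preprocessing step (mask, crop to the main body, bounding box) can be illustrated with a minimal sketch. It assumes images are NumPy arrays of shape (H, W, C) and masks are binary arrays of shape (H, W). The function name `crop_to_mask` and the zeroing of background pixels are hypothetical choices made for this illustration and are not specified by this application.

```python
import numpy as np

def crop_to_mask(image, mask):
    """Keep only masked pixels, then crop to the mask's bounding box.

    image: (H, W, C) array; mask: (H, W) array, non-zero on the target object.
    Zeroing the background is an assumption made for illustration.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no foreground pixels")
    top, bottom = ys.min(), ys.max() + 1    # bounding box rows
    left, right = xs.min(), xs.max() + 1    # bounding box columns
    masked = image * (mask[..., None] > 0)  # suppress background pixels
    return masked[top:bottom, left:right]

# Toy flash / non-flash pair sharing one mask (an assumption: static scene).
flash = np.full((6, 8, 3), 200, dtype=np.uint8)
no_flash = np.full((6, 8, 3), 60, dtype=np.uint8)
mask = np.zeros((6, 8), dtype=np.uint8)
mask[1:4, 2:7] = 1

flash_crop = crop_to_mask(flash, mask)
no_flash_crop = crop_to_mask(no_flash, mask)
```

Applying the same function to both images keeps the flash and non-flash crops pixel-aligned, which matters if their feature maps are later merged along the channel dimension.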
[0088] Those skilled in the art will understand that the various component embodiments of this application can be implemented in hardware, as software modules running on one or more processors, or as a combination thereof. In practice, microprocessors or digital signal processors (DSPs) can be used to implement some or all of the functions of some or all of the components in the surface normal estimation apparatus according to embodiments of this application.
[0089] The above description is merely a preferred embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for estimating the surface normal of a target object, characterized in that the method includes: acquiring a first image of the target object under flash lighting conditions and a second image under non-flash lighting conditions; preprocessing the first image and the second image; and inputting the preprocessed images into a trained preset network for normal estimation to obtain a normal map of the target object; wherein the preset network includes an encoder module, a diffusion denoising module, and a decoder module, and the diffusion denoising module integrates a CLIP model, which is used to receive text prompts and generate text embedding vectors; the step of inputting the preprocessed images into the trained preset network for normal estimation to obtain the normal map of the target object includes: inputting the preprocessed first image and the preprocessed second image respectively into the encoder module to extract latent features, thereby obtaining a first feature map corresponding to the first image and a second feature map corresponding to the second image; merging the first feature map, the second feature map, and a Gaussian noise map along the channel dimension to obtain a merged feature map; and inputting the merged feature map sequentially into the diffusion denoising module and the decoder module to obtain the normal map of the target object, which includes combining the text embedding vector with the merged feature map so that the two together serve as the input to the diffusion denoising module, the text embedding vector being used as a semantic condition to guide the denoising process, thereby alleviating shape-reflectivity ambiguity.
2. The method for estimating the surface normal of a target object according to claim 1, characterized in that the preprocessing of the first image and the second image includes: obtaining mask images of the first image and the second image; cropping the target object based on the mask images to obtain an image corresponding to the main body of the target object; calculating and generating a bounding box based on the image corresponding to the main body to define the region of the main body; and using the image defined by the bounding box as the preprocessed first image and the preprocessed second image.
3. The method for estimating the surface normal of a target object according to claim 1, characterized in that the step of sequentially inputting the merged feature map into the diffusion denoising module and the decoder module to obtain the normal map of the target object includes: inputting the merged feature map into the diffusion denoising module, which gradually removes noise from the merged feature map to obtain normal map features with object surface details; and inputting the normal map features into the decoder module, which uses a deconvolution operation to restore the normal map features to the normal map of the target object.
4. The method for estimating the surface normal of a target object according to claim 1, characterized in that the training method for the preset network includes: acquiring a synthetic dataset and a real dataset, wherein the synthetic dataset includes computer-generated images with virtual target objects, and the real dataset includes photographed images of real objects; preprocessing the synthetic dataset and the real dataset to obtain flash images and non-flash images suitable for input; and training the preset network based on the flash images and the non-flash images.
5. The method for estimating the surface normal of a target object according to claim 4, characterized in that the training method for the preset network also includes: when training the diffusion denoising module, adding Gaussian noise to the real normal map to simulate different levels of noise interference, so that the diffusion denoising module can learn to recover the real normal map from the noise.
6. The method for estimating the surface normal of a target object according to claim 1, characterized in that, after obtaining the normal map of the target object, the method further includes: performing three-dimensional reconstruction based on the normal map of the target object.
7. A device for estimating the surface normal of a target object, characterized in that the device includes: an acquisition module, used to acquire a first image of the target object under flash lighting conditions and a second image under non-flash lighting conditions; a processing module, used to preprocess the first image and the second image; and an estimation module, used to input the preprocessed images into a trained preset network for normal estimation to obtain a normal map of the target object; wherein the preset network includes an encoder module, a diffusion denoising module, and a decoder module, and the diffusion denoising module integrates a CLIP model, which is used to receive text prompts and generate text embedding vectors; the inputting of the preprocessed images into the trained preset network for normal estimation to obtain the normal map of the target object includes: inputting the preprocessed first image and the preprocessed second image respectively into the encoder module to extract latent features, thereby obtaining a first feature map corresponding to the first image and a second feature map corresponding to the second image; merging the first feature map, the second feature map, and a Gaussian noise map along the channel dimension to obtain a merged feature map; and inputting the merged feature map sequentially into the diffusion denoising module and the decoder module to obtain the normal map of the target object, which includes combining the text embedding vector with the merged feature map so that the two together serve as the input to the diffusion denoising module, the text embedding vector being used as a semantic condition to guide the denoising process, thereby alleviating shape-reflectivity ambiguity.
8. An electronic device comprising a memory and a processor, characterized in that the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the method for estimating the surface normal of a target object as described in any one of claims 1-6.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method for estimating the surface normal of a target object as described in any one of claims 1-6.
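As an illustrative note on the training step of claim 5, the addition of Gaussian noise at different levels to a normal map can be sketched as follows. The claims do not specify a noise schedule; the linear beta schedule, the 1000-step count, and the helper names `make_alpha_bars` and `add_noise` are assumptions borrowed from the common denoising-diffusion formulation, not details confirmed by this application.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(normal_map, t, alpha_bars, rng):
    """Mix a clean normal map with Gaussian noise at noise level t.

    Returns the noisy map and the noise that was added; a denoising
    module would be trained to recover the clean map (or the noise).
    """
    eps = rng.standard_normal(normal_map.shape)
    a = alpha_bars[t]
    noisy = np.sqrt(a) * normal_map + np.sqrt(1.0 - a) * eps
    return noisy, eps

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()

# Toy normal map of shape (3, H, W): every normal points along +z.
normal_map = np.zeros((3, 8, 8))
normal_map[2] = 1.0

noisy_low, eps_low = add_noise(normal_map, 0, alpha_bars, rng)      # light noise
noisy_high, eps_high = add_noise(normal_map, 999, alpha_bars, rng)  # heavy noise
```

Sampling the level `t` at random during training exposes the denoising module to the full range from nearly clean to nearly pure noise.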