Polarization-guided highlight removal method and system
By constructing a two-stage polarization-guided highlight removal network and utilizing polarization imaging characteristics and wavelet-enhanced Transformer, accurate detection of highlight regions and high-quality restoration of pseudo-diffuse reflection regions are achieved. This addresses the shortcomings of existing highlight removal methods and improves image restoration results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- WUHAN TEXTILE UNIV
- Filing Date
- 2026-05-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing highlight removal methods suffer from insufficient detection accuracy in highlight areas, inadequate calibration in pseudo-diffuse reflection areas, low feature learning efficiency, imperfect conditional injection mechanisms, and insufficient high-resolution processing capabilities, resulting in blurring and artifacts in the restored images.
A two-stage polarization-guided highlight removal network is constructed, which utilizes the physical characteristics of polarization imaging to extract multi-dimensional physical polarization features. The network uses a residual diffusion model and a dual-path conditional recovery network to accurately detect highlight areas and repair pseudo-diffuse reflection areas with high quality. By combining the feature learning capabilities of wavelet-enhanced Transformer, efficient feature fusion and image detail preservation are achieved.
It achieves accurate detection of highlight areas and effective calibration of pseudo-diffuse reflection areas, improving the accuracy and detail preservation of image restoration, and solving the problems of blur and artifact restoration in traditional methods.
[Figure: CN122243826A_ABST, abstract drawing]
Description
Technical Field
[0001] This invention belongs to the field of computer vision and digital image processing and relates to a highlight removal method, specifically a polarization-guided highlight removal method. It uses the physical characteristics of polarization imaging and the feature learning capabilities of deep learning networks to achieve accurate detection of highlight areas and high-quality restoration of pseudo-diffuse reflection areas in images, and is applicable to various image restoration scenarios with highlight interference.
Background Technology
[0002] In the fields of computer vision and digital image processing, specular highlights in images are areas of intense brightness formed by specular reflection when light strikes an object's surface. These areas can obscure crucial information such as texture and color, severely impacting the performance of subsequent visual tasks like object detection, image segmentation, and feature extraction. Therefore, specular removal, as a fundamental and critical preprocessing technique, has always been a research hotspot and challenge in computer vision.
[0003] Traditional highlight removal methods are mainly based on classic image processing algorithms and can be divided into the following categories: The first category is based on color space transformation methods. These methods utilize the distribution characteristics of highlight regions in a specific color space to detect and remove highlights. For example, in the HSV color space, the saturation of highlight regions is low, and highlight removal is achieved by adjusting the saturation channel. However, this type of method is prone to causing image color distortion and does not perform well in handling complex highlights. The second category is based on image fusion methods. These methods acquire multiple images under different lighting conditions and fuse them to obtain a highlight-free image. However, this type of method requires multiple image acquisitions, is complex to operate, and is difficult to apply in dynamic scenes. The third category is based on partial differential equation methods. These methods construct partial differential equation models and use the gradient information of the image to diffuse the brightness of highlight regions to achieve highlight removal. However, this type of method is prone to causing image blurring and loss of detail information.
[0004] With the rapid development of deep learning technology, highlight removal methods based on neural networks have become the mainstream research approach. Compared with traditional methods, deep learning methods can automatically learn high-order image features, which significantly improves their ability to handle complex highlights. Existing deep learning highlight removal methods fall mainly into two categories. The first is based on end-to-end generative networks: models such as Generative Adversarial Networks (GANs) and U-shaped convolutional neural networks (U-Nets) take an image with highlights as input and output a highlight-free image. Although simple to operate, these methods lack physical prior constraints and are therefore prone to overfitting during training. Faced with highlights under different lighting conditions and on different material surfaces, they generalize poorly and tend to produce artifacts and boundary distortion. The second category is polarization-assisted deep learning, which introduces the physical information provided by polarized images as auxiliary constraints to improve highlight removal. However, most of these methods simply concatenate the polarization features and feed them into the network without exploring the physical semantics of polarization imaging. They do not fuse the polarization features with the network efficiently, so the polarization information is poorly used and the calibration of pseudo-diffuse reflection regions after highlight removal remains unsolved.
[0005] In addition, existing highlight removal methods have the following technical drawbacks:
[0006] (1) Unreasonable task design: most methods generate the highlight-free image directly, which is a harder task and leaves the network with insufficient prediction accuracy for highlight areas, easily resulting in incomplete or excessive highlight removal.
[0007] (2) Low feature learning efficiency: the polarization features are not divided by meaning and learned in a targeted way. The network must distinguish the features of highlight, ghosting, and diffuse reflection regions on its own, so learning efficiency is extremely low when training samples are scarce.
[0008] (3) Imperfect conditional injection mechanism: polarization conditional features are injected into the network in a fixed manner, without considering the differences between deep and shallow network features or the requirements of different time steps in the diffusion process, resulting in poor feature fusion.
[0009] (4) Insufficient high-resolution processing capability: when processing high-resolution images, traditional Transformer networks suffer from excessive computation and high memory usage, and they struggle to preserve high-frequency image details, resulting in blurred restored images.
[0010] (5) Insufficient pseudo-diffuse calibration: the pseudo-diffuse area produced by highlight removal receives too little attention and lacks an effective calibration mechanism, resulting in color distortion and texture blurring in the pseudo-diffuse area of the restored image.
[0011] To address the shortcomings of existing technologies, this invention proposes a polarization-guided specular removal method. It constructs a two-stage polarization-guided specular removal network, introducing the concept of pseudo-diffuse reflection into the field of specular removal research for the first time. It leverages the physical characteristics of polarization imaging to construct strong prior constraints, while deeply integrating a dual-path conditional recovery architecture with the efficient feature learning capabilities of wavelet-enhanced Transformers. This effectively solves the technical defects of traditional specular removal methods, achieving accurate detection of specular regions and high-quality restoration of image details.
Summary of the Invention
[0012] To overcome the shortcomings of the existing technologies, this invention proposes a polarization-guided specular removal method. It constructs a two-stage polarization-guided specular removal network and introduces the concept of pseudo-diffuse reflection into the field of specular removal research for the first time. It builds strong prior constraints based on the physical characteristics of polarization imaging and integrates the dual-path conditional recovery architecture with the efficient feature learning capability of wavelet-enhanced Transformer. This effectively solves the problems of blur, artifacts and boundary distortion in traditional specular removal methods.
[0013] The technical solution of this invention is: a polarization-guided specular removal method, comprising the following steps:
[0014] Step 1: Acquire the RGB image M with highlights and the polarization image P, and recover the multi-dimensional physical polarization features from the polarization image P, including the degree of linear polarization DoLP, the polarization angle AoLP, the diffuse reflection intensity, and the weighting coefficient w;
[0015] Step 2: Construct a polarization-guided residual diffusion model comprising a three-path feature-based linear modulation encoder, a hybrid injection module, a time-step gating module, a dual-modal spatial mask generation module, and a pre-trained SD1.5 Unet network. The three-path feature-based linear modulation encoder extracts the multi-dimensional physical polarization features of the polarization image P path by path to obtain multi-scale conditional features; the hybrid injection module injects these multi-scale conditional features into the SD1.5 Unet network; the time-step gating module dynamically adjusts the conditional injection intensity; and the dual-modal spatial mask generation module constrains the feature injection region. Finally, the model combines these with the RGB image M with highlights to predict the specular residual map;
[0016] Step 3: Compute a preliminary highlight-free image from the specular residual map. Construct a dual-path conditional recovery network comprising a dual-path polarization coding structure, a spatial feature transformation module, a hierarchical integration module, and a wavelet-enhanced Restormer Transformer network. The preliminary highlight-free image is fed into the Restormer Transformer network, while global prior features and spatial detail features are extracted by the dual-path polarization coding structure. The global prior features are injected into the Restormer Transformer network by the hierarchical integration module using cross-attention, and the spatial feature maps are injected by the spatial feature transformation module using pixel-by-pixel modulation. This refines and repairs the pseudo-diffuse reflection regions in the preliminary highlight-free image, yielding a high-quality highlight-free image.
[0017] Furthermore, in step 1, the original polarization image P acquired by the DoFP polarization camera contains light intensity information in four polarization directions: 0°, 45°, 90°, and 135°. The degree of linear polarization DoLP, the diffuse reflection intensity, and the weighting coefficient w are calculated using the following formulas:
[0018]
[0019]
[0020]
[0021]
[0022]
[0023] where $I_{0}$, $I_{45}$, $I_{90}$ and $I_{135}$ are the light intensity values in the four polarization directions; $I_{\max}$ and $I_{\min}$ are the maximum and minimum polarized light intensities, respectively; k is the diffuse reflection compensation coefficient, with a value range of [0.1, 0.3]; α and β are weight adjustment parameters, with α in [5, 10] and β in [0.3, 0.5]; DoLP takes values in [0, 1]; the diffuse reflection intensity characterizes the diffuse component of the image and constrains the hue after specular removal; and w is the polarization feature weighting coefficient.
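The polarization quantities named above can be illustrated with the standard Stokes-parameter relations. The following is a minimal sketch, not the formulas of this invention: the Stokes, DoLP, AoLP, and maximum/minimum intensity expressions are the textbook ones, while the expressions used for the diffuse reflection intensity (`i_d`) and the weighting coefficient (`w`) are assumptions chosen only to be compatible with the stated parameter ranges of k, α and β. The function name and default values are hypothetical.

```python
import numpy as np

def polarization_features(i0, i45, i90, i135, k=0.2, alpha=8.0, beta=0.4, eps=1e-8):
    """Per-pixel polarization features from four polarizer-angle intensity maps."""
    # Standard linear Stokes parameters.
    s0 = 0.5 * (i0 + i45 + i90 + i135)   # total intensity
    s1 = i0 - i90
    s2 = i45 - i135

    # Degree and angle of linear polarization (textbook definitions).
    dolp = np.clip(np.sqrt(s1**2 + s2**2) / (s0 + eps), 0.0, 1.0)
    aolp = np.mod(0.5 * np.arctan2(s2, s1), np.pi)   # wrapped to [0, pi)

    # Maximum and minimum intensity behind an ideal rotating polarizer.
    i_max = 0.5 * s0 * (1.0 + dolp)
    i_min = 0.5 * s0 * (1.0 - dolp)

    # ASSUMPTION: the source formulas for the diffuse intensity and for w are
    # not available; these two lines are illustrative stand-ins only.
    i_d = i_min + k * (i_max - i_min)
    w = 1.0 / (1.0 + np.exp(-alpha * (dolp - beta)))   # sigmoid, values in (0, 1)
    return dolp, aolp, i_d, w
```

For fully polarized light aligned with 0° the sketch returns a DoLP close to 1 and an AoLP of 0, and for unpolarized light it returns a DoLP of 0 and a small weight.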
[0024] Furthermore, in step 2, the three-path feature-based linear modulation encoder divides the polarization features into three independent paths according to physical semantics. Each path consists of an input layer, multiple multi-scale coding blocks, and a feature normalization layer. The three paths do not interfere with each other and focus on learning the features corresponding to their physical semantics.
[0025] Highlight channel path: the input feature is $[W_{pos}, \mathrm{DoLP}, \mathrm{AoLP}, w]$, where $W_{pos}$ is the specular position weight map constructed from DoLP as $W_{pos} = \mathrm{DoLP}^{\alpha}$, α is a learnable parameter, and AoLP is the polarization angle. This path focuses on feature extraction and detection in strong highlight regions; its multi-scale coding block consists of a 3×3 convolutional layer, a batch normalization layer, a ReLU activation function, and a max pooling layer;
[0026] Ghost channel path: the input features are [DoLP, 1 - localSSIM(M, )], where localSSIM is the local structural similarity index, and the input feature is $[W_{ssim}, F_{wavelet}]$, where $W_{ssim}$ is the ghost detection weight map constructed from local structural similarity as $W_{ssim} = 1 - \mathrm{localSSIM}(M, )$, and $F_{wavelet}$ is the high-frequency energy feature extracted by wavelet transform. This path focuses on feature extraction and detection of ghost / weak reflection regions; the structure of its multi-scale coding block is the same as in the highlight channel path, with the same number of output channels to ensure feature scale matching;
[0027] Color channel path: the input feature is the diffuse reflection intensity, which provides color priors and tone constraints after specular removal; the multi-scale coding block adds an adaptive color adjustment layer after the convolutional layer, which optimizes the color expression of diffuse components by learning color mapping parameters;
[0028] The three paths output features at multiple scales respectively. When injected into the SD1.5 Unet network, the features at the corresponding scales of the three paths are concatenated along the channel dimension to form fused conditional features.
[0029] Furthermore, the hybrid injection module in step 2 includes a feature linear modulation submodule, a hierarchical integration submodule, and a weight fusion layer, used to achieve efficient fusion of conditional features and SD1.5 Unet network features:
[0030] Feature linear modulation submodule: applies a linear transformation to the concatenated fused conditional features to generate a scale parameter γ and an offset parameter β, and modulates the intermediate features of the SD1.5 Unet network as $F_{FiLM} = \gamma \odot F + \beta$, where $F$ is the intermediate feature of a given layer in the SD1.5 Unet network and $F_{FiLM}$ is the feature after FiLM modulation;
[0031] Hierarchical integration submodule: uses a cross-attention mechanism for interactive fusion of the conditional features and the Unet features. First, the fused conditional features are mapped through linear layers to a query vector Q, a key vector K, and a value vector V. Then, the SD1.5 Unet intermediate features are mapped to the target query vector Q'. The cross-attention output is computed as $\mathrm{Attention}(Q', K, V) = \mathrm{Softmax}\left(Q'K^{T}/\sqrt{d_k}\right)V$, where $d_k$ is the dimension of the key vector, Attention is the attention mechanism, and Softmax is the softmax normalization function. The attention outputs are concatenated along the channel dimension (Concat) and multiplied by the output projection weight matrix to give $F_{HIM}$, the output feature of the hierarchical integration submodule;
[0032] Weighted fusion layer: fuses the outputs of the feature linear modulation submodule and the hierarchical integration submodule using a learnable weight parameter w', which is adaptively learned by the network and takes values in [0,1].
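The three parts of the hybrid injection module can be sketched on flattened feature tokens as follows. This is a minimal illustration under stated assumptions: the attention is single-head for brevity, the weighted fusion is taken to be a convex combination with weight `w_mix`, and all function names, parameter names, and matrix shapes are hypothetical rather than taken from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def film(feat, cond, w_gamma, w_beta):
    """Feature linear modulation: scale and offset are linear maps of the condition."""
    gamma = cond @ w_gamma                     # (N, C)
    beta = cond @ w_beta                       # (N, C)
    return gamma * feat + beta

def cross_attention(feat, cond, w_q, w_k, w_v, w_o):
    """Single-head cross-attention: query from network features, key/value from condition."""
    q = feat @ w_q                             # target query Q'
    k = cond @ w_k
    v = cond @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return (attn @ v) @ w_o                    # output projection back to C channels

def hybrid_inject(feat, cond, p, w_mix):
    f_film = film(feat, cond, p["w_gamma"], p["w_beta"])
    f_him = cross_attention(feat, cond, p["w_q"], p["w_k"], p["w_v"], p["w_o"])
    # ASSUMPTION: fusion is a convex combination with w_mix in [0, 1].
    return w_mix * f_film + (1.0 - w_mix) * f_him
```

With `w_mix = 1` the sketch reduces to pure feature linear modulation, and with `w_mix = 0` to pure cross-attention, which makes the role of the learnable weight easy to inspect.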
[0033] Furthermore, the time-step gating module in step 2 includes a time-step encoding layer, a gating coefficient generation layer, and an intensity adjustment layer, used to dynamically adjust the conditional injection intensity at different diffusion time steps:
[0034] Time step coding layer: Performs sinusoidal coding on the diffusion time step t, converting the discrete time step into a continuous feature vector;
[0035] Gating coefficient generation layer: the encoded time-step features are input into a two-layer fully connected network, and the gating coefficient g is generated by the Sigmoid activation function as $g = \sigma\left(W_2 \cdot \mathrm{ReLU}(W_1 \cdot \mathrm{PE}(t)) + b_2\right)$, where $W_1$ and $W_2$ are fully connected layers, ReLU is the rectified linear unit, PE(t) is the sinusoidal time-step encoding, $b_2$ is a bias term, σ is the Sigmoid activation function, and g takes values in [0,1].
[0036] Intensity adjustment layer: the gating coefficient g is applied to the scale parameter γ and offset parameter β of the feature linear modulation submodule and to the attention weights of the hierarchical integration submodule, respectively: $\gamma' = g \cdot \gamma$, $\beta' = g \cdot \beta$, and $\mathrm{Attention}' = g \cdot \mathrm{Attention}$, where Attention' is the adjusted attention output.
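The time-step gating described above can be sketched as a sinusoidal encoding followed by a two-layer network with a Sigmoid output. The encoding dimension, the `max_period` constant, the weight shapes, and the function names are assumptions made for illustration; the gate is applied by simple multiplication.

```python
import numpy as np

def sinusoidal_encoding(t, dim=16, max_period=10000.0):
    """Map a diffusion time step t to a continuous vector of length dim (dim even)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def gating_coefficient(t, w1, w2, b2):
    """g = sigmoid(w2 . ReLU(w1 . PE(t)) + b2), a scalar in (0, 1)."""
    pe = sinusoidal_encoding(t, dim=w1.shape[1])
    hidden = np.maximum(w1 @ pe, 0.0)          # first layer + ReLU
    return float(1.0 / (1.0 + np.exp(-(w2 @ hidden + b2))))

def apply_gate(g, gamma, beta, attention):
    """Scale the modulation parameters and the attention output by the gate."""
    return g * gamma, g * beta, g * attention
```

Here `w1` has shape (hidden, dim) and `w2` has shape (hidden,). A gate near 0 suppresses condition injection at that time step, and a gate near 1 passes it through unchanged.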
[0037] Furthermore, the dual-modal spatial mask generation module in step 2 includes a ghost detection layer, a mask generation layer, and a mask modulation layer, used for the injection region of constraint features:
[0038] Ghost detection layer: defines the ghost detection metric, where localSSIM is the local structural similarity index computed over a sliding window, with values in [0,1];
[0039] Mask generation layer: constructs the specular mask, the ghost mask, and the merged mask from DoLP and the ghost detection metric, where k1 and k2 are slope parameters and Th1 and Th2 are threshold parameters, all of which are learnable; σ is the Sigmoid activation function, and max denotes the pixel-by-pixel maximum.
[0040] Mask modulation layer: applies the merged mask to the scale parameter γ and the offset parameter β of the feature linear modulation submodule, so that the conditional features affect only the highlight and ghost areas.
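One plausible reading of the mask generation and mask modulation layers is sketched below. The soft-threshold form `sigmoid(k * (x - Th))`, the default slope and threshold values, and the residual form of the modulation (so that a zero mask leaves the features untouched) are all assumptions, since the source formulas are not available; the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_masks(dolp, ghost_metric, k1=10.0, th1=0.5, k2=10.0, th2=0.5):
    """Soft specular mask from DoLP, soft ghost mask from the ghost metric, merged by max."""
    m_spec = sigmoid(k1 * (dolp - th1))
    m_ghost = sigmoid(k2 * (ghost_metric - th2))
    return m_spec, m_ghost, np.maximum(m_spec, m_ghost)   # pixel-wise maximum

def masked_modulation(feat, gamma, beta, mask):
    """Apply the mask to gamma and beta before modulating.

    ASSUMPTION: the modulation is added to the input features, so pixels with
    mask = 0 are passed through unchanged.
    """
    return feat + (mask * gamma) * feat + (mask * beta)
```

In this reading, a pixel with high DoLP or a high ghost metric receives the full conditional modulation, while clean background pixels keep their original features.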
[0041] Furthermore, the training process and loss function design of the polarization-guided residual diffusion model in step 2 are as follows:
[0042] Data preprocessing: the image M with highlights and the true highlight-free image T are normalized to the interval [-1,1], and the true highlight residual is calculated;
[0043] Latent space mapping: the true highlight residual is input to the pre-trained VAE encoder to obtain its latent representation;
[0044] Noise injection: following the noise schedule of the diffusion process, Gaussian noise is added to the latent representation to obtain noisy latent representations at different time steps: $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$, where $z_0$ is the latent representation, $z_t$ is the noisy latent representation at time step t, ε is standard Gaussian noise, and $\bar{\alpha}_t$ is the cumulative product coefficient of the diffusion process;
[0045] Denoising training: the noisy latent representation, the features of the image M with highlights, and the polarization conditional features are input into the SD1.5 Unet network, which outputs the predicted noise;
[0046] Loss function: a composite loss combining MSE loss and WSSIM loss is used, $L = \lambda_1 L_{MSE} + \lambda_2 L_{WSSIM}$, computed on the predicted specular residual map, where $\lambda_1$ and $\lambda_2$ are balancing weights, $L_{MSE}$ is the mean squared error loss, and $L_{WSSIM}$ is the weighted structural similarity loss.
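The data-side steps of this training procedure can be sketched without the VAE or the Unet. The sketch below makes several assumptions that are not stated in the source: the residual is taken as M minus T, the noise schedule is a plain linear beta schedule, and the structural term is a single-window (global) SSIM used as a simple stand-in for the weighted SSIM loss. All names and default values are hypothetical.

```python
import numpy as np

def to_unit_range(img_uint8):
    """Normalize an 8-bit image to [-1, 1]."""
    return img_uint8.astype(np.float64) / 127.5 - 1.0

def highlight_residual(m, t):
    """ASSUMPTION: residual = image with highlights minus highlight-free image."""
    return m - t

def cumulative_alphas(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def composite_loss(pred, target, lam_mse=1.0, lam_ssim=0.1, data_range=2.0):
    """lam_mse * MSE + lam_ssim * (1 - SSIM), with a global SSIM stand-in."""
    mse = np.mean((pred - target) ** 2)
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_x, mu_y = pred.mean(), target.mean()
    var_x = np.mean((pred - mu_x) ** 2)
    var_y = np.mean((target - mu_y) ** 2)
    cov = np.mean((pred - mu_x) * (target - mu_y))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2))
    return lam_mse * mse + lam_ssim * (1.0 - ssim)
```

In a full training loop the network would be asked to predict `eps` from `zt`, the time step, and the conditional features; that part is omitted here because it depends on the pre-trained model.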
[0047] Furthermore, the dual-path polarization coding structure of the dual-path conditional recovery network in step 3 includes a global path and a spatial path, with the two paths operating in parallel:
[0048] Global path: Input fusion features are [ DoLP [, w], global prior features are obtained through a prior encoder. The prior encoder consists of multiple convolutional layers, batch normalization layers, ReLU activation functions, and global average pooling layers. After global average pooling, a global feature vector is obtained, which is then mapped to global prior features through two fully connected layers.
[0049] Spatial path: the input is the specular residual map, and the spatial conditional encoder outputs multi-scale spatial feature maps. The spatial conditional encoder adopts a U-shaped structure with multiple downsampling blocks and multiple upsampling blocks. The downsampling blocks reduce and extract features through convolutional layers, while the upsampling blocks use transposed convolutional layers to restore the feature scale.
[0050] Furthermore, the spatial feature transformation module in step 3 includes a feature extraction layer, a parameter generation layer, and a feature modulation layer, used to generate pixel-by-pixel modulation parameters:
[0051] Feature extraction layer: Performs convolution operation on the spatial feature map output by the spatial path to extract deep spatial features. The convolution kernel size is 3×3, and the number of output channels is the same as the number of input channels.
[0052] Parameter generation layer: a 1×1 convolutional layer generates the pixel-wise scaling parameters γ(h,w) and offset parameters β(h,w) from the extracted spatial features, which are derived from the output features of the spatial path;
[0053] Feature modulation layer: the intermediate features of the Restormer Transformer are corrected by pixel-by-pixel modulation, in which the intermediate feature of a given layer is scaled by γ(h,w) and shifted by β(h,w).
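The parameter generation and feature modulation layers can be sketched as two 1×1 convolutions followed by a pixel-wise affine transform. This is a minimal illustration: the preceding 3×3 feature extraction convolution is omitted, the plain form `gamma * feat + beta` is an assumption, and the function names and weight shapes are hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def spatial_feature_transform(feat, spatial, w_gamma, b_gamma, w_beta, b_beta):
    """Modulate feat (C, H, W) with per-pixel gamma(h, w) and beta(h, w)."""
    gamma = conv1x1(spatial, w_gamma, b_gamma)   # (C, H, W) scaling map
    beta = conv1x1(spatial, w_beta, b_beta)      # (C, H, W) offset map
    return gamma * feat + beta
```

Unlike the channel-wise modulation used in the first stage, every pixel here receives its own scale and offset, which is what lets the spatial path steer the repair of individual pseudo-diffuse regions.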
[0054] Furthermore, in step 3, the wavelet-enhanced Restormer Transformer network is based on a U-shaped structure containing k downsampling blocks, k upsampling blocks, and one bottleneck block. Each downsampling block, upsampling block, and bottleneck block contains multiple LeWin Blocks as basic building blocks.
[0055] The LeWin Block comprises a wavelet decomposition layer, a multi-head self-attention layer, a wavelet inverse decomposition layer, and a residual connection. The process is as follows: The wavelet decomposition layer uses a two-dimensional discrete wavelet transform to decompose the input features, obtaining low-frequency components LL and high-frequency components LH, HL, and HH; the multi-head self-attention layer performs multi-head self-attention calculation on the low-frequency component LL, and the high-frequency components are concatenated with the attention output of LL after 1×1 convolutional layer channel adjustment; the wavelet inverse decomposition layer performs a two-dimensional discrete wavelet inverse transform on the fused features to restore the feature scale; the residual connection adds the input features and the inversely decomposed features using residual addition.
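The data flow of the block described above can be sketched with a one-level Haar transform. This is an illustration under assumptions, not the network itself: the Haar wavelet is chosen for simplicity, the attention is single-head with identity projections, the 1×1 convolution on the high-frequency components is omitted, and the sub-band naming (LH, HL) follows one common convention. All function names are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform of a (C, H, W) array with even H, W."""
    a, b = x[:, 0::2, 0::2], x[:, 0::2, 1::2]
    c, d = x[:, 1::2, 0::2], x[:, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2; restores the original spatial size."""
    ch, h, w = ll.shape
    x = np.empty((ch, 2 * h, 2 * w), dtype=ll.dtype)
    x[:, 0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[:, 0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[:, 1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[:, 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def self_attention(tokens):
    """Single-head self-attention on (N, C) tokens with identity projections."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ tokens

def wavelet_attention_block(x):
    """Attend on LL only, keep the high-frequency bands, invert, add the residual."""
    ll, lh, hl, hh = haar_dwt2(x)
    ch, h, w = ll.shape
    tokens = ll.reshape(ch, h * w).T                      # (N, C) with N = h * w
    ll_att = self_attention(tokens).T.reshape(ch, h, w)
    return x + haar_idwt2(ll_att, lh, hl, hh)             # residual connection
```

Because attention runs on the LL band only, the token count drops by a factor of four relative to the input resolution, while the untouched LH, HL, and HH bands carry the edges and textures back through the inverse transform.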
[0056] Overall workflow of the wavelet-enhanced Restormer Transformer network: after the preliminary highlight-free image is input, preliminary feature extraction is performed first. The feature scale is then gradually reduced and the number of channels increased through k downsampling blocks; deep features are extracted by the bottleneck block; and the feature scale is gradually restored through k upsampling blocks. Finally, the output layer outputs the repair residual Δ.
[0057] Final image generation: the repair residual Δ, scaled by a residual weighting coefficient, is added to the preliminary highlight-free image to give the final highlight-free image T.
[0058] The present invention also provides a polarization-guided specular removal system, including a processor and a memory, wherein the memory is used to store program instructions, and the processor is used to call the program instructions in the memory to execute the polarization-guided specular removal method as described in the above technical solution.
[0059] Compared with the prior art, the beneficial effects of the present invention are:
[0060] 1. This invention constructs strong prior constraints based on the physical characteristics of polarization imaging, and introduces the concept of pseudo diffuse reflection into the field of highlight removal for the first time. By extracting multi-dimensional physical features of polarization images, it provides a physical basis for highlight detection and restoration, effectively avoiding the problems of blurring and artifacts caused by traditional methods that rely solely on RGB information.
[0061] 2. A two-stage network architecture was designed. In the first stage, the residual diffusion model was used to accurately predict the specular residual, which reduced the difficulty of directly generating images without specular highlights. In the second stage, the specular residual was used for fine-grained repair, which achieved accurate detection of the specular region and effective calibration of the pseudo-diffuse region.
[0062] 3. A three-way encoder is proposed, which learns polarization features according to physical semantics through separate paths. This avoids the problem of low learning efficiency caused by simple feature concatenation, allowing each encoder to focus on learning its own corresponding physical semantics and improving the effectiveness of feature extraction.
[0063] 4. A dual-path conditional recovery network is constructed, which integrates global semantics and spatial detail polarization features to achieve progressive constraints from local to global. At the same time, a wavelet-enhanced Transformer is used to effectively reduce the computational load of high-resolution image processing, preserve the high-frequency details of the image, and solve the problem of blur restoration by traditional Transformer.
[0064] 5. Modules such as HybridInjector, TimestepGate, and the dual-modal spatial Mask were designed to achieve efficient integration of polarization condition features with the network, dynamically adjust the condition injection intensity, and precisely constrain the feature injection area, which not only ensures the repair effect in the highlight area but also avoids damage to the clean texture of the background.
Attached Figure Description
[0065] Figure 1 is a flowchart of the overall technical process of the present invention;
[0066] Figure 2 is a feature fusion architecture diagram of the three-way encoder and HybridInjector of the present invention;
[0067] Figure 3 is a diagram of the time-step dynamic injection architecture for the polarization features of the present invention;
[0068] Figure 4 is a diagram of the complete injection architecture of the dual-path polarization features of the present invention.
Detailed Implementation
[0069] To better understand the technical solution of the present invention, the specific embodiments of the present invention will be further described below with reference to the accompanying drawings. The core innovation of the polarization-guided specular removal method provided in the embodiments of the present invention lies in the fusion of the physical prior of polarization imaging and a two-stage deep learning architecture, which solves the pain points of traditional specular removal techniques such as blurring, obvious artifacts, boundary distortion, and inaccurate specular detection. It achieves accurate positioning of specular areas and high-quality restoration of image details, and is suitable for various application scenarios with specular interference, such as industrial inspection, machine vision, and film and television post-production. The implementation details and technical principles of each step are described in detail below with reference to the specific process.
[0070] Referring to Figure 1, an embodiment of the present invention provides a polarization-guided specular removal method, which includes the following steps:
[0071] (1) Acquire the RGB image M with highlights and the polarization image P using a DoFP polarization camera, recover the degree of linear polarization DoLP, the polarization angle AoLP, and the diffuse reflection intensity from the polarization image P, calculate the weight coefficient w, and construct a multi-dimensional polarization feature set.
[0072] (2) Construct a polarization-guided residual diffusion model comprising a three-way encoder, a HybridInjector module, a TimestepGate module, a dual-modal spatial Mask generation module, and a pre-trained SD1.5 Unet network. The three-way encoder extracts the multi-dimensional physical features of the polarization image P path by path to obtain multi-scale conditional features; the HybridInjector module injects the conditional features into the SD1.5 Unet network; the TimestepGate module dynamically adjusts the conditional injection intensity; and the dual-modal spatial Mask generation module constrains the feature injection area. Finally, the specular residual map is predicted in combination with the RGB image M with highlights;
[0073] (3) Compute a preliminary highlight-free image from the specular residual map. Construct a dual-path conditional recovery network comprising a PriorEncoder, a SpatialConditionEncoder, an HIM module, an SFT module, and a wavelet-enhanced Restormer Transformer network. The preliminary highlight-free image is fed into the Restormer Transformer network; global prior features and spatial detail features are extracted by the PriorEncoder and the SpatialConditionEncoder, respectively, and injected into the network via the HIM and SFT modules. By fusing the polarization physical features, the network finely calibrates and repairs the pseudo-diffuse reflection regions in the preliminary highlight-free image, yielding a high-quality highlight-free image T.
[0074] The various stages of the overall implementation process are progressive and collaborative, ensuring that the final output is a restored image free of specular interference, with intact textures and a natural appearance. The specific implementation process is as follows:
[0075] The data acquisition and polarization feature extraction in step (1) serve as the foundation of the entire specular removal process. Its core objective is to acquire high-quality dual-modal input data and extract polarization features with strong physical priors, providing reliable support for the training and inference of subsequent network models.
[0076] In practice, a DoFP (Division of Focal Plane) polarization camera is used to simultaneously acquire RGB images M (with highlights) and polarization images P. Compared to traditional time-division and angle-division polarization acquisition devices, the core advantage of this camera is that it can simultaneously capture light intensity information in four key polarization directions (0°, 45°, 90°, and 135°) in a single exposure. This effectively avoids dual-modal image registration errors caused by dynamic changes in the shooting scene (such as slight object movement or fluctuations in light intensity), ensuring spatial consistency between the RGB and polarization images from the source, and laying the foundation for subsequent deep fusion of polarization and RGB image features. During the actual acquisition process, camera shooting parameters need to be optimized: the exposure time is adjusted to 10-20ms, and the ISO sensitivity is set to 100-200 to reduce image noise interference, ensure that the acquired images are clear and detailed, and guarantee the accuracy of polarization angle acquisition, providing comprehensive data support for polarization feature extraction.
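Because a DoFP sensor interleaves the four polarizer orientations in repeating 2×2 super-pixels, the four directional images can be separated from a single raw frame by strided slicing. The sketch below assumes a particular 2×2 layout (90° and 45° on the first row, 135° and 0° on the second); the actual layout depends on the sensor and should be taken from its documentation. The function name and the `layout` parameter are hypothetical.

```python
import numpy as np

def split_dofp_mosaic(raw, layout=((90, 45), (135, 0))):
    """Split a raw DoFP frame (H, W) into four half-resolution directional images.

    ASSUMPTION: layout[r][c] gives the polarizer angle at row offset r and
    column offset c inside every 2x2 super-pixel.
    """
    channels = {}
    for r in range(2):
        for c in range(2):
            channels[layout[r][c]] = raw[r::2, c::2]
    return channels
```

Each returned image has half the raw resolution in both dimensions, and since all four come from the same exposure they are aligned up to the one-pixel offset within each super-pixel.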
[0077] From the polarization image P, the degree of linear polarization (DoLP), the angle of linear polarization (AoLP), and the diffuse reflection intensity I_diff are recovered based on the principles of polarization optics. These core physical quantities are strongly correlated with the reflective properties of the object's surface and are key physical priors for specular highlight removal. They are also the core advantage of this invention, distinguishing it from traditional specular highlight removal methods that rely solely on RGB images. Specifically, DoLP takes values in [0,1] and describes the degree of light polarization; a value closer to 1 indicates stronger polarization, and a value closer to 0 indicates weaker polarization. AoLP takes values in [0,π) and reflects the vibration direction of polarized light, which is closely related to the object's surface material and reflection angle. The diffuse reflection intensity I_diff is the intensity of the diffuse reflection component of the object's surface, obtained by separation based on the principle of polarization imaging. It reflects the true texture and color characteristics of the object's surface and is not affected by specular highlights.
[0078] The raw polarization image P acquired by the DoFP polarization camera contains light intensity information in four polarization directions (0°, 45°, 90°, 135°). The degree of linear polarization DoLP, the diffuse reflection intensity I_diff, and the weighting coefficient w are calculated using the following formulas:
[0079]
[0080]
[0081]
[0082]
[0083]
[0084] where I_0, I_45, I_90, and I_135 are the light intensity values in the four polarization directions; I_max and I_min are the maximum and minimum polarized light intensities, respectively; k is the diffuse reflection compensation coefficient, with a value range of [0.1, 0.3]; α and β are weight adjustment parameters, with α ranging from [5, 10] and β ranging from [0.3, 0.5]. DoLP ranges from [0, 1]; the closer the value is to 1, the stronger the polarization characteristics of the region, and the more likely it is to be a specular region. I_diff is the diffuse component of the image and is used to characterize the color tone after highlight removal. w is the polarization feature weight coefficient, which adjusts the contribution of polarization features in the subsequent networks; it is normalized to the [0,1] interval by the Sigmoid function or a truncation function.
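Because the formulas of this paragraph are not reproduced in the text, the following sketch falls back on the textbook Stokes-parameter definitions of DoLP and AoLP for a four-direction DoFP capture. The function name, the epsilon guard, and the I_max / I_min expressions are assumptions for illustration and are not confirmed to be the formulas of this invention; the compensation coefficient k and the weight w are deliberately left out.

```python
import numpy as np

def stokes_polarization_features(i0, i45, i90, i135, eps=1e-8):
    """Textbook Stokes-based DoLP/AoLP from four polarizer-angle intensities.

    Assumption: standard definitions, not necessarily the patent's own formulas.
    """
    i0, i45, i90, i135 = (np.asarray(a, dtype=np.float64)
                          for a in (i0, i45, i90, i135))
    s0 = 0.5 * (i0 + i45 + i90 + i135)      # total intensity
    s1 = i0 - i90                           # 0 deg vs 90 deg preference
    s2 = i45 - i135                         # 45 deg vs 135 deg preference
    dolp = np.clip(np.sqrt(s1**2 + s2**2) / (s0 + eps), 0.0, 1.0)
    aolp = np.mod(0.5 * np.arctan2(s2, s1), np.pi)   # wrapped into [0, pi)
    i_max = 0.5 * s0 * (1.0 + dolp)         # brightest polarizer reading
    i_min = 0.5 * s0 * (1.0 - dolp)         # darkest polarizer reading
    return dolp, aolp, i_max, i_min

# Fully polarized light aligned with the 0 deg direction.
dolp, aolp, i_max, i_min = stokes_polarization_features(1.0, 0.5, 0.0, 0.5)
```

In this toy case DoLP comes out close to 1 and AoLP close to 0, which matches the intuition that strongly polarized (specular) pixels sit near the top of the DoLP range.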
[0085] From a physical perspective, specular highlight areas exhibit strong polarization due to the specular reflection properties of light, resulting in significantly higher DoLP values compared to diffuse reflection areas. This physical difference provides a reliable basis for the initial distinction between highlight and diffuse reflection areas, helping the network to quickly locate highlight areas and reduce interference from invalid features. In contrast, pseudo-diffuse reflection areas, which exhibit texture distortion and color shift after initial highlight removal, have fundamentally different polarization characteristics from true diffuse reflection areas. Specifically, the DoLP and AoLP distributions of pseudo-diffuse reflection areas are chaotic and cannot match the material properties of the object's surface, and the transition between their polarization characteristics and those of the surrounding true diffuse reflection areas is unnatural. This characteristic serves as a core criterion for subsequent refined restoration, helping the network accurately identify and calibrate pseudo-diffuse reflection areas, avoiding issues such as texture blurring and color shift after restoration.
[0086] The extracted physical quantities, such as DoLP and I_diff, have different scales (DoLP takes values in [0,1], whereas I_diff ranges over [0, 255]). Directly inputting these physical quantities into the network for training can lead to an imbalance in the learning weights for different features, affecting training effectiveness and model performance. Therefore, these physical quantities are normalized with min-max normalization to the [0, 1] interval, eliminating the influence of scale differences. The normalization formula is: x_norm = (x - x_min) / (x_max - x_min), where x is the original physical quantity value, x_min is the minimum value of the physical quantity, x_max is the maximum value of the physical quantity, and x_norm is the normalized value. The normalized DoLP, I_diff, and other physical quantities, together with the RGB image M with highlights, are used as input to the subsequent network model, providing strong physical constraints for highlight residual prediction and pseudo-diffuse reflection calibration. This effectively avoids problems such as blurry restoration and inaccurate highlight detection caused by traditional methods that rely solely on RGB image information, thereby improving the robustness and restoration accuracy of the model.
[0087] Step (2) builds the polarization-guided residual diffusion model (first stage). Its core task is to accurately predict the specular residual map (the difference in pixel values between the highlight region and the original non-highlight region in the image), which lays a solid foundation for the second-stage pseudo-diffuse reflection calibration. Traditional highlight removal methods often generate the highlight-free image directly, which is difficult to learn. This invention instead uses residual prediction, transforming the highlight removal task into a highlight residual prediction task and significantly reducing the learning difficulty. At the same time, it integrates the powerful generation capability of the diffusion model with the polarization physics prior, effectively improving the prediction accuracy of the highlight residual. Specific implementation details are as follows:
[0088] As shown in Figure 2, first, a three-path Feature-wise Linear Modulation (FiLM) encoder is designed to process the physical features of the polarization image P. It addresses the problem that existing methods simply concatenate all polarization features, which results in low network learning efficiency and weak feature representation ability. Based on the physical semantics of polarization features and object reflection characteristics, the polarization features are explicitly divided into three independent paths: HighlightPath, GhostPath, and ColorPath. Each path is learned by an independent multi-scale convolutional encoder, ensuring that the paths do not interfere with each other. This allows each encoder to focus on learning its corresponding physical semantics, making feature extraction more targeted and effective.
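The min-max formula above can be sketched as follows. The function name and the guard for a constant-valued map (where x_max equals x_min) are assumptions added for illustration; the text does not say how that case is handled.

```python
import numpy as np

def min_max_normalize(x, eps=1e-8):
    """Scale an array to [0, 1] with x_norm = (x - x_min) / (x_max - x_min)."""
    x = np.asarray(x, dtype=np.float64)
    x_min, x_max = x.min(), x.max()
    span = x_max - x_min
    if span < eps:                 # assumption: a constant map is mapped to zeros
        return np.zeros_like(x)
    return (x - x_min) / span

# Toy diffuse-intensity map on the [0, 255] scale.
i_diff = np.array([[0.0, 51.0], [102.0, 255.0]])
i_diff_norm = min_max_normalize(i_diff)
```

Applying the same function to each physical quantity separately puts DoLP and the intensity-like maps on a common [0, 1] scale before they are stacked as network input.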
[0089] HighlightPath is primarily responsible for the accurate detection of strong highlight areas. Its input feature is [W_pos, DoLP, AoLP, w], where W_pos is the specular position weight map constructed from DoLP, calculated as W_pos = DoLP^α, in which α is a learnable parameter with an initial value of 2.0, and AoLP is the polarization angle. This corresponds to the physical semantics of "high polarization degree → strong polarization → specular reflection." Since the polarization degree of strong highlight regions is significantly higher than that of other regions, these input features strengthen the ability of the polarization degree features to represent strong highlight regions, enabling HighlightPath to accurately extract the position, contour, and intensity features of highlight regions and providing accurate position and intensity information for the subsequent prediction of highlight residuals. The weight coefficient w is initially set to 0.5 and can be adaptively adjusted during network training to optimize feature representation and ensure detection accuracy for highlight regions of different intensities.
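A minimal sketch of assembling the HighlightPath input is given below. It assumes that the position weight map is DoLP raised to the power α, that α is held fixed at its initial value (in the network it would be learnable), and that the scalar weight w is broadcast to a full map; the function name and array layout are hypothetical.

```python
import numpy as np

def highlight_path_input(dolp, aolp, w=0.5, alpha=2.0):
    """Stack [W_pos, DoLP, AoLP, w] into a (4, H, W) array, W_pos = DoLP ** alpha."""
    dolp = np.asarray(dolp, dtype=np.float64)
    aolp = np.asarray(aolp, dtype=np.float64)
    w_pos = dolp ** alpha            # suppresses weakly polarized pixels
    w_map = np.full_like(dolp, w)    # assumption: scalar w broadcast to a map
    return np.stack([w_pos, dolp, aolp, w_map], axis=0)

dolp = np.array([[0.1, 0.9], [0.5, 0.0]])
aolp = np.zeros_like(dolp)
features = highlight_path_input(dolp, aolp)
```

With α greater than 1, a pixel with DoLP 0.9 keeps most of its weight while a pixel with DoLP 0.1 is pushed toward zero, which is the emphasis on strongly polarized regions that the paragraph describes.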
[0090] GhostPath is primarily responsible for detecting ghosting and weakly reflective regions, compensating for the neglect of weakly reflective regions by strong highlight detection. The input features are [DoLP, 1 - localSSIM(M, I_diff)]; written with the derived features, the input is [W_ssim, F_wavelet], where W_ssim is the ghost detection weight map constructed from local structural similarity, calculated as W_ssim = 1 - localSSIM(M, I_diff), localSSIM is the local structural similarity index, and F_wavelet represents the high-frequency energy features extracted through wavelet transform. This corresponds to the physical semantics of "structural dissimilarity + high-frequency energy → deformation region". Specifically, localSSIM(M, I_diff) measures the local structural similarity between the specular image M and the diffuse intensity I_diff, with values in [0,1]. The larger the value of 1 - localSSIM(M, I_diff), the greater the difference in local structure between the two, and the more likely the region is a ghost or a weakly reflective area. The high-frequency energy feature captures high-frequency details in the image; ghosts and weakly reflective areas usually have high high-frequency energy. By fusing these two features, GhostPath can accurately detect ghosts and weakly reflective areas in the image and prevent them from being misjudged as normal diffuse reflective areas, thereby improving the completeness of specular residual prediction.
[0091] ColorPath is primarily responsible for providing color priors and tonal constraints for the image after specular removal, preventing color distortion in the restored image. Its input is the diffuse reflection intensity I_diff, which corresponds to the physical meaning of the "diffuse reflection approximation". Because I_diff reflects the true color and texture characteristics of an object's surface, feature learning in ColorPath provides a reliable color reference for subsequent specular residual prediction and image inpainting, ensuring that the colors of the inpainted image are consistent with the true colors of the object and improving the visual naturalness of the image.
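The two GhostPath cues can be illustrated with a small NumPy sketch. It assumes single-channel inputs scaled to [0, 1], a uniform (not Gaussian) SSIM window, the common SSIM constants 0.01 and 0.03, clipping of SSIM to [0, 1], and a one-level Haar transform as the wavelet; none of these choices is specified in the text, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_ssim(x, y, win=11, data_range=1.0):
    """Per-pixel SSIM over a uniform win x win window (win odd), clipped to [0, 1]."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    pad = win // 2
    xw = sliding_window_view(np.pad(x, pad, mode="reflect"), (win, win))
    yw = sliding_window_view(np.pad(y, pad, mode="reflect"), (win, win))
    mu_x, mu_y = xw.mean(axis=(-2, -1)), yw.mean(axis=(-2, -1))
    var_x, var_y = xw.var(axis=(-2, -1)), yw.var(axis=(-2, -1))
    cov = (xw * yw).mean(axis=(-2, -1)) - mu_x * mu_y
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))
    return np.clip(ssim, 0.0, 1.0)

def ghost_weight(m_gray, i_diff, win=11):
    """W_ssim = 1 - localSSIM(M, I_diff); large where local structure differs."""
    return 1.0 - local_ssim(m_gray, i_diff, win)

def haar_high_freq_energy(img):
    """Summed energy of the three one-level Haar detail bands (half resolution)."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    img = img[: h - h % 2, : w - w % 2]          # crop to an even size
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    d_row = (a + b - c - d) / 2.0                # top row minus bottom row
    d_col = (a - b + c - d) / 2.0                # left column minus right column
    d_diag = (a - b - c + d) / 2.0               # diagonal detail
    return d_row**2 + d_col**2 + d_diag**2

rng = np.random.default_rng(0)
m_gray = rng.random((16, 16))
w_ssim_same = ghost_weight(m_gray, m_gray)       # identical inputs
f_wavelet = haar_high_freq_energy(m_gray)
```

Identical inputs give a ghost weight of essentially zero everywhere, and a perfectly flat image has zero Haar detail energy, so both cues respond only where the structure actually changes.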
[0092] Each path's multi-scale convolutional encoder consists of 4 convolutional blocks and 2 pooling layers. It extracts feature maps at four scales (H / 8, H / 16, H / 32, H / 64, where H is the original image height) through progressive downsampling. Each scale corresponds to different levels of feature information: low-scale feature maps (H / 8) primarily capture detailed image features (such as highlight boundaries and texture details); mid-scale feature maps (H / 16, H / 32) primarily capture local structural features; and high-scale feature maps (H / 64) primarily capture global semantic features (such as the overall distribution of highlight areas). The feature maps from each path are concatenated only during the injection into the SD1.5 Unet to form multi-scale conditional features. This ensures the independence and specificity of each path during the feature learning stage, effectively improving the learning efficiency and expressive power of polarization features and avoiding interference between features from different paths.
[0093] Secondly, a HybridInjector module is designed to inject the multi-dimensional conditional features extracted by the three encoders into the pre-trained SD1.5 Unet, providing strong physical constraints for the denoising process of the diffusion model and solving the problem of low prediction accuracy and lack of physical priors when traditional diffusion models predict specular residuals. As shown in Figure 2, the HybridInjector module includes a Feature Linear Modulation (FiLM) submodule, a Hierarchical Integration (HIM) submodule, and a weight fusion layer, which are used to achieve efficient fusion of conditional features and SD1.5 Unet network features. This module integrates two complementary mechanisms, FiLM feature modulation and HIM (Hybrid Injection Module) cross-attention, and adopts differentiated injection strategies for the deep and shallow features of the SD1.5 Unet, taking into account both accurate localization of highlight regions and effective modulation of global features.
[0094] The FiLM submodule performs a linear transformation on the concatenated multi-scale conditional features to generate the scale parameter γ and the offset parameter β, and modulates the intermediate features of the SD1.5 Unet network using the formula F_FiLM = γ ⊙ F + β, where F is an intermediate feature of a certain layer in the SD1.5 Unet network and F_FiLM is the feature after FiLM modulation;
[0095] The HIM submodule employs a cross-attention mechanism to achieve interactive fusion of conditional features and Unet features. First, the multi-scale conditional features are mapped to a query vector Q, a key vector K, and a value vector V through a linear layer. Then, the Unet intermediate features are mapped to the target query vector Q'. The cross-attention output is calculated as Attention(Q', K, V) = Softmax(Q' K^T / sqrt(d_k)) V, after which the attention outputs are concatenated along the channel dimension (Concat) and multiplied by the output projection weight matrix W_O to give F_HIM. Here d_k is the dimension of the key vector, Attention is the attention mechanism, Softmax is the softmax normalization function, W_O is the output projection weight matrix, Concat represents the channel-dimension concatenation operation, and F_HIM is the output feature of the HIM submodule;
[0096] Weighted Fusion Layer: This layer fuses the outputs of the FiLM and HIM submodules using a learnable weight parameter w'. The weight parameter w' is adaptively learned by the network and takes values in [0,1]. The fusion formula is F_Hybrid = w' · F_FiLM + (1 - w') · F_HIM, where F_Hybrid is the output feature of the HybridInjector.
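The three submodules above can be sketched together on flattened feature tokens. This is an illustration under stated assumptions: FiLM is taken in its plain form (scale times feature plus offset), the attention is single-head scaled dot-product with queries from the Unet features and keys and values from the condition features, and the fusion is a convex combination with a weight in [0, 1]. All function names, shapes, and the random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def film(feat, gamma, beta):
    """Channel-wise FiLM on tokens: feat (N, C), gamma and beta (C,)."""
    return gamma * feat + beta

def cross_attention(feat, cond, w_q, w_k, w_v, w_o):
    """Queries from Unet tokens feat (N, C); keys/values from cond tokens (M, D)."""
    q, k, v = feat @ w_q, cond @ w_k, cond @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (N, M), rows sum to 1
    return (attn @ v) @ w_o                           # projected back to (N, C)

def hybrid_inject(feat, cond, gamma, beta, attn_weights, w_fuse):
    """Blend the FiLM branch and the cross-attention branch with w_fuse in [0, 1]."""
    f_film = film(feat, gamma, beta)
    f_him = cross_attention(feat, cond, *attn_weights)
    return w_fuse * f_film + (1.0 - w_fuse) * f_him

rng = np.random.default_rng(0)
feat, cond = rng.normal(size=(6, 8)), rng.normal(size=(3, 4))
attn_weights = (rng.normal(size=(8, 5)), rng.normal(size=(4, 5)),
                rng.normal(size=(4, 5)), rng.normal(size=(5, 8)))
out = hybrid_inject(feat, cond, np.ones(8), np.zeros(8), attn_weights, w_fuse=1.0)
```

With a unit scale, zero offset, and the fusion weight set to 1, the output equals the input features, which makes the role of each parameter easy to check in isolation.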
[0097] SD1.5 Unet, as the core feature extraction and denoising module of the diffusion model, primarily captures detailed image information (such as highlight boundaries and texture details) in its shallow features and globally relevant semantic information (such as the overall distribution of highlight regions) in its deep features. Based on this characteristic, the shallow layer injection (input convolutional layer conv_in, downsampling layer down0, downsampling layer down1) of the HybridInjector module adopts a fusion approach of "FiLM + HIM": the HIM module, through a cross-attention mechanism, accurately captures the precise spatial location of highlight and ghost regions, providing precise positional constraints for the denoising process; the FiLM module performs channel-wise linear modulation on the features, strengthening the fusion of polarization features and image features, enabling the network to better utilize polarization physical priors for highlight residual prediction. Deep layer (downsampling layer down2, intermediate layer mid, upsampling layer up0, upsampling layer up1, upsampling layer up2, upsampling layer up3) injection uses only the FiLM module for global feature modulation. This is mainly because the number of channels in deep feature maps is large (e.g., the number of channels in an H / 64 scale feature map is usually 1024). If the cross-attention calculation of the HIM module is used, it will lead to a sharp increase in computation and even memory overflow (OOM) problems, affecting the stable training of the network. The FiLM module can effectively control the amount of computation while ensuring the global feature modulation effect.
[0098] As shown in Figure 3, feature injection covers all nine key locations of the SD1.5 Unet: conv_in → down0 → down1 → down2 → mid → up0 → up1 → up2 → up3. Compared to the traditional ControlNet method of injecting features only in the encoder part, this achieves denser conditional feature guidance, enabling polarization physical priors to participate in the denoising process of the diffusion model throughout and significantly improving the prediction accuracy of the diffusion model for specular residuals. At the same time, the HybridInjector module fuses the output features of FiLM and HIM through the learnable weight w' (w' ∈ [0,1]) to form the final conditional enhancement features. The weight w' is adaptively learned by the network during training and requires no manual setting. It automatically balances the contributions of the FiLM and HIM mechanisms according to different image scenes, achieving optimal injection of polarization features and improving the model's adaptability.
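The shallow and deep injection strategy over the nine locations can be written down as a small lookup table. The layer names follow the text; the table structure and the function name are illustrative assumptions.

```python
# Shallow layers use both mechanisms; deep layers use FiLM only to limit
# the cost of cross-attention on wide feature maps.
SHALLOW_LAYERS = ("conv_in", "down0", "down1")
DEEP_LAYERS = ("down2", "mid", "up0", "up1", "up2", "up3")

def build_injection_plan():
    """Map each of the nine SD1.5 Unet locations to its injection mechanisms."""
    plan = {name: ("FiLM", "HIM") for name in SHALLOW_LAYERS}
    plan.update({name: ("FiLM",) for name in DEEP_LAYERS})
    return plan

plan = build_injection_plan()
for layer, mechanisms in plan.items():
    print(f"{layer}: {' + '.join(mechanisms)}")
```

Keeping the plan as data makes it straightforward to verify that every one of the nine locations receives at least FiLM modulation and that cross-attention is confined to the three shallow layers.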
[0099] Furthermore, a TimestepGate module is designed to dynamically adjust the injection intensity of multi-scale conditional features, addressing the significant differences in polarization guidance requirements at different time steps during diffusion denoising. This ensures appropriate polarization guidance is provided at each denoising stage, improving the accuracy and stability of specular residual prediction. The denoising process of the diffusion model is a gradual recovery from a noisy image to a clean image. Different time steps t correspond to different noise levels: in the early stage of diffusion (larger time step t, more noise in the image), the specular regions in the image are masked by noise, requiring strong polarization guidance to accurately locate the specular regions and avoid specular detection deviations caused by noise interference; in the late stage of diffusion (smaller time step t, the image is nearly clean), the specular regions in the image are basically revealed, requiring weak polarization guidance to avoid image artifacts caused by excessive constraints on polarization features, releasing the detail refining capabilities of SD1.5 Unet, and ensuring the accuracy of specular residual details.
[0100] The TimestepGate module includes a timestep encoding layer, a gate coefficient generation layer, and an intensity adjustment layer, used to dynamically adjust the conditional injection intensity for different diffusion time steps.
[0101] Timestep encoding layer: Sinusoidal encoding is applied to the diffusion time step t to convert the discrete time step into a continuous feature vector. The encoding formulas are PE(t, 2i) = sin(t / 10000^(2i/d)) and PE(t, 2i+1) = cos(t / 10000^(2i/d)), where d is the encoding feature dimension, with a value of 128, and i is the feature dimension index, with a value range of [0, ..., d/2 - 1];
[0102] Gating coefficient generation layer: The encoded time-step features are input into a two-layer fully connected network, and the gating coefficient g is generated by the Sigmoid activation function, as given by g = σ(W_2 · ReLU(W_1 · PE(t)) + b_2), where W_1 and W_2 are the fully connected layers, ReLU (Rectified Linear Unit) is the rectified linear unit, PE(t) is the sinusoidal time-step encoding, b_2 is the bias term, σ is the Sigmoid activation function, and the gating coefficient g takes values in [0,1].
[0103] Intensity adjustment layer: The gating coefficient g is applied to the scale parameter γ and the offset parameter β of the FiLM submodule and to the attention output of the HIM submodule, respectively, using the adjustment formulas γ' = g · γ, β' = g · β, and Attention' = g · Attention, where γ' is the adjusted scale parameter, β' is the adjusted offset parameter, and Attention' is the adjusted attention output. When the diffusion time step t = 999 (pure noise state), the gating coefficient g ≈ 1, achieving strong polarization guidance. When the diffusion time step t = 0 (close to the clean image state), the gating coefficient g ≈ 0, releasing the detail refining capability of the SD1.5 Unet.
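The TimestepGate computation can be sketched as below. The sinusoidal form with base 10000 is the encoding commonly used for diffusion time steps and is an assumption here, as are the hidden width, the function names, and the untrained placeholder weights; the behavior g ≈ 1 at t = 999 and g ≈ 0 at t = 0 is something the network would have to learn and is not reproduced by this sketch.

```python
import numpy as np

def sinusoidal_encoding(t, dim=128, base=10000.0):
    """Sine on even indices, cosine on odd indices, dim // 2 frequencies."""
    i = np.arange(dim // 2)
    freq = 1.0 / base ** (2 * i / dim)
    pe = np.empty(dim)
    pe[0::2] = np.sin(t * freq)
    pe[1::2] = np.cos(t * freq)
    return pe

def timestep_gate(t, w1, w2, b2):
    """g = sigmoid(W2 @ relu(W1 @ PE(t)) + b2), each entry in (0, 1)."""
    hidden = np.maximum(w1 @ sinusoidal_encoding(t, w1.shape[1]), 0.0)
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))

def apply_gate(g, gamma, beta, attention):
    """Scale the FiLM parameters and the attention output by the gate."""
    return g * gamma, g * beta, g * attention

rng = np.random.default_rng(0)
w1, w2, b2 = rng.normal(size=(16, 128)), rng.normal(size=(1, 16)), np.zeros(1)
g = timestep_gate(999, w1, w2, b2)
gamma_g, beta_g, attn_g = apply_gate(g, np.ones(4), np.full(4, 0.2), np.ones((3, 4)))
```

Because the gate multiplies all three conditioning signals by the same coefficient, a single learned scalar per time step is enough to move smoothly between strong and weak polarization guidance.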
[0104] Furthermore, a bimodal spatial mask is constructed to constrain the injection process of polarization features, preventing polarization features from disrupting the clean texture of the image background during injection. Precise modulation is applied only to reflective regions (highlights, ghosting, and pseudo-diffuse reflection regions). The bimodal spatial mask generation module includes a ghosting detection layer, a mask generation layer, and a mask modulation layer, used to constrain the injection region of the conditional features.
[0105] Ghost detection layer: Defines the ghost detection metric W_ssim = 1 - localSSIM(M, I_diff), where localSSIM is the local structural similarity index calculated over an 11×11 sliding window, with a value range of [0,1]. The closer W_ssim is to 1, the greater the structural difference in the region, and the more likely it is to be a ghost region.
[0106] Mask generation layer: Based on DoLP and W_ssim, a specular mask and a ghost mask are constructed separately and then merged, specifically the specular mask Mask_highlight = σ(k1 · (DoLP - th1)), the ghost mask Mask_ghost = σ(k2 · (W_ssim - th2)), and the merged mask Mask_merge = max(Mask_highlight, Mask_ghost), where k1 and k2 are slope parameters, both learnable, with an initial value of 5; th1 and th2 are threshold parameters, with th1 ranging from [0.4, 0.6] and th2 ranging from [0.3, 0.5]; σ is the Sigmoid activation function; and max represents the operation of taking the maximum value pixel by pixel.
[0107] Mask modulation layer: The merged mask Mask_merge is applied to the scale parameter γ and the offset parameter β of the FiLM submodule using the modulation formulas γ_mask = Mask_merge ⊙ γ and β_mask = Mask_merge ⊙ β. This ensures that multi-scale conditional features only affect highlight and ghosting regions, where γ_mask is the scale parameter after modulation and β_mask is the offset parameter after modulation.
[0108] The bimodal spatial mask generation module achieves precise positioning and constraint of the reflection area through three masks. Mask_highlight detects strong highlight regions based on the degree of linear polarization (DoLP), using the physical principle that specular reflection has strong polarization characteristics: the difference between DoLP and the threshold th1 is mapped to a mask value in the range [0,1] by the Sigmoid function, and the closer the value is to 1, the more likely the region is a strong highlight. Mask_ghost detects ghosting and weakly reflective areas based on the structural similarity difference W_ssim, compensating for the insensitivity of DoLP to weak polarization phenomena: by comparing the local structure of the specular image M with that of the pseudo-diffuse reflection image I_diff, it identifies deformation regions with a low polarization degree but abnormal structure. Mask_merge merges the first two masks by taking the maximum value pixel by pixel to form a complete reflective area mask, ensuring that both strong highlights and weak reflections are accurately identified. With the three masks working together, Mask_merge is applied to the scale parameter γ and the offset parameter β of the FiLM submodule so that the polarization condition features modulate only the reflective regions while leaving the clean texture regions of the background intact. This achieves precisely region-constrained injection and prevents polarization guidance from damaging the image quality of non-reflective regions.
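A minimal sketch of the mask construction follows. It assumes each mask is a Sigmoid of the slope-scaled difference between its cue and its threshold, that the merged mask is the pixel-wise maximum, and that the merged mask multiplies the FiLM parameters element by element. The slopes are fixed at their initial value of 5 and the thresholds are picked from the stated ranges; function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bimodal_mask(dolp, w_ssim, k1=5.0, k2=5.0, th1=0.5, th2=0.4):
    """Return (highlight mask, ghost mask, merged mask), all in (0, 1)."""
    mask_highlight = sigmoid(k1 * (np.asarray(dolp) - th1))
    mask_ghost = sigmoid(k2 * (np.asarray(w_ssim) - th2))
    return mask_highlight, mask_ghost, np.maximum(mask_highlight, mask_ghost)

def mask_film_params(mask, gamma, beta):
    """Restrict per-pixel FiLM parameters to the reflective region."""
    return mask * gamma, mask * beta

dolp = np.array([[1.0, 0.5], [0.0, 0.1]])      # strong highlight at top-left
w_ssim = np.array([[0.0, 0.4], [0.0, 0.9]])    # ghost-like pixel at bottom-right
mask_h, mask_g, mask_m = bimodal_mask(dolp, w_ssim)
gamma_m, beta_m = mask_film_params(mask_m, np.ones((2, 2)), np.full((2, 2), 0.3))
```

The bottom-right pixel has a low DoLP but a high structural difference, so only the ghost mask responds there; taking the maximum keeps it in the merged mask, which is the complementarity the paragraph relies on.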
[0109] Finally, the specular residual map is predicted and the model is trained. The training process and loss function design of the polarization-guided residual diffusion model are as follows:
[0110] Data preprocessing: The image M with highlights and the ground-truth image without highlights are normalized to the [-1,1] interval, and the ground-truth specular residual is calculated as the difference between the two;
[0111] Latent space mapping: The ground-truth specular residual is input into the pre-trained VAE encoder to obtain the latent representation z_0. The VAE encoder consists of 3 convolutional layers and 2 fully connected layers, with a convolutional kernel size of 4×4, a stride of 2, and an output latent feature dimension of 512.
[0112] Noise injection: Following the noise scheduling strategy of the diffusion process, Gaussian noise is added to the latent representation z_0 to obtain noisy latent representations z_t at different time steps. The noise injection formula is z_t = sqrt(ᾱ_t) · z_0 + sqrt(1 - ᾱ_t) · ε, where ε is standard Gaussian noise and ᾱ_t is the cumulative product coefficient of the diffusion process;
[0113] Denoising training: The noisy latent representation z_t, the features of the image M with highlights, and the polarization condition features are input into the SD1.5 Unet network, which serves as the denoising diffusion implicit model, and the network outputs the predicted noise. The polarization condition features specifically include:
[0114] (a) Original physical polarization features: DoLP (degree of linear polarization), AoLP (angle of polarization), I_diff (diffuse reflection intensity), w (weighting coefficient);
[0115] (b) Derived auxiliary features: W_pos (highlight position weight), W_ssim (ghost detection weight), F_wavelet (wavelet high-frequency features);
[0116] (c) Encoded multi-scale features: After the above features are processed by three encoders (HighlightPath, GhostPath, ColorPath), conditional features of four scales (H / 8, H / 16, H / 32, H / 64) are output. These features are injected into nine key positions of SD1.5 Unet through the HybridInjector module to provide physical constraints for diffusion denoising.
[0117] Loss function: A composite loss function combining mean squared error loss (MSE) and weighted structural similarity loss (WSSIM) is used, with the formula L = λ_1 · L_MSE + λ_2 · L_WSSIM, where the losses are evaluated on the predicted specular residual map; λ_1 and λ_2 are the balancing weights, with λ_1 = 1.0 and λ_2 = 0.3; L_MSE is the mean squared error loss; and L_WSSIM is the weighted structural similarity loss.
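The noise injection step of the training procedure above can be illustrated with the closed-form forward process commonly used for diffusion models, in which the noisy latent is a weighted sum of the clean latent and standard Gaussian noise. The linear beta schedule, its end points, and the function names are assumptions for illustration; the text only states that a noise schedule with cumulative product coefficients is followed.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Return the noisy latent at step t and the noise that was added."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((4, 8, 8))          # stand-in for the VAE latent
zt, eps = add_noise(z0, t=500, alpha_bar=alpha_bar, rng=rng)
```

During training the Unet would be asked to predict `eps` from `zt`, the time step, and the conditioning features; the returned noise is therefore the regression target for the noise-prediction part of the objective.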
[0118] Step (3) builds the dual-path conditional recovery network (second stage), as shown in Figure 4. Its core task is, based on the specular residual map predicted in the first stage, to use the dual-path conditional recovery network to finely calibrate and repair the pseudo-diffuse reflection region generated after the initial specular removal, ultimately outputting a high-quality specular-free image T. This stage employs a residual-aware learning paradigm, simplifying the learning objective to "correcting the residual error of the first stage" and significantly reducing the learning difficulty of the network. At the same time, it integrates polarization physics priors and a dual-path feature encoding structure to improve the calibration accuracy of the pseudo-diffuse reflection region. The specific implementation is as follows:
[0119] First, preliminary specular removal is performed by subtracting the specular residual map S predicted in the first stage from the image M with highlights, which yields a preliminary image without highlights. The calculation formula is T_pre = M - λ · S, where T_pre is the preliminary highlight-free image and λ is the scaling factor, ranging from [0.95, 1.05], used to fine-tune the contribution intensity of the specular residual. This pixel-level subtraction is simple, computationally efficient, and quickly eliminates most specular interference in the image. However, because the specular residual map predicted in the first stage contains slight errors and the transition between the specular and diffuse regions is not sufficiently smooth, the operation produces pseudo-diffuse regions in the image, mainly manifested as texture blurring, color shift, and boundary distortion. Nevertheless, the initial specular removal significantly reduces the difficulty of the subsequent restoration task, allowing the dual-path conditional recovery network to focus on the fine calibration of the pseudo-diffuse regions rather than learning the complex mapping of specular removal from scratch, thus improving the network's learning efficiency and restoration accuracy.
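The preliminary removal step amounts to a scaled pixel-wise subtraction, sketched below. The function name and the final clipping to the [-1, 1] range (the normalization interval used elsewhere in the description) are assumptions; the text specifies only the subtraction and the range of the scaling factor.

```python
import numpy as np

def preliminary_removal(m, s, scale=1.0, lo=-1.0, hi=1.0):
    """Subtract the scaled specular residual s from the highlighted image m."""
    if not 0.95 <= scale <= 1.05:
        raise ValueError("scale is expected to lie in [0.95, 1.05]")
    out = np.asarray(m, dtype=np.float64) - scale * np.asarray(s, dtype=np.float64)
    return np.clip(out, lo, hi)      # assumption: keep the normalized value range

m = np.array([0.9, 0.2])             # highlighted pixel, clean pixel
s = np.array([0.5, 0.0])             # predicted specular residual
t_pre = preliminary_removal(m, s)
```

Pixels where the predicted residual is zero pass through unchanged, so any remaining error is concentrated in and around the former highlight, which is where the second stage focuses its calibration.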
[0120] Secondly, a dual-path polarization coding structure combining the global path (Prior Path) and the spatial path (Spatial Path) is designed to address the problem that existing methods only use global prior injection and cannot capture the precise spatial location of specular highlights. This achieves the synergistic effect of global semantic constraints and spatial detail constraints. Furthermore, the two paths work together in each layer (BasicLayerWithSFT) of the wavelet-enhanced Restormer Transformer to ensure that the polarization prior guides the entire restoration process.
[0121] Global path: The input fused feature is [S, DoLP, I_diff, w], where S is the specular residual map predicted in the first stage; the fused feature has 12 channels. The PriorEncoder consists of 4 convolutional layers, batch normalization layers, ReLU activation functions, and a global average pooling layer. The convolutional kernel size is 3×3, the stride is 2, and the numbers of output channels are 64, 128, 256, and 512, respectively. After global average pooling, a global feature vector with a dimension of 512 is obtained, which is then mapped to the global prior features through two fully connected layers (B denotes the batch size in the feature shape);
[0122] Spatial path: The input is the specular residual map S. The SpatialConditionEncoder adopts a U-shaped structure, which includes 4 downsampling blocks and 4 upsampling blocks. The downsampling blocks use convolutional layers to reduce the feature dimensions and extract features, while the upsampling blocks use transposed convolutional layers to recover the feature scale. The numbers of output channels of the downsampling and upsampling blocks are 48, 96, 192, and 384, respectively. Finally, it outputs spatial feature maps at 4 scales, namely (H / 2, W / 2), (H / 4, W / 4), (H / 8, W / 8), and (H / 16, W / 16).
[0123] Feature injection order: Dual-path features are injected into the Restormer Transformer network in the order of "first SFT spatial modulation → then HIM global injection → finally Transformer block feature processing".
[0124] The Prior Path is primarily responsible for providing global semantic constraints, ensuring that the repair of pseudo-diffuse reflection regions conforms to the true reflection characteristics of the object's surface. Its input is the fused feature [S, DoLP, I_diff, w], 12 channels in total, comprising the specular residual map S predicted in the first stage (3 channels), the degree of linear polarization DoLP (1 channel), the diffuse reflection intensity I_diff (1 channel), the learnable weight coefficient w (1 channel), and supplementary channels (6 channels) obtained through channel duplication so that the number of channels in the fused feature matches the input requirements of the subsequent encoder. The fused feature is then encoded into global prior features by the PriorEncoder (in the feature shape, B is the batch size, 16 is the number of feature map channels, and 256 is the feature map size). The PriorEncoder consists of multi-scale convolutions and a channel attention module. The multi-scale convolutions use three convolutional blocks with kernel sizes of 3×3, 5×5, and 7×7 to extract global features at different scales. The channel attention module enhances the expression of important features and suppresses redundant information. The global prior features are injected through the HIM module into the Restormer Transformer image inpainting network in a cross-attention manner, providing global semantic constraints such as the highlight intensity distribution, the overall polarization pattern, and reflection type statistics for the inpainting process, and ensuring that the inpainting of pseudo-diffuse reflection areas conforms to the real reflection characteristics of the object surface.
[0125] The Spatial Path is primarily responsible for providing spatial detail constraints, enabling precise localization and calibration of pseudo-diffuse regions. Its input is the specular residual image S, which is processed by the SpatialConditionEncoder to extract spatial feature maps at four scales (48 / 96 / 192 / 384 dimensions). The SpatialConditionEncoder consists of four stacked convolutional blocks, each containing one convolutional layer, one batch normalization (BN) layer, and one ReLU activation function. All convolutional kernels are 3×3 with a stride of 2. Spatial features at different scales are extracted through progressive downsampling; the 48-dimensional feature map corresponds to low-scale detail features, while the 384-dimensional feature map corresponds to high-scale global spatial features. These spatial feature maps are injected pixel by pixel into the Restormer Transformer via the Spatial Feature Transform (SFT) module, preserving the precise location, boundary shape, and texture details of the specular region and thus achieving precise localization and calibration of the pseudo-diffuse region. The injection of dual-path features follows the order of "SFT spatial modulation → HIM global injection → Transformer block feature processing" to achieve progressive constraints from local to global, ensuring that local repair is consistent with the global scene.
[0126] A Spatial Feature Transform (SFT) layer is proposed to replace the traditional global FiLM modulation method. It generates pixel-by-pixel scaling and offset parameters to achieve accurate correction of highlight areas and pseudo-diffuse reflection areas, solving the problem that the traditional global modulation method cannot adapt to local feature differences and thus leads to low repair accuracy.
[0127] The SFT module includes a feature extraction layer, a parameter generation layer, and a feature modulation layer, used to generate pixel-by-pixel modulation parameters:
[0128] Feature extraction layer: Performs convolution operation on the spatial feature map output by the spatial path to extract deep spatial features. The convolution kernel size is 3×3, and the number of output channels is the same as the number of input channels.
[0129] Parameter generation layer: A 1×1 convolutional layer generates the pixel-wise scaling parameter γ(h,w) and offset parameter β(h,w) from the extracted spatial features: $[\gamma(h,w),\ \beta(h,w)] = \mathrm{Conv}_{1\times1}\big(\mathrm{Conv}_{3\times3}(F_{s})\big)(h,w)$, where $F_{s}$ is the output feature of the spatial path and (h,w) denotes the spatial coordinates of a pixel in the image.
[0130] Feature modulation layer: The intermediate features of the Restormer Transformer are corrected using pixel-by-pixel modulation: $\hat{F}(h,w) = \gamma(h,w) \odot F(h,w) + \beta(h,w)$, where $F$ is an intermediate feature of a given layer of the Restormer Transformer and $\hat{F}$ is the modulated feature.
[0131] To prevent the SFT module from destroying the feature representation ability of the pre-trained Restormer Transformer in the early stage of training, the scaling parameter γ of the SFT module is initialized with 1 and the offset parameter β is initialized with 0. This causes the SFT module to degenerate into an identity mapping in the early stage of training. The network only learns spatial modulation capability gradually during training, ensuring that the highlight area receives a strong correction signal, the non-highlight area retains the original texture, and the highlight boundary area achieves a smooth transition.
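The pixel-wise modulation and its identity initialization can be sketched as follows. This is a minimal NumPy illustration, not the implementation of the invention: the class name, the use of bias terms to realize γ = 1 and β = 0, and the omission of the 3×3 feature extraction layer are all assumptions made for brevity.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """Pointwise convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


class SFTLayer:
    """Hypothetical SFT layer producing per-pixel scale and offset maps."""

    def __init__(self, channels):
        # Zero weights with bias 1 (gamma) and bias 0 (beta) make the layer
        # an identity mapping before any training has taken place.
        self.w_gamma = np.zeros((channels, channels))
        self.b_gamma = np.ones(channels)
        self.w_beta = np.zeros((channels, channels))
        self.b_beta = np.zeros(channels)

    def __call__(self, feat, cond):
        gamma = conv1x1(cond, self.w_gamma, self.b_gamma)  # (C, H, W)
        beta = conv1x1(cond, self.w_beta, self.b_beta)     # (C, H, W)
        return gamma * feat + beta


rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8, 8))   # stand-in for a Transformer feature
cond = rng.normal(size=(4, 8, 8))   # stand-in for a spatial-path feature
layer = SFTLayer(4)
out = layer(feat, cond)
```

At initialization `out` equals `feat`, so the pre-trained backbone is undisturbed; as the weights move away from zero, γ and β begin to vary per pixel with the spatial-path feature.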
[0132] The Hierarchical Integration (HIM) module (with the same structure as the HIM sub-module in the first stage; the HIM in the first stage guides the diffusion model to predict specular residuals, while the HIM in the second stage guides the inpainting network to calibrate pseudo-diffuse reflections) includes a feature mapping layer, a cross-attention layer, and a feature fusion layer, used to inject global prior features into the Restormer Transformer network.
[0133] Feature mapping layer: the global prior features are mapped through linear layers to the query vector $Q_p$, the key vector $K_p$, and the value vector $V_p$; the intermediate features of the Restormer Transformer are mapped to the target query vector $Q_t$;
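[0134] Cross-attention layer: Calculates the cross-attention between the global prior features and the intermediate features of the Transformer, using the following formula: $\mathrm{Attention}(Q_t, K_p, V_p) = \mathrm{Softmax}\left(\frac{Q_t K_p^{\top}}{\sqrt{d_k}}\right) V_p$, where $d_k$ is the dimension of the key vector;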
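[0135] Feature fusion layer: The cross-attention output is concatenated with the intermediate features of the Transformer along the channel dimension and then fused through a 1×1 convolutional layer: $F_{HIM} = \mathrm{Conv}_{1\times1}\big(\mathrm{Concat}(\mathrm{Attention}(Q_t, K_p, V_p),\ F)\big)$, where $F$ is the intermediate feature of the Transformer;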
[0136] The HIM module performs feature injection in each SFT-enhanced BasicLayerWithSFT of the Restormer Transformer.
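The three HIM layers above can be sketched on flattened tokens as follows. This is an illustrative NumPy sketch under stated assumptions: features are treated as token matrices, the 1×1 convolution is written as an equivalent linear map on concatenated channels, and all function and weight names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def him_inject(feat, prior, w_q, w_k, w_v, w_fuse):
    """feat: (N, C) Transformer tokens; prior: (M, C) global prior tokens."""
    q = feat @ w_q                      # target queries from the Transformer
    k = prior @ w_k                     # keys from the global prior
    v = prior @ w_v                     # values from the global prior
    d_k = k.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d_k)) @ v            # (N, C)
    # Channel concatenation followed by a 1x1 convolution, which on
    # flattened tokens is a linear map from 2C to C channels.
    return np.concatenate([attn, feat], axis=-1) @ w_fuse


rng = np.random.default_rng(0)
C = 4
feat = rng.normal(size=(6, C))
prior = rng.normal(size=(1, C))         # a single global prior token
w_q, w_k, w_v = (rng.normal(size=(C, C)) for _ in range(3))
w_fuse = np.vstack([np.eye(C), np.zeros((C, C))])  # keeps only the attention part
out = him_inject(feat, prior, w_q, w_k, w_v, w_fuse)
```

With a single prior token every query attends to it with weight 1, so each output row equals the projected prior value; this makes the "global constraint" role of the path easy to see.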
[0137] A wavelet-enhanced Restormer Transformer is used as the core feature extraction and reconstruction module of the dual-path conditional recovery network. Its core unit is the wavelet-enhanced LeWin Block, which combines the advantages of the wavelet transform and the Transformer and addresses the excessive computation and blurred restoration results that arise when a conventional Transformer processes high-resolution images. The wavelet transform uses the Haar wavelet basis with three decomposition levels and decomposes image features into low-frequency and high-frequency components: the low-frequency components mainly contain the global structure and color information of the image, while the high-frequency components mainly contain its detail information (such as edges, textures, and highlight boundaries). Processing the low-frequency and high-frequency components separately reduces the computational cost of high-resolution image processing while preserving the high-frequency details of the image. The Transformer's self-attention mechanism captures long-range dependencies in the image, so that large pseudo-diffuse reflection areas can be repaired while the restoration result remains globally consistent. The initial highlight-free image is input into the U-shaped Restormer Transformer, and the dual-path polarization features are dynamically injected into each LeWin Block via the SFT and HIM modules. Global prior features extracted by the global path (the Prior Path) are injected through the HIM module using cross-attention, providing global semantic constraints such as specular intensity distribution and overall polarization patterns for the restoration process. Spatial feature maps extracted by the spatial path are injected through the SFT module using pixel-by-pixel modulation, providing spatial information such as the precise location of specular highlights, boundary shapes, and texture details. The two types of features act in each LeWin Block in the order "SFT spatial modulation → HIM global injection → Transformer block feature processing", imposing a progressive constraint from local to global. This ensures that the restoration of pseudo-diffuse reflection areas both meets local detail requirements and maintains global scene consistency, so that the polarization physical priors guide the entire restoration process.
[0138] The wavelet-enhanced Restormer Transformer network is based on a U-shaped structure and consists of 4 downsampling blocks, 4 upsampling blocks, and 1 bottleneck block. Each block uses the LeWin Block as its basic building block.
[0139] The LeWin Block comprises a wavelet decomposition layer, a multi-head self-attention layer, an inverse wavelet transform layer, and a residual connection. The process is as follows: ① the wavelet decomposition layer applies a two-dimensional discrete wavelet transform to the input features, obtaining the low-frequency component LL and the high-frequency components LH, HL, and HH; ② the multi-head self-attention layer performs multi-head self-attention on the low-frequency component LL, and the high-frequency components, after channel adjustment by a 1×1 convolutional layer, are concatenated with the attention output of LL; ③ the inverse wavelet transform layer applies a two-dimensional inverse discrete wavelet transform to the fused features to restore the feature scale; ④ the residual connection adds the input features to the inverse-transformed features.
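The decomposition, processing, and reconstruction flow of steps ① to ④ can be sketched with a single-level orthonormal Haar transform. This is a simplified NumPy illustration: the attention stage is replaced by a caller-supplied function, the high-frequency bands pass through unchanged instead of being adjusted by a 1×1 convolution and concatenated, and the LH/HL naming follows one common convention. All function names are hypothetical.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an (H, W) array (H, W even)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2    # low-frequency approximation
    lh = (a - b + c - d) / 2    # detail across columns
    hl = (a + b - c - d) / 2    # detail across rows
    hh = (a - b - c + d) / 2    # diagonal detail
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2; restores the original spatial size."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x


def lewin_block_sketch(x, attend):
    """Process LL with `attend`, reconstruct, then add the residual input."""
    ll, lh, hl, hh = haar_dwt2(x)
    ll = attend(ll)             # stand-in for multi-head self-attention on LL
    return x + haar_idwt2(ll, lh, hl, hh)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
y = lewin_block_sketch(x, lambda ll: ll)
```

Because attention runs only on LL, which has a quarter of the pixels of the input, the quadratic attention cost drops accordingly, while the untouched detail bands carry edges and textures through to the output.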
[0140] Overall network process: the initial highlight-free image Tinit (3 channels) is converted into initial features (48 channels) through a 3×3 convolutional layer. The feature scale is then gradually reduced and the number of channels increased through 4 downsampling blocks, deep features are extracted by the bottleneck block, and the feature scale is gradually restored through 4 upsampling blocks. Finally, the features (48 channels) are converted into the repair residual Δ (3 channels) through a 3×3 convolutional layer.
[0141] Final image generation: the repair residual Δ is weighted and added to the initial highlight-free image $T_{init}$ to yield the final highlight-free image T: $T = T_{init} + \lambda \cdot \Delta$, where $\lambda$ is the residual weighting coefficient, with a value range of [0.8, 1.2].
[0142] The loss function of the dual-path conditional recovery network is a composite loss that combines MSE loss, perceptual loss, and structural similarity loss: $L = \lambda_1 L_{MSE}(\hat{T}, T_{gt}) + \lambda_2 L_{perc}(\hat{T}, T_{gt}) + \lambda_3 L_{SSIM}(\hat{T}, T_{gt})$, where $\hat{T}$ is the predicted highlight-free image, $T_{gt}$ is the true highlight-free image, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are balancing weights set to 1.0, 0.5, and 0.2, respectively. The perceptual loss is the MSE between deep features extracted by a pre-trained VGG16 network, and the structural similarity loss is obtained from the structural similarity index between the predicted image and the real image.
[0143] To verify the effectiveness and superiority of the method of the present invention, a systematic comparative experiment was conducted to evaluate the performance of the method of the present invention from a quantitative perspective and to compare it with the current mainstream highlight removal methods. At the same time, the necessity of each core component was verified through ablation experiments. The specific implementation is as follows:
[0144] The experimental results are shown in Tables 1 and 2. The complete framework of this invention (Ours (Full)) is optimal in all metrics; the performance of the first-stage model is also better than most baselines, verifying the effectiveness of the polarization-guided residual diffusion module. Ablation experiments show that removing the WSSIM loss, wavelet transform, and HIM module in the first stage all result in varying degrees of performance degradation; removing any core component in the second stage also leads to performance fluctuations, with the degradation being more pronounced after removing the HIM module, verifying the necessity of each component.
[0145] Table 1 Results of the ablation experiment of the core components in the first stage
[0146]
[0147] Table 2 Results of the second-stage core component ablation experiment
[0148]
[0149] On the other hand, embodiments of the present invention also provide a polarization-guided specular removal system, including a processor and a memory, wherein the memory is used to store program instructions, and the processor is used to call the program instructions in the memory to execute the polarization-guided specular removal method as described in the above technical solution.
[0150] Although embodiments of the present invention have been described above in conjunction with the accompanying drawings, the present invention is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and instructive, and not restrictive. Those skilled in the art can make many other forms based on the guidance of this specification and without departing from the scope of protection of the claims of the present invention, and these forms all fall within the scope of protection of the present invention.
Claims
1. A method for removing specular highlights based on polarization guidance, characterized in that it includes the following steps: Step 1: Acquire an RGB image M with highlights and a polarization image P, and recover multi-dimensional physical polarization features from the polarization image P, including the degree of linear polarization DoLP, the angle of polarization AoLP, the diffuse reflection intensity, and the weighting coefficient w; Step 2: Construct a polarization-guided residual diffusion model, which includes a three-path feature-wise linear modulation encoder, a hybrid injection module, a time-step gating module, a dual-modal spatial mask generation module, and a pre-trained SD1.5 Unet network. The three-path feature-wise linear modulation encoder performs path-specific feature extraction on the multi-dimensional physical polarization features recovered from the polarization image P, obtaining multi-scale conditional features. The hybrid injection module injects these multi-scale conditional features into the SD1.5 Unet network, the time-step gating module dynamically adjusts the conditional injection intensity, and the dual-modal spatial mask generation module constrains the feature injection region. Finally, the model is combined with the highlighted RGB image M to predict the specular residual map; Step 3: Based on the specular residual map, an initial highlight-free image is obtained through calculation. A dual-path conditional recovery network is constructed, which includes a dual-path polarization coding structure, a spatial feature transformation module, a hierarchical integration module, and a wavelet-enhanced Restormer Transformer network. The initial highlight-free image is input into the Restormer Transformer network, and global prior features and spatial detail features are extracted through the dual-path polarization coding structure. The global prior features are injected into the Restormer Transformer network by the hierarchical integration module using cross-attention, while the spatial feature maps are injected into the Restormer Transformer network by the spatial feature transformation module using pixel-by-pixel modulation. This process refines and repairs the pseudo-diffuse reflection regions in the initial highlight-free image, resulting in a high-quality highlight-free image.
2. The polarization-guided specular removal method as described in claim 1, characterized in that: in step 1, the original polarization image P acquired by the DoFP (division-of-focal-plane) polarization camera contains light intensity information in four polarization directions: 0°, 45°, 90°, and 135°. The degree of linear polarization DoLP, the diffuse reflection intensity, and the weighting coefficient w are calculated from the light intensity values in these four polarization directions and from the maximum and minimum polarized light intensities, where k is the diffuse reflection compensation coefficient, with a value range of [0.1, 0.3]; α and β are weight adjustment parameters, with α in [5, 10] and β in [0.3, 0.5]; and DoLP lies in [0, 1]. The diffuse reflection intensity characterizes the diffuse component of the image and constrains the hue after specular removal; w is the polarization feature weighting coefficient.
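For orientation, DoLP and AoLP can be computed from the four directional intensities with the standard Stokes-parameter definitions, as in the sketch below. These are the textbook relations and are only assumed to correspond to the formulas of this claim; the diffuse reflection intensity and the weighting coefficient w are not reproduced here. The function name and the small `eps` guard are hypothetical.

```python
import numpy as np


def stokes_features(i0, i45, i90, i135, eps=1e-8):
    """Standard linear Stokes parameters from four polarizer orientations."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)   # total intensity
    s1 = i0 - i90                        # 0 deg vs 90 deg preference
    s2 = i45 - i135                      # 45 deg vs 135 deg preference
    dolp = np.clip(np.sqrt(s1 ** 2 + s2 ** 2) / (s0 + eps), 0.0, 1.0)
    aolp = 0.5 * np.arctan2(s2, s1)      # radians
    return dolp, aolp


# Fully linearly polarized light at 0 degrees (Malus's law intensities).
dolp_pol, aolp_pol = stokes_features(
    np.array([1.0]), np.array([0.5]), np.array([0.0]), np.array([0.5])
)
# Unpolarized light: equal intensity behind every polarizer.
u = np.array([0.5])
dolp_unpol, _ = stokes_features(u, u, u, u)
```

Specular reflections tend to be strongly polarized, so high DoLP values are what make the quantity useful as a highlight cue.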
3. The polarization-guided specular removal method as described in claim 1, characterized in that: in step 2, the three-path feature-wise linear modulation encoder divides the polarization features into three independent paths according to physical semantics. Each path consists of an input layer, multiple multi-scale coding blocks, and a feature normalization layer; the three paths do not interfere with each other, and each focuses on learning the features corresponding to its physical semantics. Highlight channel path: the input feature is [$W_{pos}$, DoLP, AoLP, w], where $W_{pos}$ is the specular position weight map constructed from DoLP as $W_{pos} = \mathrm{DoLP}^{\alpha}$, α is a learnable parameter, and AoLP is the angle of polarization; this path focuses on feature extraction and detection in strong highlight regions, and its multi-scale coding block consists of a 3×3 convolutional layer, a batch normalization layer, a ReLU activation function, and a max pooling layer. Ghost channel path: the input features are [DoLP, 1 − localSSIM(M, ·)] and [$W_{ssim}$, $F_{wavelet}$], where localSSIM is the local structural similarity index, $W_{ssim} = 1 - \mathrm{localSSIM}(M, \cdot)$ is the ghost detection weight map constructed from local structural similarity, and $F_{wavelet}$ is the high-frequency energy feature extracted by wavelet transform; this path focuses on feature extraction and detection of ghost and weak reflection regions, and its multi-scale coding block has the same structure and the same number of output channels as that of the highlight channel path, ensuring feature scale matching. Color channel path: the input feature is the diffuse reflection intensity, which provides color priors and tone constraints after specular removal; its multi-scale coding block adds an adaptive color adjustment layer after the convolutional layer, which optimizes the color expression of the diffuse component by learning color mapping parameters. The three paths each output features at multiple scales. When injected into the SD1.5 Unet network, the features of the three paths at corresponding scales are concatenated along the channel dimension to form the fused conditional features.
4. The polarization-guided specular removal method as described in claim 1, characterized in that: the hybrid injection module in step 2 includes a feature linear modulation submodule, a hierarchical integration submodule, and a weight fusion layer, used to achieve efficient fusion of the conditional features and the SD1.5 Unet network features. Feature linear modulation submodule: performs a linear transformation on the concatenated fused conditional features to generate the scale parameter γ and the offset parameter β, and modulates the intermediate features of the SD1.5 Unet network by $F_{FiLM} = \gamma \odot F + \beta$, where $F$ is an intermediate feature of a given layer of the SD1.5 Unet network and $F_{FiLM}$ is the feature after FiLM modulation. Hierarchical integration submodule: employs a cross-attention mechanism to achieve interactive fusion of the conditional features and the Unet features. First, the fused conditional features are mapped through linear layers to a query vector Q, a key vector K, and a value vector V, and the SD1.5 Unet intermediate features are mapped to the target query vector Q'. The cross-attention output is then calculated by $\mathrm{Attention}(Q', K, V) = \mathrm{Softmax}\left(\frac{Q' K^{\top}}{\sqrt{d_k}}\right) V$ and $F_{HIM} = \mathrm{Concat}(\mathrm{Attention}(Q', K, V),\ F)\, W_O$, where $d_k$ is the dimension of the key vector, Attention is the attention mechanism, Softmax is the soft-maximum normalization function, $W_O$ is the output projection weight matrix, Concat is the channel-dimension concatenation operation, and $F_{HIM}$ is the output feature of the hierarchical integration submodule. Weight fusion layer: fuses the outputs of the feature linear modulation submodule and the hierarchical integration submodule using the learnable weight parameter w', which is adaptively learned by the network and lies in [0, 1]; the fused output is a combination of $F_{FiLM}$ and $F_{HIM}$ weighted by w'.
5. The polarization-guided specular removal method as described in claim 4, characterized in that: the time-step gating module in step 2 includes a time-step encoding layer, a gating coefficient generation layer, and an intensity adjustment layer, used to dynamically adjust the conditional injection intensity for different diffusion time steps. Time-step encoding layer: performs sinusoidal encoding on the diffusion time step t, converting the discrete time step into a continuous feature vector. Gating coefficient generation layer: the encoded time-step features are input into a two-layer fully connected network, and the gating coefficient g is generated by the Sigmoid activation function: $g = \sigma\big(W_2 \cdot \mathrm{ReLU}(W_1 \cdot PE(t)) + b_2\big)$, where $W_1$ and $W_2$ are fully connected layers, ReLU is the rectified linear unit, PE(t) is the sinusoidal time-step encoding, $b_2$ is a bias term, σ is the Sigmoid activation function, and the gating coefficient g lies in [0, 1]. Intensity adjustment layer: the gating coefficient g is applied to the scale parameter γ and offset parameter β of the feature linear modulation submodule and to the attention output of the hierarchical integration submodule, respectively: $\gamma' = g \cdot \gamma$, $\beta' = g \cdot \beta$, and $\mathrm{Attention}' = g \cdot \mathrm{Attention}$, where Attention' is the adjusted attention output.
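The gating computation of this claim can be sketched as follows. This is a minimal NumPy illustration with assumed dimensions: the encoding width, the hidden width, the base frequency of 10000, and the function names are all hypothetical choices, and the weights are random placeholders standing in for trained parameters.

```python
import numpy as np


def sinusoidal_encoding(t, dim):
    """Sinusoidal encoding of a scalar time step (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def gate(t, w1, w2, b2):
    """Two-layer fully connected network with ReLU, then Sigmoid; returns a scalar."""
    hidden = np.maximum(w1 @ sinusoidal_encoding(t, w1.shape[1]), 0.0)
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))


rng = np.random.default_rng(0)
w1 = rng.normal(size=(16, 8))   # first layer: encoding dim 8 -> hidden 16
w2 = rng.normal(size=16)        # second layer: hidden 16 -> scalar
g = gate(500, w1, w2, 0.1)

# The gate scales the modulation parameters, e.g. gamma' = g * gamma.
gamma = np.ones((4, 8, 8))
gamma_gated = g * gamma
```

Because g depends only on t, the same scalar attenuates or strengthens the conditioning everywhere in the feature map at a given diffusion step.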
6. The polarization-guided specular removal method as described in claim 1, characterized in that: the dual-modal spatial mask generation module in step 2 includes a ghost detection layer, a mask generation layer, and a mask modulation layer, used to constrain the feature injection region. Ghost detection layer: defines a ghost detection metric based on localSSIM, where localSSIM is the local structural similarity index calculated over a sliding window, with values in [0, 1]. Mask generation layer: constructs the specular mask, the ghost mask, and the merged mask from DoLP and the ghost detection metric, where the slope parameters (including k1) and the parameters Th1 and Th2 are all learnable, σ is the Sigmoid activation function, and max denotes the pixel-by-pixel maximum. Mask modulation layer: applies the merged mask to the scaling parameter γ and the offset parameter β of the feature linear modulation submodule, so that the conditional features affect only the highlight and ghost areas.
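One plausible reading of the mask generation layer is a pair of soft thresholds merged by a pixel-wise maximum, sketched below. The sigmoid form `σ(k · (x − Th))`, the second slope `k2`, and the function names are assumptions introduced for illustration; the claim only states that the slopes and Th1, Th2 are learnable and that σ and max are used.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def merge_mask(dolp, ghost_metric, k1, th1, k2, th2):
    """Soft specular and ghost masks combined by a pixel-wise maximum."""
    m_spec = sigmoid(k1 * (dolp - th1))            # assumed form of the specular mask
    m_ghost = sigmoid(k2 * (ghost_metric - th2))   # assumed form of the ghost mask
    return np.maximum(m_spec, m_ghost)


# Three pixels: strong highlight, clean background, ghost region.
dolp = np.array([0.9, 0.1, 0.1])
ghost_metric = np.array([0.1, 0.1, 0.9])
mask = merge_mask(dolp, ghost_metric, k1=20.0, th1=0.5, k2=20.0, th2=0.5)

# Mask modulation: the conditional scale and offset act only where the mask is high.
gamma, beta = np.full(3, 1.5), np.full(3, 0.2)
gamma_masked, beta_masked = mask * gamma, mask * beta
```

In this sketch the background pixel receives almost no modulation, which matches the stated goal of confining the conditional features to highlight and ghost areas.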
7. The polarization-guided specular removal method as described in claim 1, characterized in that: the training process and loss function design of the polarization-guided residual diffusion model in step 2 are as follows. Data preprocessing: the image M with highlights and the true highlight-free image T are normalized to the interval [-1, 1], and the true highlight residual is calculated from them. Latent space mapping: the true highlight residual is input into the pre-trained VAE encoder to obtain the latent representation $z_0$. Noise injection: following the noise scheduling strategy of the diffusion process, Gaussian noise is added to the latent representation $z_0$ to obtain noisy latent representations at different time steps: $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$, where ε is standard Gaussian noise and $\bar{\alpha}_t$ is the cumulative product coefficient of the diffusion process. Denoising training: the noisy latent representation $z_t$, the features of the highlighted image M, and the polarization condition features are input into the SD1.5 Unet network, which outputs the predicted noise. Loss function: a composite loss combining MSE loss and WSSIM loss is used: $L = \lambda_1 L_{MSE} + \lambda_2 L_{WSSIM}$, where the losses are evaluated on the predicted specular residual map, $\lambda_1$ and $\lambda_2$ are balancing weights, $L_{MSE}$ is the mean square error loss, and $L_{WSSIM}$ is the weighted structural similarity loss.
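The noise injection step can be sketched as the standard forward diffusion process. The linear β schedule with 1000 steps, the latent shape, and the function name below are assumptions for illustration; the claim does not specify the schedule.

```python
import numpy as np


def add_noise(z0, t, alphas_cumprod, rng):
    """Forward diffusion: mix the clean latent with Gaussian noise at step t."""
    eps = rng.standard_normal(z0.shape)
    a_bar = alphas_cumprod[t]
    zt = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps
    return zt, eps


# Assumed linear beta schedule; the cumulative product of (1 - beta) gives a_bar.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))   # stand-in for the VAE latent of the residual
zt, eps = add_noise(z0, 500, alphas_cumprod, rng)
```

During training the network is asked to predict `eps` from `zt`, the time step, and the conditioning features; the clean latent is recoverable from `zt` whenever the noise is known.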
8. The polarization-guided specular removal method as described in claim 1, characterized in that: in step 3, the dual-path polarization coding structure of the dual-path conditional recovery network includes a global path and a spatial path, which operate in parallel. Global path: the input is the fused feature composed of the specular residual map, DoLP, the diffuse reflection intensity, and w, from which global prior features are obtained through a prior encoder. The prior encoder consists of multiple convolutional layers, batch normalization layers, ReLU activation functions, and a global average pooling layer. Global average pooling yields a global feature vector, which is then mapped to the global prior features through two fully connected layers. Spatial path: the input is the specular residual map, from which the spatial conditional encoder outputs multi-scale spatial feature maps. The spatial conditional encoder adopts a U-shaped structure with multiple downsampling blocks and multiple upsampling blocks. The downsampling blocks reduce feature dimensions and extract features through convolutional layers, while the upsampling blocks use transposed convolutional layers to restore the feature scale.
9. The polarization-guided specular removal method as described in claim 1, characterized in that: in step 3, the spatial feature transformation module includes a feature extraction layer, a parameter generation layer, and a feature modulation layer, used to generate pixel-by-pixel modulation parameters. Feature extraction layer: performs a convolution on the spatial feature map output by the spatial path to extract deep spatial features; the convolution kernel size is 3×3, and the number of output channels equals the number of input channels. Parameter generation layer: a 1×1 convolutional layer generates the pixel-wise scaling parameter γ(h,w) and offset parameter β(h,w) from the extracted spatial features: $[\gamma(h,w),\ \beta(h,w)] = \mathrm{Conv}_{1\times1}\big(\mathrm{Conv}_{3\times3}(F_{s})\big)(h,w)$, where $F_{s}$ is the output feature of the spatial path. Feature modulation layer: the intermediate features of the Restormer Transformer network are corrected using pixel-wise modulation: $\hat{F}(h,w) = \gamma(h,w) \odot F(h,w) + \beta(h,w)$, where $F$ is an intermediate feature of a given layer of the Restormer Transformer network. In step 3, the wavelet-enhanced Restormer Transformer network is based on a U-shaped structure containing k downsampling blocks, k upsampling blocks, and one bottleneck block. Each downsampling block, upsampling block, and bottleneck block contains multiple LeWin Blocks as basic building blocks. The LeWin Block consists of a wavelet decomposition layer, a multi-head self-attention layer, an inverse wavelet transform layer, and a residual connection. The process is as follows: the wavelet decomposition layer applies a two-dimensional discrete wavelet transform to the input features, obtaining the low-frequency component LL and the high-frequency components LH, HL, and HH; the multi-head self-attention layer performs multi-head self-attention on the low-frequency component LL, and the high-frequency components, after adjustment by a 1×1 convolutional layer, are concatenated with the attention output of LL; the inverse wavelet transform layer applies a two-dimensional inverse discrete wavelet transform to the fused features to restore the feature scale; the residual connection adds the input features to the inverse-transformed features. Overall workflow of the wavelet-enhanced Restormer Transformer network: after the initial highlight-free image is input, preliminary feature extraction is performed first; the feature scale is then gradually reduced and the number of channels increased through k downsampling blocks; deep features are extracted by the bottleneck block; the feature scale is gradually restored through k upsampling blocks; finally, the output layer outputs the repair residual Δ. Final image generation: the repair residual Δ is weighted and added to the initial highlight-free image $T_{init}$ to yield the final highlight-free image T: $T = T_{init} + \lambda \cdot \Delta$, where $\lambda$ is the residual weighting coefficient.
10. A polarization-guided specular removal system, characterized in that: It includes a processor and a memory, the memory being used to store program instructions, and the processor being used to call the program instructions in the memory to execute the polarization-guided specular removal method as described in any one of claims 1-9.