A Material-Aware Spatial Intelligent Reflection Generation Network Construction Method
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: BEIJING FEIDU TECH CO LTD
- Filing Date: 2026-03-09
- Publication Date: 2026-06-30
AI Technical Summary
Existing reflection generation technologies do not model the physical properties of materials, so reflection sharpness cannot be controlled continuously and the generated results lack physical consistency. They therefore struggle to meet the high consistency and controllability requirements of spatial intelligent systems.
A material-aware spatial intelligent reflection generation network is constructed. An explicit material-aware model and a frequency domain gating mechanism are combined with cross-scale feature fusion, and the network parameters are optimized so that reflection sharpness is continuously controllable and the generated results are physically consistent.
The method provides explicit material perception and physical interpretability of the reflection generation process as well as continuous controllability of reflection sharpness, and the generated results are applicable to downstream tasks of spatial intelligent systems, such as robot environmental perception and autonomous driving simulation.
Figure CN121809588B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a method for constructing a spatial intelligent reflection generation network based on material perception, belonging to the fields of spatial intelligence and computer vision technology.
Background Technology
[0002] Current research has made some progress in computer vision reflection modeling, image generation, and spatial scene reconstruction. However, existing technologies still have significant shortcomings in unified modeling of reflections of complex real materials, controllable generation of reflection sharpness, and deep coupling of physical properties with the generation process, making it difficult to support the high consistency and high generalization visual generation requirements of real-world intelligent systems for diverse reflection phenomena.
[0003] The reflection generation process lacks a material-aware modeling mechanism: existing reflection generation methods are mostly based on the ideal mirror assumption or fixed reflection model. They usually regard the reflective surface as a perfect mirror with high reflectivity and low roughness, and only focus on geometric symmetry or viewpoint mapping relationship, without explicitly modeling the physical properties of the surface material. Such methods have difficulty distinguishing different reflective media such as mirrors, metals, water surfaces, floor reflections and wet road surfaces, resulting in insufficient physical consistency of the generated results in real complex scenes.
[0004] The methods for controlling reflection sharpness are crude and lack continuous adjustability: Some existing technologies attempt to simulate blurred reflections through post-processing blurring, convolution kernel degradation, or random noise injection. However, these operations are usually applied after the result has been generated and cannot finely control the reflection details during the generation process. Furthermore, there is no clear mapping between the degree of blur and the physical properties of the real material, making it difficult to achieve continuous and interpretable control of reflection sharpness.
[0005] Lack of a closed-loop mechanism from physical property modeling to result verification: Current reflection generation methods usually focus on visual similarity indicators while ignoring the consistency verification between the generated results and the physical properties of the material. This results in some generated results being visually reasonable but lacking credibility at the physical level, making it difficult to directly apply to application scenarios with high realism requirements, such as robot perception, autonomous driving, or spatial intelligent simulation.
[0006] Current technologies lack a spatial intelligent reflection generation method that can explicitly introduce material physical properties at the generative model structure level and adjust reflection sharpness and detail distribution through controllable mechanisms. Especially in complex, multi-material, and multi-object real-world spatial scenes, existing solutions struggle to simultaneously meet the comprehensive requirements of physical consistency, generative controllability, and model generalization ability. Therefore, there is an urgent need to propose a material-aware spatial intelligent reflection generation network structure to achieve unified modeling and controllable generation of physical properties of different reflective materials, providing more realistic and reliable reflection perception and visual generation capabilities for spatial intelligent systems.
Summary of the Invention
[0007] To address the shortcomings of existing technologies, the present invention aims to provide a method for constructing a spatial intelligent reflection generation network based on material perception, which seeks to solve the problems of insufficient modeling of the physical properties of reflective surface materials, difficulty in continuously controlling the clarity of reflection, and lack of physical consistency in the generation results in existing reflection generation technologies.
[0008] To achieve the above objectives, the present invention provides the following technical solution: a method for constructing a material-aware spatial intelligent reflection generation network includes:
[0009] Obtain the original image set and preprocess the images in the original image set; perform material annotation on the preprocessed images to determine the material type corresponding to each region in the image, and extract low-level features and high-level semantic features related to the material in the image, and then construct an explicit material-aware model.
[0010] Select the input image, construct a spatial intelligent reflection generation network with an encoder-decoder structure, and introduce a material-aware frequency domain gating mechanism between the encoder and decoder. Based on the material probability distribution obtained from the explicit material-aware model, design a frequency domain gating function to weight different frequency components, perform inverse fast Fourier transform on the weighted frequency domain features, and then introduce a cross-scale feature fusion mechanism in the spatial intelligent reflection generation network to perform parallel convolution operations and stitching fusion on the feature map output by the encoder to obtain the expected reflection image of the input image.
[0011] The simulated reflection image of the input image is obtained from the original image set; the expected reflection image is compared with the simulated reflection image, and the geometric consistency and frequency domain consistency indices between them are calculated to obtain a comprehensive score; the model parameters in the spatial intelligent reflection generation network are optimized based on the comprehensive score.
[0012] Furthermore, the steps for constructing an explicit material-aware model are as follows:
[0013] The images in the original image set are preprocessed and labeled with materials to determine the material type corresponding to each region in the image and extract the material-related features in the image.
[0014] Treat each pixel as an independent node, and construct the edges of the probability graph based on the spatial relationships of the pixels in the image:
[0015] Randomly select a pixel and connect it to its four adjacent pixels (up, down, left, and right) to form an undirected edge of a four-neighborhood structure.
[0016] By introducing Conditional Random Fields (CRF), we use the low-level features corresponding to each pixel as observations and the material labels corresponding to each pixel as annotations. We then model the joint probability distribution among these features using CRF.
[0017] Define an energy function, which consists of two parts: a data term and a smoothing term.
[0018] Data term: Obtain the low-level features corresponding to each pixel, and match the extracted low-level features of each pixel with the feature template corresponding to the predefined material label:
[0019] Based on the existing material labels in the image, select standard images for each type of material and use these standard images as template images.
[0020] If the material label of pixel a is Material Label A, extract the low-level features of the template image corresponding to Material Label A as reference features;
[0021] The low-level features and reference features of pixel a are transformed into feature vectors. Then, the similarity between the low-level features and reference features of pixel a is calculated and used as the value of the data term for pixel a.
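The data term described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes Euclidean distance as the similarity measure (so a smaller value means a better match), and the function name `data_term`, the array shapes, and the two toy material templates are hypothetical.

```python
import numpy as np

def data_term(pixel_features, template_features):
    """Return an (H, W, K) array of data-term values.

    pixel_features:    (H, W, D) low-level feature vector per pixel.
    template_features: (K, D) reference feature vector per material label.
    """
    # Euclidean distance between every pixel feature and every template;
    # a smaller value means a better match with that material label.
    diff = pixel_features[:, :, None, :] - template_features[None, None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Toy example: 2x2 image, 3-D features, two hypothetical materials.
templates = np.array([[1.0, 0.0, 0.0],   # hypothetical "mirror" template
                      [0.0, 1.0, 0.0]])  # hypothetical "wet road" template
features = np.array([[[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
                     [[0.0, 1.0, 0.0], [0.1, 0.8, 0.0]]])
unary = data_term(features, templates)
best_label = unary.argmin(axis=-1)  # label with the lowest data-term value
```

Evaluating all K templates at once, rather than only the annotated label, is a design choice made here so that an optimizer can later compare candidate labels per pixel.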
[0022] Furthermore, the subsequent construction steps of the explicit material-aware model are as follows:
[0023] Smoothing term: The value of the smoothing term is determined based on the low-level features of any two adjacent pixels.
[0024] The low-level features of any two adjacent pixels are converted into feature vectors, and a smoothing term judgment threshold and two smoothing terms of different sizes are set as fixed values.
[0025] Calculate the low-level feature similarity between any two adjacent pixels using Euclidean distance;
[0026] If the low-level feature similarity between any two adjacent pixels is less than or equal to the smoothing term judgment threshold, then the smoothing term for those adjacent pixels takes a smaller fixed value; if the similarity is greater than the smoothing term judgment threshold, then the smoothing term for those adjacent pixels takes a larger fixed value.
[0027] Weights are assigned to the smoothing term, an optimization algorithm is selected to minimize the value of the energy function, and the energy function is iteratively updated to obtain the material probability distribution for each pixel.
[0028] Based on the selected optimization algorithm, the material labels of pixels are adjusted iteratively to gradually reduce the value of the energy function. In each iteration, the energy function is recalculated based on the low-level features of the current pixel and the material labels of its neighboring pixels, and the material label of the current pixel is updated until the energy function is minimized, thus obtaining the probability value of each pixel for different material labels and forming a material probability distribution. Based on this probability distribution, the most likely material label of each pixel is determined, realizing explicit perceptual modeling of materials in the image.
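A minimal sketch of the energy minimization follows, under several assumptions that the text does not spell out: the optimizer is iterated conditional modes, the fixed smoothing value is charged only when two neighbours carry different labels (a Potts-style reading), and probabilities are obtained as a softmax over the negative local energies. The function name `icm` and all default parameter values are hypothetical.

```python
import numpy as np

def icm(unary, features, weight=1.0, threshold=0.5, small=0.2, large=1.0, sweeps=5):
    """Iterated conditional modes on a four-neighbourhood pixel grid.

    unary:    (H, W, K) data-term values (lower = better match).
    features: (H, W, D) low-level features used by the smoothing term.
    Returns (labels, probs).
    """
    H, W, K = unary.shape
    labels = unary.argmin(axis=-1)          # start from the best data-term label
    local = unary.copy()                    # last local energies seen per pixel
    offsets = ((-1, 0), (1, 0), (0, -1), (0, 1))
    for _ in range(sweeps):
        changed = False
        for y in range(H):
            for x in range(W):
                energy = unary[y, x].copy()
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        dist = np.linalg.norm(features[y, x] - features[ny, nx])
                        # Threshold rule: distance <= threshold -> smaller value.
                        cost = small if dist <= threshold else large
                        # Assumed Potts-style use: every label that differs
                        # from the neighbour's current label pays the cost.
                        penalty = np.full(K, weight * cost)
                        penalty[labels[ny, nx]] = 0.0
                        energy += penalty
                local[y, x] = energy
                new_label = int(energy.argmin())
                if new_label != labels[y, x]:
                    labels[y, x] = new_label
                    changed = True
        if not changed:                     # converged: no label moved
            break
    # Assumed conversion of local energies to a probability distribution.
    z = np.exp(-(local - local.min(axis=-1, keepdims=True)))
    probs = z / z.sum(axis=-1, keepdims=True)
    return labels, probs
```

ICM only reaches a local minimum of the energy; it is used here because each update matches the per-pixel relabelling described in the paragraph above.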
[0029] Furthermore, the construction steps of the spatial intelligent reflection generation network are as follows:
[0030] The overall architecture of the Spatial Intelligent Reflection Generation Network is an encoder-decoder structure, and the Spatial Intelligent Reflection Generation Network is trained using the original image set.
[0031] The encoder adopts a U-Net-like structure, consisting of alternating layers of convolutional and pooling layers;
[0032] In each layer of the encoder, the convolutional layer uses a 3×3 convolutional kernel and the ReLU activation function to increase the non-linear expressive power of the network, enabling the network to parse the environmental information of the input image.
[0033] The pooling layer uses max pooling with a stride of 2 to reduce the spatial resolution of the feature map, enabling the network to capture more global information. Through multiple convolutional and pooling operations, the encoder gradually maps the features of the input image to a low-dimensional space and outputs a low-dimensional feature map.
[0034] The decoder gradually recovers the encoded low-dimensional features into a high-resolution reflection image;
[0035] The decoder consists of alternating deconvolutional or upsampling layers and convolutional layers; the decoder fuses features from corresponding layers of the encoder through skip connections, specifically:
[0036] In each layer of the decoder, the feature map from the same layer of the encoder is concatenated with the feature map of the current decoding layer, and then the feature is fused by convolution.
[0037] A material-aware frequency-domain gating mechanism is introduced between the encoder and the decoder. This design utilizes the energy distribution characteristics of different materials in the frequency domain to perform frequency-domain weighting on the feature map output by the encoder, so as to highlight the frequency information related to the material and suppress irrelevant noise.
[0038] Furthermore, the implementation steps of the frequency domain gating mechanism are as follows:
[0039] Step a1: Perform FFT transformation on the feature map output by the encoder;
[0040] The feature map output by the encoder from the input image is used as the original feature map; the size of the original feature map is H×W×C; where H is the height, W is the width, and C is the number of channels;
[0041] The FFT operation transforms the feature map of each channel from the spatial domain to the frequency domain, obtaining the complex matrix corresponding to each channel. Each complex matrix contains amplitude spectrum and phase spectrum information.
[0042] The amplitude spectrum represents the energy distribution of the feature map at different frequencies, while the phase spectrum records the phase information of each frequency component of the feature map.
[0043] Step a2: Design the gating function;
[0044] Based on the explicit material perception model, output the probability distribution of different material categories for each pixel in the input image;
[0045] The probability distribution of each pixel is represented as a vector, which represents the probability that each pixel is judged as a different material category; then, based on the material probability distribution, a gating function is designed to weight the frequency domain features to obtain the weighted amplitude spectrum.
[0046] Keeping the phase spectrum corresponding to each complex matrix unchanged, the weighted amplitude spectrum and phase spectrum corresponding to each complex matrix are transformed by inverse FFT to obtain the quasi-feature map;
[0047] The quasi-feature map is fused with the original feature map, and the fused feature map is used as the input feature of the decoder.
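Steps a1 and a2 can be sketched as below. The text does not give the form of the gating function, so this example assumes a Gaussian low-pass gate whose width shrinks as the expected material roughness grows; the per-material `roughness` vector, the reduction of the per-pixel distribution to one scalar, the blend weight `alpha`, and the function name are all hypothetical.

```python
import numpy as np

def material_frequency_gate(feature_map, material_probs, roughness, alpha=0.5):
    """feature_map: (H, W, C); material_probs: (H, W, K); roughness: (K,) in [0, 1]."""
    H, W, C = feature_map.shape
    # Expected roughness under the material distribution, averaged over
    # all pixels (a simplifying assumption for this sketch).
    r = float((material_probs @ roughness).mean())

    # Radial frequency grid in cycles per pixel.
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Assumed gate: Gaussian low-pass, narrower for rougher materials.
    sigma = 0.05 + 0.5 * (1.0 - r)
    gate = np.exp(-(radius ** 2) / (2.0 * sigma ** 2))

    # Step a1: per-channel FFT, split into amplitude and phase.
    spectrum = np.fft.fft2(feature_map, axes=(0, 1))
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)

    # Step a2: weight the amplitude only, keep the phase unchanged.
    weighted = amplitude * gate[:, :, None]
    quasi = np.fft.ifft2(weighted * np.exp(1j * phase), axes=(0, 1)).real

    # Fuse the quasi-feature map with the original feature map.
    return alpha * quasi + (1.0 - alpha) * feature_map
```

Because the gate equals 1 at zero frequency, the mean of each channel is preserved; only the finer detail is attenuated as roughness increases.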
[0048] Furthermore, the subsequent implementation steps of the frequency domain gating mechanism are as follows:
[0049] A cross-scale feature fusion mechanism is introduced into the spatial intelligent reflection generation network. This mechanism extracts feature maps from different levels of the encoder before the decoder works, and then fuses the feature maps from different levels of the encoder.
[0050] Feature maps are extracted at different levels of the encoder. For each level of feature map, parallel convolution operations are performed using convolution kernels of different scales to obtain feature information at different scales.
[0051] Based on the specific spatial resolution of the feature maps at each scale, feature alignment is performed on feature maps at different scales;
[0052] The feature maps of different scales after alignment are spliced and fused along the channel dimension to ensure that the features of different scales are semantically consistent, resulting in cross-scale joint features. A consistency constraint loss function is introduced during the splicing process. This loss function measures the similarity between features of different scales. By minimizing this loss function, the network is forced to keep the multi-scale features consistent in the semantic space before they enter the decoder.
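The alignment, splicing, and consistency constraint can be illustrated roughly as follows. The parallel multi-kernel convolutions are omitted, alignment is assumed to be nearest-neighbour upsampling to the finest resolution, and the consistency loss is assumed to be the mean squared deviation between channel-averaged maps of the different scales; the text specifies none of these choices, and all names are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of an (H, W, C) map by an integer factor."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def consistency_loss(aligned):
    """Assumed loss: spread of channel-averaged maps across scales."""
    summaries = np.stack([f.mean(axis=-1) for f in aligned])   # (n_scales, H, W)
    return float(((summaries - summaries.mean(axis=0)) ** 2).mean())

rng = np.random.default_rng(0)
# Hypothetical encoder outputs at three levels (resolution halves, channels double).
levels = [rng.standard_normal((16, 16, 8)),
          rng.standard_normal((8, 8, 16)),
          rng.standard_normal((4, 4, 32))]

target = levels[0].shape[0]
aligned = [upsample_nearest(f, target // f.shape[0]) for f in levels]
joint = np.concatenate(aligned, axis=-1)   # splice along the channel dimension
loss = consistency_loss(aligned)           # added to the training loss
```

In a trained network the loss would be minimized by gradient-based optimization together with the main objective; here it is only evaluated once to show its shape and inputs.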
[0053] Furthermore, the specific working steps of the decoder are as follows:
[0054] The decoder consists of alternating deconvolutional layers or upsampling layers and convolutional layers;
[0055] At each layer of the decoder, features from the corresponding layer of the encoder are fused via skip connections:
[0056] The feature map of the corresponding layer of the encoder is concatenated with the feature map of the current layer of the decoder after deconvolution or upsampling in the channel dimension.
[0057] The feature map size of the d-th layer of the encoder is $H_2 \times W_2 \times C_3$;
[0058] The feature map of the d-th layer of the decoder has the same spatial size, $H_2 \times W_2 \times C_4$;
[0059] After splicing, the feature map size becomes $H_2 \times W_2 \times (C_3 + C_4)$;
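The shape bookkeeping of the skip connection can be checked with a short sketch. The concrete sizes and the nearest-neighbour upsampling are assumptions chosen for illustration only.

```python
import numpy as np

H2, W2, C3, C4 = 8, 8, 16, 32                     # hypothetical sizes

encoder_feat = np.zeros((H2, W2, C3))             # d-th encoder layer output
decoder_coarse = np.ones((H2 // 2, W2 // 2, C4))  # decoder map before upsampling

# Upsample the decoder map so that it matches the encoder's spatial size.
decoder_feat = decoder_coarse.repeat(2, axis=0).repeat(2, axis=1)

# Skip connection: concatenate along the channel dimension.
fused = np.concatenate([encoder_feat, decoder_feat], axis=-1)
```

The spatial size stays at H2 × W2 while the channel count becomes C3 + C4, which is what the subsequent convolution receives.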
[0060] After each deconvolution or upsampling and feature fusion, convolutional layers are used to refine the features:
[0061] Initialize the kernel parameters corresponding to the convolutional layer, slide the kernel on the feature map input by the encoder with a fixed stride; at each position, perform element-wise multiplication between the feature map region covered by the kernel and the weight matrix of the kernel, then add the product results, add the bias term to obtain the convolution output value at each position, and then input the convolution output value at each position into the activation function for nonlinear transformation.
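The sliding-window computation in this paragraph can be written out explicitly. This is a single-channel sketch without padding, using ReLU as the activation; the function name and the toy input are hypothetical.

```python
import numpy as np

def conv2d_relu(x, kernel, bias=0.0, stride=1):
    """x: (H, W) single-channel map, kernel: (kH, kW). No padding."""
    kH, kW = kernel.shape
    out_h = (x.shape[0] - kH) // stride + 1
    out_w = (x.shape[1] - kW) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Region of the feature map currently covered by the kernel.
            patch = x[i * stride:i * stride + kH, j * stride:j * stride + kW]
            # Element-wise multiply, sum the products, add the bias.
            out[i, j] = np.sum(patch * kernel) + bias
    # Non-linear transformation of every output value.
    return np.maximum(out, 0.0)

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3))
y = conv2d_relu(x, k, bias=-50.0)   # -> [[0, 4], [31, 40]]
```

The top-left window sums to 45, so after the bias of -50 the ReLU clips it to 0; the other three windows stay positive.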
[0062] The feature maps that have undergone multiple deconvolution, upsampling, feature fusion and convolution operations are concatenated and weighted, and the decoder finally outputs the fused multi-channel feature map.
[0063] The decoder maps the multi-channel feature map into a reflection image through a 1×1 convolution operation;
[0064] Take the original input image, which has not passed through the encoder, as the target image; separate the non-reflective scene from the target image.
[0065] Furthermore, the steps for separating the non-reflective scene from the target image are as follows:
[0066] The gradient operator is used to calculate the gradient components of each pixel in the target image in the horizontal and vertical directions; based on the gradient components of each pixel in the horizontal and vertical directions, the gradient magnitude of each pixel is calculated to obtain the gradient map.
[0067] In the gradient map, regions with large gradient changes are temporarily designated as reflective regions, while regions with stable or small gradient changes are temporarily designated as non-reflective background regions.
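A minimal sketch of the gradient map and the provisional region split follows. It assumes the Sobel operator as the gradient operator and a fixed magnitude threshold; the text names neither, so both are illustrative choices, as are the function names.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Apply a 3x3 kernel with edge padding (correlation form; the sign
    difference from true convolution does not affect the magnitude)."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def gradient_map(img):
    gx = filter3x3(img, SOBEL_X)   # horizontal gradient component
    gy = filter3x3(img, SOBEL_Y)   # vertical gradient component
    return np.hypot(gx, gy)        # gradient magnitude per pixel

img = np.zeros((6, 6))
img[:, 3:] = 1.0                   # vertical step edge
g = gradient_map(img)

threshold = 1.0                    # hypothetical threshold
reflective_mask = g > threshold    # provisional reflective region
background_mask = ~reflective_mask # provisional non-reflective background
```

The two masks are only a starting point; the objective function constructed in the following steps refines the split.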
[0068] Construct the objective function using data fidelity terms and regularization terms;
[0069] Data fidelity term:
[0070] Calculate the residual between the target image $I$ and the sum of the reflective region $R$ and the non-reflective background region $S$, i.e., $I - R - S$; half the squared L2 norm of this residual, $\frac{1}{2}\lVert I - R - S \rVert_2^2$, is the data fidelity term.
[0071] Furthermore, the subsequent steps for separating the non-reflective scene from the target image are as follows:
[0072] The regularization terms are divided into regularization terms for reflective regions and regularization terms for non-reflective background regions;
[0073] Introduce regularization parameters $\lambda_R$ and $\lambda_S$ for the reflective region and the non-reflective background region, respectively;
[0074] Convolve the reflective region $R$ with the gradient operator to obtain its gradient $\nabla R$; compute the L1 norm of $\nabla R$ and multiply it by $\lambda_R$ to obtain the regularization term for the reflective region, $\lambda_R \lVert \nabla R \rVert_1$;
[0075] Convolve the non-reflective background region $S$ with the gradient operator to obtain its gradient $\nabla S$, then compute its second derivative $\nabla^2 S$; compute the squared L2 norm of $\nabla^2 S$ and multiply it by $\lambda_S$ to obtain the regularization term for the non-reflective background region, $\lambda_S \lVert \nabla^2 S \rVert_2^2$;
[0076] Using gradient descent as the optimization algorithm, the reflective region R and the non-reflective background region S are updated in the direction opposite to the gradient of the objective function, gradually reducing the value of the objective function until the maximum number of iterations is reached or the change in the objective function is less than $1 \times 10^{-5}$. This yields the non-reflective background region of the target image;
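The separation step can be sketched as a plain (sub)gradient descent on an objective made of a data fidelity term, an L1 gradient penalty on R, and a squared second-derivative penalty on S. Several details here are assumptions: finite differences with circular boundaries stand in for the gradient operator, the second derivative is taken as a 5-point Laplacian, R and S start from zero rather than from the gradient-map split, and the parameter values and function names are hypothetical.

```python
import numpy as np

def dx(a):  return np.roll(a, -1, axis=1) - a   # forward difference (circular)
def dy(a):  return np.roll(a, -1, axis=0) - a
def dxT(a): return np.roll(a, 1, axis=1) - a    # adjoints of dx and dy
def dyT(a): return np.roll(a, 1, axis=0) - a

def lap(a):
    """5-point Laplacian with circular boundaries (self-adjoint)."""
    return (np.roll(a, 1, 0) + np.roll(a, -1, 0) +
            np.roll(a, 1, 1) + np.roll(a, -1, 1) - 4.0 * a)

def objective(I, R, S, lam_r, lam_s):
    fidelity = 0.5 * np.sum((I - R - S) ** 2)
    reg_r = lam_r * (np.abs(dx(R)).sum() + np.abs(dy(R)).sum())  # L1 of gradient
    reg_s = lam_s * np.sum(lap(S) ** 2)                          # squared L2
    return fidelity + reg_r + reg_s

def separate(I, lam_r=0.05, lam_s=0.1, step=0.05, max_iter=500, tol=1e-5):
    R = np.zeros_like(I)
    S = np.zeros_like(I)
    prev = objective(I, R, S, lam_r, lam_s)
    for _ in range(max_iter):
        resid = I - R - S
        # Subgradient of the L1 term uses sign(); the other terms are smooth.
        grad_r = -resid + lam_r * (dxT(np.sign(dx(R))) + dyT(np.sign(dy(R))))
        grad_s = -resid + 2.0 * lam_s * lap(lap(S))
        R = R - step * grad_r
        S = S - step * grad_s
        cur = objective(I, R, S, lam_r, lam_s)
        if abs(prev - cur) < tol:        # stop when the objective stalls
            break
        prev = cur
    return R, S

rng = np.random.default_rng(0)
I = rng.random((16, 16))                 # toy target image
R, S = separate(I)
```

Because the L1 term is not differentiable, the objective may not decrease at every single step; a proximal or splitting method would be a more robust choice in practice.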
[0077] The non-reflective background region is filtered, and its contrast is enhanced using histogram equalization. The result is then fused with the reflection image output from the decoder to obtain the expected reflection image of the input image.
[0078] Furthermore, the optimization steps for the model parameters are as follows:
[0079] Perform geometric consistency verification on the expected reflection image:
[0080] Acquire the simulated reflection image of the input image from the original image set; obtain geometric information related to spatial location and depth from the simulated reflection image; compare the geometric information in the expected reflection image with that in the simulated reflection image; check whether the position of the reflecting object in the expected reflection image conforms to the geometric laws of light reflection and whether the depth relationship corresponds; and score the expected reflection image based on the results of the geometric consistency verification, using the score as the Class A indicator.
[0081] Next, frequency domain consistency verification is performed: frequency domain analysis is conducted on the expected reflection image and the simulated reflection image to extract the spectral features of both, and the similarity index between their spectra is calculated as the Class B indicator;
[0082] The weights of the Class A and Class B indicators are allocated proportionally, and the comprehensive score of the expected reflection image is calculated.
[0083] The comprehensive score is compared with a preset pass threshold. If the comprehensive score is greater than or equal to the pass threshold, the reflection image passes. If the comprehensive score is less than the pass threshold, the reflection image fails: the similarity index between the simulated reflection image and the expected reflection image is calculated, the difference in the similarity index is added as an additional loss term to the loss function of the spatial intelligent reflection generation network, and the network is retrained until the reflection image it generates meets the requirements.
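The scoring logic can be sketched as follows. The geometric (Class A) score is taken as a given input because its computation is not specified; the spectral similarity is assumed to be the cosine similarity of log-amplitude spectra, the extra loss is assumed to be one minus that similarity, and the weights, threshold, and function names are hypothetical.

```python
import numpy as np

def spectral_similarity(a, b):
    """Assumed Class B index: cosine similarity of log-amplitude spectra."""
    sa = np.log1p(np.abs(np.fft.fft2(a))).ravel()
    sb = np.log1p(np.abs(np.fft.fft2(b))).ravel()
    return float(sa @ sb / (np.linalg.norm(sa) * np.linalg.norm(sb) + 1e-12))

def comprehensive_score(geom_score, freq_score, w_geom=0.5, w_freq=0.5):
    """Weighted combination of the Class A and Class B indicators."""
    return w_geom * geom_score + w_freq * freq_score

def evaluate(expected, simulated, geom_score, pass_threshold=0.8):
    freq = spectral_similarity(expected, simulated)
    score = comprehensive_score(geom_score, freq)
    passed = score >= pass_threshold
    # On failure, the similarity shortfall becomes an additional loss term
    # (assumed form: 1 - similarity).
    extra_loss = 0.0 if passed else 1.0 - freq
    return score, passed, extra_loss
```

In a training loop, `extra_loss` would be added to the network's loss before retraining; that loop is outside the scope of this sketch.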
[0084] Compared with the prior art, the beneficial effects of the present invention are:
[0085] 1. The reflection generation process possesses explicit material perception and physical interpretability: This invention explicitly introduces a material property map into the reflection generation network structure, directly using physical properties such as the roughness of the reflective surface as control variables in the generation process. This makes the changes in the reflection appearance no longer implicit in the model weights, but participate in the generation process as explicit and continuous parameters. Compared with existing end-to-end reflection generation models, this invention achieves an explicit expression of the reflection generation mechanism, and the generation results have clear physical meaning and interpretability, effectively avoiding the uncontrollable risks brought about by "black box" generation.
[0086] 2. Achieving Continuous and Controllable Generation of Reflection Sharpness: This invention introduces a material-aware frequency-domain gating mechanism into the feature transfer path of the diffusion model, allowing reflection sharpness to continuously vary with surface roughness. When the material roughness is low, the model retains more high-frequency reflection details, generating sharp and clear specular reflections; as the material roughness increases, high-frequency information is suppressed step by step, and the reflection naturally transitions to a blurred form. Compared to existing methods that rely on discrete material categories or post-processing blurring, this invention enables refined and controllable adjustment of the degree of reflection blur, significantly improving the flexibility and realism of reflection generation.
[0087] 3. The generated results are more suitable for downstream tasks of spatial intelligent systems: Since the reflection results generated by this invention maintain a high degree of consistency in geometric structure, material properties and frequency domain characteristics, they can be used directly as reliable visual input in spatial intelligent systems for tasks such as robot environmental perception, autonomous driving simulation, digital twin modeling and augmented reality rendering; compared with reflection generation methods that only pursue visual similarity, this invention has clear advantages in engineering usability and system compatibility.
Attached Figure Description
[0088] Other features, objects, and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
[0089] Figure 1 is a schematic diagram of the method of the present invention;
[0090] Figure 2 is a schematic diagram of the reflection generation network structure of the present invention;
[0091] Figure 3 is a schematic diagram of step S3 of the present invention.
Detailed Implementation
[0092] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0093] Referring to Figure 1 and Figure 2, the method for constructing a material-aware spatial intelligent reflection generation network includes:
[0094] Step S1: Obtain the original image set and preprocess the images in the original image set; perform material annotation on the preprocessed images, determine the material type corresponding to each region in the image, extract low-level features and high-level semantic features related to the material in the image, and then construct an explicit material-aware model.
[0095] The specific steps of step S1 (data acquisition and processing) are as follows:
[0096] Collect scene images under different lighting conditions, viewpoints, and material types as the original image set; and preprocess the images in the original image set, including operations such as enhancement, filtering, cropping, and normalization, to unify the size and pixel value range of all images.
[0097] Material annotation is performed on the preprocessed image to determine the material type corresponding to each region in the image, and low-level features and high-level semantic features related to the material are extracted from the image:
[0098] Low-level features are extracted using color, texture, and shape detection algorithms (such as color histogram, local binary pattern (LBP), and Canny operator), while high-level semantic features are extracted using deep learning models (such as VGG and ResNet).
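Two of the low-level features named above, the color histogram and the local binary pattern, can be sketched in plain NumPy. This is a basic 8-neighbour LBP without rotation invariance or uniform-pattern mapping; the bin count and the function names are hypothetical choices.

```python
import numpy as np

def color_histogram(rgb, bins=8):
    """rgb: (H, W, 3) uint8 image. Returns a normalised vector of 3 * bins values."""
    hists = [np.histogram(rgb[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def lbp(gray):
    """Basic 8-neighbour LBP code for the interior pixels of an (H, W) array."""
    H, W = gray.shape
    centre = gray[1:-1, 1:-1]
    code = np.zeros(centre.shape, dtype=np.uint8)
    # Neighbours visited clockwise, starting at the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        # Set this bit where the neighbour is at least as bright as the centre.
        code += (neighbour >= centre).astype(np.uint8) * np.uint8(1 << bit)
    return code
```

A per-pixel feature vector could then be formed by, for example, histogramming the LBP codes in a small window around each pixel and concatenating the result with local color statistics.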
[0099] Constructing an explicit material-aware model:
[0100] Model Description: This model employs a probabilistic graphical model (such as a Conditional Random Field (CRF)), treating pixels in the image as nodes and spatial relationships and material features between pixels as edges. A suitable energy function is defined to describe the interrelationships and spatial constraints between materials. The energy function includes a data term and a smoothing term. The data term measures the degree of matching between pixel features and corresponding material labels, while the smoothing term encourages adjacent pixels to have similar material labels. By minimizing the energy function, the material probability distribution of each pixel is obtained.
[0101] The steps for constructing an explicit material-aware model based on Conditional Random Fields (CRF) are as follows:
[0102] Treating each pixel in the preprocessed image as an independent node, the edges of the probability graph are constructed based on the spatial relationships of the pixels in the image:
[0103] In the preprocessed image, a pixel is randomly selected and connected to its four adjacent pixels (top, bottom, left, and right) to form an undirected edge with a four-neighborhood structure. If the diagonal of the pixel is considered, the pixel is also connected to its corresponding top-left, bottom-left, top-right, and bottom-right pixels to form an undirected edge with an eight-neighborhood structure.
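The four- and eight-neighbourhood graph structures can be enumerated with a short sketch. Pixels are indexed as `y * width + x`, and each undirected edge is listed once; the function name is hypothetical.

```python
def grid_edges(height, width, eight_connected=False):
    """Return undirected edges between pixel indices (y * width + x)."""
    # Only "forward" offsets are used so that each edge appears exactly once.
    offsets = [(0, 1), (1, 0)]                 # right, down
    if eight_connected:
        offsets += [(1, 1), (1, -1)]           # down-right, down-left
    edges = []
    for y in range(height):
        for x in range(width):
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    edges.append((y * width + x, ny * width + nx))
    return edges

edges4 = grid_edges(3, 3)                          # 12 edges on a 3x3 grid
edges8 = grid_edges(3, 3, eight_connected=True)    # 20 edges on a 3x3 grid
```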
[0104] By introducing Conditional Random Fields (CRF), we use the low-level features corresponding to each pixel as observations and the material labels corresponding to each pixel as annotations. We then model the joint probability distribution among these features using CRF.
[0105] Define an energy function, which consists of two parts: a data term and a smoothing term.
[0106] Data item: Obtain the low-level features corresponding to each pixel, and match the extracted low-level features of each pixel with the feature template corresponding to the predefined material label:
[0107] Based on the existing material labels in the image, select standard images for each type of material and use these standard images as template images.
[0108] If the material label of pixel a is: Material Label A; extract the low-level features of the template image corresponding to Material Label A as reference features;
[0109] The low-level features and reference features of pixel a are transformed into feature vectors. Then, the similarity between the low-level features and reference features of pixel a is calculated (the similarity can be represented by Euclidean distance, cosine similarity, etc.) and used as the value of the data term for pixel a.
[0110] The smaller the value of the data term, the higher the degree of matching between the pixel feature and the corresponding material label;
[0111] Smoothing term: The value of the smoothing term is determined based on the low-level features of any two adjacent pixels.
[0112] The low-level features of any two adjacent pixels are converted into feature vectors, and a smoothing term judgment threshold and two smoothing terms of different sizes are set as fixed values.
[0113] Calculate the low-level feature similarity between any two adjacent pixels using Euclidean distance;
[0114] If the low-level feature similarity between any two adjacent pixels is less than or equal to the smoothing term judgment threshold, then the smoothing term for those adjacent pixels takes a smaller fixed value; if the similarity is greater than the smoothing term judgment threshold, then the smoothing term for those adjacent pixels takes a larger fixed value.
[0115] If the low-level feature similarity of each adjacent pixel in the image (i.e., the preprocessed image) is discrete, then the value of the smoothing term corresponding to each adjacent pixel is recalculated using the Sigmoid function.
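One possible reading of the Sigmoid recalculation is a smooth interpolation between the two fixed values, centred on the judgment threshold. The text does not give the exact form, so the formula, the `steepness` parameter, and the function name below are assumptions.

```python
import numpy as np

def smoothing_term_sigmoid(distance, threshold, small, large, steepness=10.0):
    """Smoothly interpolate between the small and large fixed values."""
    # Sigmoid centred on the threshold: about 0 well below it, about 1 well above.
    s = 1.0 / (1.0 + np.exp(-steepness * (distance - threshold)))
    return small + (large - small) * s

at_threshold = smoothing_term_sigmoid(0.5, 0.5, small=0.2, large=1.0)  # midpoint, 0.6
```

This replaces the hard step of the threshold rule with a continuous transition, so nearly identical distances no longer produce very different smoothing values.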
[0116] Weights are assigned to the smoothing term, an optimization algorithm is selected to minimize the value of the energy function, and the energy function is iteratively updated (common optimization algorithms include gradient descent and the iterated conditional modes (ICM) algorithm):
[0117] Obtain the material probability distribution for each pixel;
[0118] Based on the selected optimization algorithm, the material labels of pixels are adjusted iteratively to gradually reduce the value of the energy function. In each iteration, the energy function is recalculated based on the low-level features of the current pixel and the material labels of its neighboring pixels, and the material label of the current pixel is updated until the energy function is minimized, thus obtaining the probability value of each pixel for different material labels and forming a material probability distribution. Based on this probability distribution, the most likely material label of each pixel is determined, realizing explicit perceptual modeling of materials in the image.
[0119] Step S2: Select an input image (i.e., choose any image from the original image set as the input image), construct a spatial intelligent reflection generation network using an encoder-decoder structure, and introduce a material-aware frequency domain gating mechanism between the encoder and decoder. Based on the material probability distribution obtained from the explicit material-aware model, design a frequency domain gating function to weight different frequency components, perform inverse fast Fourier transform (IFFT) on the weighted frequency domain features, and then introduce a cross-scale feature fusion mechanism in the spatial intelligent reflection generation network to perform parallel convolution operations and splicing fusion on the feature map output by the encoder to obtain the desired reflection image of the input image.
[0120] The specific steps of step S2 (constructing the spatial intelligent reflection generation network) are as follows:
[0121] The overall architecture of the Spatial Intelligent Reflection Generation Network is an encoder-decoder structure, and the Spatial Intelligent Reflection Generation Network is trained using the original image set.
[0122] The encoder progressively compresses and abstracts the visual features of the input to extract key information;
[0123] The encoder adopts a U-Net-like structure, consisting of alternating layers of convolutional and pooling layers;
[0124] In each layer of the encoder, the convolutional layer uses a 3×3 convolutional kernel and the ReLU activation function to increase the depth and non-linear expressive power of the spatial intelligent reflection generation network (hereinafter "the network"), enabling the network to parse the environmental information of the input image.
[0125] The pooling layer uses max pooling with a stride of 2 to reduce the spatial resolution of the feature map, enabling the network to capture more global information. Through multiple convolutional and pooling operations, the encoder gradually maps the features of the input image to a low-dimensional space and outputs a low-dimensional feature map.
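The effect of max pooling with a stride of 2 on the spatial resolution can be seen in a short NumPy sketch. A 2×2 pooling window and even height and width are assumptions made here for brevity; the text fixes only the stride.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) feature map (H, W even)."""
    H, W, C = x.shape
    # Split each spatial axis into (blocks, 2) and take the max inside a block.
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))
```

Each application halves the height and width, so repeated encoder stages map the input to a progressively lower-resolution feature map.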
[0126] The decoder gradually recovers the encoded low-dimensional features into a high-resolution reflection image;
[0127] The decoder consists of alternating deconvolutional layers (transposed convolutions) or upsampling layers and convolutional layers. To preserve details lost during encoding, the decoder fuses features from corresponding encoder layers via skip connections. The specific process is as follows:
[0128] In each layer of the decoder, feature maps from the same (or similar) layers of the encoder are concatenated with the feature map of the current decoding layer, and then feature fusion is performed through convolution.
[0129] A material-aware frequency-domain gating mechanism is introduced between the encoder and decoder. This design utilizes the energy distribution characteristics of different materials in the frequency domain to perform frequency-domain weighting on the feature map output by the encoder, so as to highlight the frequency information related to the material and suppress irrelevant noise. The specific implementation steps are as follows:
[0130] Step a1: Perform a Fast Fourier Transform (FFT) on the feature map output by the encoder.
[0131] The feature map output by the encoder from the input image is used as the original feature map; the size of the original feature map is H×W×C; where H is the height, W is the width, and C is the number of channels;
[0132] The FFT operation transforms the feature map of each channel from the spatial domain to the frequency domain, obtaining the complex matrix corresponding to each channel. Each complex matrix contains amplitude spectrum and phase spectrum information.
[0133] The amplitude spectrum represents the energy distribution of the feature map at different frequencies, while the phase spectrum records the phase information of each frequency component of the feature map.
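Step a1 can be sketched with NumPy's FFT routines: a per-channel 2-D FFT of an H×W×C feature map, split into amplitude and phase, with the inverse shown for completeness. The function names are illustrative.

```python
import numpy as np

def to_frequency_domain(feat):
    """Per-channel 2-D FFT of an (H, W, C) map -> (amplitude, phase)."""
    spec = np.fft.fft2(feat, axes=(0, 1))   # one complex matrix per channel
    return np.abs(spec), np.angle(spec)

def to_spatial_domain(amplitude, phase):
    """Recombine amplitude and phase, then apply the inverse FFT."""
    spec = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spec, axes=(0, 1)).real
```

Recombining an unmodified amplitude and phase reproduces the original feature map up to floating-point error, which is why the gating step can modify the amplitude alone.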
[0134] Step a2: Design the gating function;
[0135] Based on the explicit material perception model, output the probability distribution of different material categories for each pixel in the input image;
[0136] The probability distribution of each pixel is represented as a vector, which represents the probability that each pixel is judged as a different material category; then, based on the material probability distribution, a gating function is designed to weight the frequency domain features to obtain the weighted amplitude spectrum.
[0137] The weighted formula is:
[0138] ;
[0139] where σ denotes the Sigmoid function, and $W_{stu}$ and $b_{stu}$ denote the learnable parameter matrix and bias vector, respectively;
[0140] Keeping the phase spectrum corresponding to each complex matrix unchanged, the weighted amplitude spectrum and phase spectrum corresponding to each complex matrix are transformed by inverse FFT (IFFT) to obtain the quasi-feature map;
[0141] The quasi-feature map is fused with the original feature map, and the fused feature map is used as the input feature of the decoder.
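Because the weighting formula is not reproduced in the text, the following is only a hedged sketch of one way such a gate could work, not the patented formula. Assumptions: the per-pixel material probabilities are averaged into one global vector, the gate assigns one Sigmoid weight per radial frequency band (the band count is set by the hypothetical parameter shapes `W_g` and `b_g`), and fusion is a residual addition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def radial_band_index(H, W, n_bands):
    """Assign every FFT bin to one of n_bands rings by its radial frequency."""
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    idx = np.floor(radius / (radius.max() + 1e-12) * n_bands).astype(int)
    return np.minimum(idx, n_bands - 1)

def gated_fusion(feat, material_probs, W_g, b_g):
    """feat: (H, W, C); material_probs: (H, W, K); W_g: (B, K); b_g: (B,)."""
    H, W, _ = feat.shape
    p = material_probs.mean(axis=(0, 1))           # assumed global descriptor
    gate = sigmoid(W_g @ p + b_g)                  # one weight per band, in (0, 1)
    gate_map = gate[radial_band_index(H, W, len(b_g))]   # (H, W)
    spec = np.fft.fft2(feat, axes=(0, 1))
    amp, phase = np.abs(spec), np.angle(spec)
    weighted_amp = amp * gate_map[..., None]       # phase is left unchanged
    quasi = np.fft.ifft2(weighted_amp * np.exp(1j * phase), axes=(0, 1)).real
    return feat + quasi                            # assumed residual fusion
```

With every band weight near 1 the quasi-feature map equals the original and the output is twice the input; with every weight near 0 the original feature map passes through unchanged.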
[0142] A cross-scale feature fusion mechanism is introduced into the spatial intelligent reflection generation network. This mechanism extracts feature maps from different levels of the encoder before the decoder works, and then fuses the feature maps from different levels of the encoder.
[0143] Feature maps are extracted at different levels of the encoder. These feature maps have different spatial resolutions and semantic levels. High-frequency / fine-grained features mainly capture the details and edge information of the image, such as the outline and texture of objects. Low-frequency / coarse-grained features reflect the overall structure and semantic information of the image, such as the category and location of objects.
[0144] For each level of feature map, parallel convolution operations are performed using convolution kernels of different scales (common convolution kernel scales include 3×3, 5×5, etc.) to obtain feature information at different scales: small 3×3 convolution kernels are used to extract local information of each level of feature map, and large 5×5 convolution kernels are used to extract contextual information of each level of feature map.
[0145] Based on the specific spatial resolution of the feature maps at each scale, feature maps at different scales are aligned to ensure they have the same size:
[0146] For feature maps with high spatial resolution, downsampling (such as convolution with a stride of 2 or max pooling) is performed to reduce their resolution; for feature maps with low spatial resolution, upsampling (such as bilinear interpolation or transposed convolution) is performed to increase their resolution.
[0147] The feature maps of different scales after alignment are concatenated and fused along the channel dimension to ensure semantic consistency between features of different scales, resulting in cross-scale joint features. A consistency constraint loss function is introduced during the concatenation process. This loss function measures the similarity between features of different scales. By minimizing this loss function, the network is forced to keep the features of multiple scales consistent in the semantic space (the consistency loss function can be cosine similarity or L2 distance).
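The alignment, concatenation and consistency constraint can be sketched as follows. Assumptions in this sketch: three scales that differ by factors of 2 and share the same channel count, max pooling for downsampling, nearest-neighbour repetition as a simple stand-in for bilinear interpolation, and a cosine-based consistency loss averaged over all pairs of scales.

```python
import numpy as np

def downsample2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) map."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour 2x upsampling (stand-in for bilinear interpolation)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def cosine_consistency_loss(a, b):
    """1 - cosine similarity between two aligned feature maps."""
    a, b = a.ravel(), b.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 1.0 - cos

def fuse_scales(fine, mid, coarse):
    """fine: (2H, 2W, C), mid: (H, W, C), coarse: (H/2, W/2, C)."""
    aligned = [downsample2(fine), mid, upsample2(coarse)]
    joint = np.concatenate(aligned, axis=-1)       # (H, W, 3C) joint features
    pairs = [(0, 1), (0, 2), (1, 2)]
    loss = np.mean([cosine_consistency_loss(aligned[i], aligned[j])
                    for i, j in pairs])
    return joint, loss
```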
[0148] Define the decoder's processing flow:
[0149] The decoder consists of alternating deconvolutional layers (transposed convolutions) or upsampling layers and convolutional layers;
[0150] At each layer of the decoder, features from the corresponding layer of the encoder are fused through skip connections:
[0151] The feature map of the corresponding layer of the encoder is concatenated with the feature map of the current layer of the decoder after deconvolution or upsampling in the channel dimension.
[0152] The feature map size of the encoder's d-th layer is: H2×W2×C3;
[0153] The feature map size of the d-th layer of the decoder is also H2×W2×C4;
[0154] After splicing, the feature map size becomes: H2×W2×(C3+C4);
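The channel-dimension splicing can be checked with a small NumPy sketch; the concrete sizes used in the usage line are arbitrary illustrative values.

```python
import numpy as np

def skip_concat(enc_feat, dec_feat):
    """Concatenate encoder and decoder maps of equal height and width."""
    if enc_feat.shape[:2] != dec_feat.shape[:2]:
        raise ValueError("spatial sizes must match before concatenation")
    return np.concatenate([enc_feat, dec_feat], axis=-1)

# Example: 32 x 32 maps with 64 and 128 channels give 32 x 32 x 192.
fused = skip_concat(np.zeros((32, 32, 64)), np.zeros((32, 32, 128)))
```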
[0155] After each deconvolution or upsampling and feature fusion, convolutional layers are used to refine the features:
[0156] Initialize the kernel parameters corresponding to the convolutional layer, slide the kernel on the feature map input by the encoder with a fixed stride; at each position, perform element-wise multiplication between the feature map region covered by the kernel and the weight matrix of the kernel, then add the product results, add the bias term to obtain the convolution output value at each position, and then input the convolution output value at each position into the activation function for nonlinear transformation.
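The sliding-window computation described above can be written out directly. This sketch assumes no padding ("valid" convolution), ReLU as the activation function, and kernels supplied by the caller instead of being initialized internally.

```python
import numpy as np

def conv2d_relu(x, kernels, bias, stride=1):
    """x: (H, W, Cin); kernels: (kh, kw, Cin, Cout); bias: (Cout,)."""
    H, W, _ = x.shape
    kh, kw, _, c_out = kernels.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.empty((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw, :]
            # Element-wise multiply the patch with each kernel, sum, add bias.
            out[i, j] = np.tensordot(patch, kernels,
                                     axes=([0, 1, 2], [0, 1, 2])) + bias
    return np.maximum(out, 0.0)                    # ReLU non-linearity
```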
[0157] If the convolutional layer extracts insufficient feature information, the number of channels is increased by applying multiple different convolution kernels to the feature map input by the encoder; if the feature map output by the convolutional layer must have the same number of channels as the reflection image to be generated, a 1×1 convolution is used to reduce the number of channels.
[0158] The feature maps that have undergone multiple deconvolution, upsampling, feature fusion and convolution operations are concatenated and weighted, and the decoder finally outputs the fused multi-channel feature map.
[0159] The decoder maps the multi-channel feature map into a reflection image through a 1×1 convolution operation;
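A 1×1 convolution is a per-pixel linear map across channels, which the following sketch makes explicit. The weight and bias are hypothetical parameters; mapping to three output channels in the usage line is an assumption (an RGB reflection image).

```python
import numpy as np

def conv1x1(x, weight, bias):
    """x: (H, W, Cin); weight: (Cin, Cout); bias: (Cout,) -> (H, W, Cout)."""
    # The same Cin -> Cout linear map is applied independently at every pixel.
    return x @ weight + bias

# Example: reduce a 16-channel feature map to a 3-channel image.
image = conv1x1(np.ones((8, 8, 16)), np.full((16, 3), 0.5), np.zeros(3))
```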
[0160] Take the input image that has not passed through the encoder as the target image, and separate the non-reflective scene from the target image:
[0161] Gradient operators (such as Sobel operator, Prewitt operator, etc.) are used to calculate the gradient components of each pixel in the target image in the horizontal and vertical directions; based on the gradient components of each pixel in the horizontal and vertical directions, the gradient magnitude of each pixel is calculated to obtain a gradient map (the gradient map reflects the degree of drastic change of pixel values in the image).
[0162] In the gradient map, regions with large gradient changes are temporarily designated as reflective regions, while regions with stable or small gradient changes are temporarily designated as non-reflective background regions.
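The gradient map and the tentative region split can be sketched with the Sobel operator. Edge-replicating padding and a single scalar `threshold` are assumptions of this sketch; the text does not state how "large" and "small" gradient changes are distinguished.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Correlate a 2-D image with a 3x3 kernel using edge padding."""
    H, W = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((H, W))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + H, dx:dx + W]
    return out

def gradient_map(img):
    """Gradient magnitude from horizontal and vertical Sobel responses."""
    return np.hypot(filter3x3(img, SOBEL_X), filter3x3(img, SOBEL_Y))

def tentative_reflection_mask(img, threshold):
    """True where the gradient is large (tentatively reflective)."""
    return gradient_map(img) > threshold
```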
[0163] Construct the objective function using data fidelity terms and regularization terms;
[0164] Data fidelity term:
[0165] Calculate the difference between the target image I and the sum of the reflective region R and the non-reflective background region S to obtain $I - R - S$; divide the square of its L2 norm by 2 to obtain the data fidelity term $\tfrac{1}{2}\lVert I - R - S \rVert_2^2$;
[0166] The regularization terms are divided into a regularization term for the reflective region and a regularization term for the non-reflective background region;
[0167] Introduce regularization parameters $\lambda_R$ and $\lambda_S$ for the reflective region and the non-reflective background region, respectively;
[0168] The gradient $\nabla R$ of the reflective region R is calculated by a convolution with the gradient operator; its L1 norm multiplied by $\lambda_R$ yields the regularization term for the reflective region, $\lambda_R \lVert \nabla R \rVert_1$;
[0169] The gradient operator is convolved with the non-reflective background region S to calculate its gradient $\nabla S$, from which the second derivative $\nabla^2 S$ is calculated; the square of the L2 norm of the second derivative multiplied by $\lambda_S$ yields the regularization term for the non-reflective background region, $\lambda_S \lVert \nabla^2 S \rVert_2^2$;
[0170] Using gradient descent as the optimization algorithm, the reflective region R and the non-reflective background region S are updated in the direction opposite to the gradient of the objective function, gradually reducing its value until the maximum number of iterations is reached or the change in the objective function is less than $1\times e^{-5}$; this yields the non-reflective background region of the target image;
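The objective (data fidelity plus an L1 gradient penalty on R and a squared L2 second-derivative penalty on S) and its gradient-descent minimization can be sketched as below. Several choices here are assumptions, not taken from the text: forward differences with periodic boundaries in place of a Sobel-type operator, the discrete Laplacian as the second derivative, a subgradient for the L1 term, and the values of `lam_r`, `lam_s`, `step`, `max_iters` and `tol`.

```python
import numpy as np

def dx(u):   return np.roll(u, -1, axis=1) - u     # forward difference (x)
def dy(u):   return np.roll(u, -1, axis=0) - u     # forward difference (y)
def dx_t(v): return np.roll(v, 1, axis=1) - v      # adjoint of dx
def dy_t(v): return np.roll(v, 1, axis=0) - v      # adjoint of dy

def lap(u):
    """Discrete Laplacian with periodic boundaries (self-adjoint)."""
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0)
            + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)

def objective(I, R, S, lam_r, lam_s):
    fidelity = 0.5 * np.sum((I - R - S) ** 2)
    reg_r = lam_r * (np.abs(dx(R)).sum() + np.abs(dy(R)).sum())
    reg_s = lam_s * np.sum(lap(S) ** 2)
    return fidelity + reg_r + reg_s

def separate(I, lam_r=0.1, lam_s=0.05, step=0.05, max_iters=500, tol=1e-5):
    """Split I into reflective part R and smooth background S."""
    R, S = np.zeros_like(I), I.copy()
    prev = objective(I, R, S, lam_r, lam_s)
    for _ in range(max_iters):
        resid = I - R - S
        # Subgradient of the L1 term and gradient of the squared Laplacian term.
        grad_r = -resid + lam_r * (dx_t(np.sign(dx(R))) + dy_t(np.sign(dy(R))))
        grad_s = -resid + 2.0 * lam_s * lap(lap(S))
        R -= step * grad_r
        S -= step * grad_s
        cur = objective(I, R, S, lam_r, lam_s)
        if abs(prev - cur) < tol:                  # change below tolerance
            break
        prev = cur
    return R, S
```

The L1 penalty favours a reflective layer with sparse, sharp edges, while the second-derivative penalty favours a smoothly varying background, which matches the tentative split made from the gradient map.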
[0171] The non-reflective background area is filtered (e.g., Gaussian filtering, median filtering, etc.), and the contrast is enhanced by histogram equalization. Then, it is fused with the reflective image output by the decoder to obtain the desired reflective image of the input image.
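The contrast enhancement and fusion step can be sketched for 8-bit grayscale data. Histogram equalization follows the usual cumulative-histogram mapping; the weighted blend and its hypothetical weight `alpha` are assumptions, since the text does not say how the two images are fused, and the filtering step is omitted.

```python
import numpy as np

def equalize_hist(img_u8):
    """Histogram equalization of a uint8 grayscale image."""
    hist = np.bincount(img_u8.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                      # first non-empty bin
    denom = max(int(cdf[-1] - cdf_min), 1)         # avoid division by zero
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255)
    return lut.astype(np.uint8)[img_u8]

def blend(background_u8, reflection_u8, alpha=0.5):
    """Assumed fusion: weighted average of background and reflection image."""
    mixed = (1.0 - alpha) * background_u8 + alpha * reflection_u8
    return np.clip(np.round(mixed), 0, 255).astype(np.uint8)
```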
[0172] Step S3: Obtain the simulated reflection image of the input image from the original image set; compare the expected reflection image with the simulated reflection image, calculate the geometric consistency and frequency domain consistency indices between them to obtain a comprehensive score; optimize the model parameters in the "Spatial Intelligent Reflection Generation Network" based on the comprehensive score;
[0173] Referring to Figure 3, the specific steps of step S3 are as follows:
[0174] Perform geometric consistency verification (position / depth) on the desired reflection image:
[0175] Obtain simulated reflection images of the input images from the original image set; extract geometric information related to spatial location and depth from the simulated reflection images, which may include the geometric structure of the scene, the position coordinates of objects, depth maps, etc.
[0176] Compare the geometric information in the "expected reflection image" (such as the position and spatial distribution of the reflecting object) with the geometric information in the simulated reflection image: check whether the position of the reflecting object in the "expected reflection image" conforms to the geometric law of light reflection and whether the depth relationship corresponds; score the results of the geometric consistency verification of the expected reflection image as a Class A indicator;
[0177] Next, frequency domain consistency verification (matching the spectrum to the material) is performed: frequency domain analysis is performed on the expected reflection image and the simulated reflection image to extract the spectral features of both, and the similarity index (such as the correlation coefficient) between the two spectra is calculated as a Class B indicator;
[0178] Weights are allocated proportionally to the Class A and Class B indicators, and the comprehensive score of the expected reflection image is calculated.
[0179] The overall score is compared with a preset pass / fail threshold. If the overall score is greater than or equal to the threshold, the reflection image is deemed to pass; if it is less than the threshold, the reflection image is deemed to fail. The similarity index (such as the structural similarity index (SSIM) or peak signal-to-noise ratio (PSNR)) between the simulated reflection image and the expected reflection image is calculated, the difference in the similarity index is added as an additional loss term to the loss function of the spatial intelligent reflection generation network, and the network is retrained. During optimization, an adaptive learning rate adjustment strategy dynamically adjusts the learning rate according to changes in the error, improving the quality of the reflection images generated by the network until they meet the accuracy requirements of the user or relevant technicians.
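The Class B indicator, the weighted comprehensive score and the threshold check can be sketched as follows. The log-amplitude spectrum, the equal default weights and the function names are illustrative assumptions; the Class A (geometric) score is taken as a given number because its computation is not specified in detail.

```python
import numpy as np

def spectral_correlation(img_a, img_b):
    """Correlation coefficient between the log-amplitude spectra of two images."""
    sa = np.log1p(np.abs(np.fft.fft2(img_a))).ravel()
    sb = np.log1p(np.abs(np.fft.fft2(img_b))).ravel()
    return float(np.corrcoef(sa, sb)[0, 1])

def comprehensive_score(class_a, class_b, w_a=0.5, w_b=0.5):
    """Weighted combination of the Class A and Class B indicators."""
    return w_a * class_a + w_b * class_b

def passes(score, threshold):
    """Pass when the score is greater than or equal to the threshold."""
    return score >= threshold
```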
[0180] The above formulas are all dimensionless calculations. The formulas are obtained through software simulation on a large amount of collected data so as to best approximate real-world conditions. The preset parameters in the formulas, such as weighting coefficients and proportional coefficients, are set by those skilled in the art according to the actual situation. These values serve to quantify each parameter for subsequent comparison; they need only preserve the proportional relationship between the parameters and the quantified values.
[0181] Finally, it should be noted that the above-described embodiments are merely specific implementations of the present invention, used to illustrate the technical solutions of the present invention, and not to limit it. The scope of protection of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still modify or easily conceive of changes to the technical solutions described in the foregoing embodiments within the technical scope disclosed in the present invention, or make equivalent substitutions for some of the technical features; and these modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A method for constructing a spatial intelligent reflection generation network based on material perception, characterized in that the method comprises: Obtain the original image set and preprocess the images in the original image set; Material annotation is performed on the preprocessed image to determine the material type corresponding to each region in the image, and low-level features and high-level semantic features related to the material are extracted from the image to construct an explicit material-aware model. The steps for constructing an explicit material-aware model are as follows: The images in the original image set are preprocessed and labeled with materials to determine the material type corresponding to each region in the image and extract the material-related features in the image. Treat each pixel as an independent node, and construct the edges of the probability graph based on the spatial relationships of the pixels in the image: Randomly select a pixel and connect it to its four adjacent pixels (up, down, left, and right) to form an undirected edge of a four-neighborhood structure. By introducing a conditional random field, the low-level features corresponding to each pixel are used as observations, and the material labels corresponding to each pixel are used as annotations: Define the energy function based on the data terms and the smoothing term: Data item: Obtain the low-level features corresponding to each pixel, and match the extracted low-level features of each pixel with the feature template: Based on the existing material labels in the image, select standard images of various materials as template images; If the material label of pixel a is: Material Label A; extract the low-level features of the template image corresponding to Material Label A as reference features; The low-level features and reference features of pixel a are transformed into feature vectors. 
Then, the similarity between the low-level features and reference features of pixel a is calculated and used as the value of the corresponding data item of pixel a. Smoothing term: The value of the smoothing term is determined based on the low-level features of any two adjacent pixels. The low-level features of any two adjacent pixels are converted into feature vectors, and a smoothing term judgment threshold and two smoothing terms of different sizes are set as fixed values. Calculate the low-level feature similarity between any two adjacent pixels; If the low-level feature similarity between any two adjacent pixels is less than or equal to the smoothing term judgment threshold, then the smoothing term for those adjacent pixels takes a smaller fixed value; if the similarity is greater than the smoothing term judgment threshold, then the smoothing term for those adjacent pixels takes a larger fixed value. Weights are assigned to the smoothing term, an optimization algorithm is selected to minimize the value of the energy function, and the energy function is iteratively updated to obtain the material probability distribution for each pixel. Based on the selected optimization algorithm, the material labels of pixels are adjusted iteratively to gradually reduce the value of the energy function. In each iteration, the energy function is recalculated based on the low-level features of the current pixel and the material labels of its neighboring pixels, and the material label of the current pixel is updated until the energy function is minimized, thus obtaining the probability value of each pixel for different material labels and forming a material probability distribution. Based on this probability distribution, the most likely material label of each pixel is determined, realizing explicit perceptual modeling of materials in the image. 
Select the input image, construct a spatial intelligent reflection generation network with an encoder-decoder structure, and introduce a material-aware frequency domain gating mechanism between the encoder and decoder. Based on the material probability distribution obtained from the explicit material-aware model, design a frequency domain gating function to weight different frequency components, perform inverse fast Fourier transform on the weighted frequency domain features, and then introduce a cross-scale feature fusion mechanism in the spatial intelligent reflection generation network to perform parallel convolution operations and stitching fusion on the feature map output by the encoder to obtain the expected reflection image of the input image. Simulated reflection images of the input images are obtained from the original image set; the desired reflection image is compared with the simulated reflection image, and the geometric consistency and frequency domain consistency indices between them are calculated to obtain a comprehensive score; the model parameters in the spatial intelligent reflection generation network are optimized based on the comprehensive score.
2. The method for constructing a spatial intelligent reflection generation network based on material perception according to claim 1, characterized in that, The construction steps of the spatial intelligent reflection generation network are as follows: The overall architecture of the Spatial Intelligent Reflection Generation Network is an encoder-decoder structure, and the Spatial Intelligent Reflection Generation Network is trained using the original image set. The encoder consists of multiple alternating convolutional and pooling layers; In each layer of the encoder, the convolutional layers use 3×3 convolutional kernels and the ReLU activation function to increase the network depth; The pooling layer uses max pooling. Through multiple convolution and pooling operations, the encoder gradually maps the features of the input image to a low-dimensional space and outputs a low-dimensional feature map. The decoder gradually recovers the encoded low-dimensional features into a high-resolution reflection image; The decoder consists of alternating deconvolutional or upsampling layers and convolutional layers; the decoder fuses features from corresponding layers of the encoder through skip connections; Specific process: In each layer of the decoder, the feature map from the same layer of the encoder is concatenated with the feature map of the current decoding layer, and then the feature is fused by convolution. A material-aware frequency-domain gating mechanism is introduced between the encoder and decoder to perform frequency-domain weighting on the feature map output by the encoder, thereby suppressing irrelevant noise.
3. The method for constructing a spatial intelligent reflection generation network based on material perception according to claim 2, characterized in that, The implementation steps of the frequency domain gating mechanism are as follows: Step a1: Perform FFT transformation on the feature map output by the encoder; The feature map output by the encoder from the input image is used as the original feature map; the size of the original feature map is H×W×C; where H is the height, W is the width, and C is the number of channels; The FFT operation transforms the feature map of each channel from the spatial domain to the frequency domain, obtaining the complex matrix corresponding to each channel. Each complex matrix contains amplitude spectrum and phase spectrum information. Step a2: Design the gating function; Based on the explicit material perception model, output the probability distribution of different material categories for each pixel in the input image; The probability distribution of each pixel is represented as a vector, which represents the probability that each pixel is judged as a different material category; then, based on the material probability distribution, a gating function is designed to weight the frequency domain features to obtain the weighted amplitude spectrum. Keeping the phase spectrum corresponding to each complex matrix unchanged, the weighted amplitude spectrum and phase spectrum corresponding to each complex matrix are transformed by inverse FFT to obtain the quasi-feature map; The quasi-feature map is fused with the original feature map, and the fused feature map is used as the input feature of the decoder. A cross-scale feature fusion mechanism is introduced into the spatial intelligent reflection generation network. This mechanism extracts feature maps from different levels of the encoder before the decoder works, and then fuses the feature maps from different levels of the encoder. 
Feature maps are extracted at different levels of the encoder. For each level of feature map, parallel convolution operations are performed using convolution kernels of different scales to obtain feature information at different scales. Based on the specific spatial resolution of the feature maps at each scale, feature alignment is performed on feature maps at different scales; The feature maps of different scales after alignment are spliced and fused along the channel dimension to ensure that the features of different scales are semantically consistent, resulting in cross-scale joint features. A consistency constraint loss function is introduced during the splicing process. This loss function measures the similarity between features of different scales. By minimizing this loss function, the network is forced to keep the multi-scale features consistent in the semantic space before they enter the decoder.
4. The method for constructing a spatial intelligent reflection generation network based on material perception according to claim 3, characterized in that, The specific working steps of the decoder are as follows: The decoder consists of alternating deconvolutional layers or upsampling layers and convolutional layers; At each layer of the decoder, features from the corresponding layer of the encoder are fused via skip connections: The feature map of the corresponding layer of the encoder is concatenated with the feature map of the current layer of the decoder after deconvolution or upsampling in the channel dimension. The feature map size of the encoder's d-th layer is: H2×W2×C3; The feature map size of the d-th layer of the decoder is H2×W2×C4; After splicing, the feature map size becomes: H2×W2×(C3+C4); After each deconvolution or upsampling and feature fusion, convolutional layers are used to refine the features: Initialize the kernel parameters corresponding to the convolutional layer, slide the kernel on the feature map input by the encoder with a fixed stride; at each position, perform element-wise multiplication between the feature map region covered by the kernel and the weight matrix of the kernel, then add the product results, add the bias term to obtain the convolution output value at each position, and then input the convolution output value at each position into the activation function for nonlinear transformation. The feature maps that have undergone multiple deconvolution, upsampling, feature fusion and convolution operations are concatenated and weighted, and the decoder finally outputs the fused multi-channel feature map. The decoder maps the multi-channel feature map into a reflection image through a 1×1 convolution operation; Obtain the input image without an encoder and use it as the target image; separate the non-reflection scene from the target image.
5. The method for constructing a spatial intelligent reflection generation network based on material perception according to claim 4, characterized in that, The steps to separate the non-reflective scene from the target image are as follows: The gradient operator is used to calculate the gradient components of each pixel in the target image in the horizontal and vertical directions; based on the gradient components of each pixel in the horizontal and vertical directions, the gradient magnitude of each pixel is calculated to obtain the gradient map. Based on the magnitude of the gradient changes in the gradient map, the reflective region and the non-reflective background region are tentatively determined. Construct the objective function using data fidelity terms and regularization terms; Data fidelity term: Calculate the difference between the target image I and the sum of the reflective region R and the non-reflective background region S to obtain $I - R - S$; divide the square of its L2 norm by 2 to obtain the data fidelity term $\tfrac{1}{2}\lVert I - R - S \rVert_2^2$; The regularization terms are divided into a regularization term for the reflective region and a regularization term for the non-reflective background region; Introduce regularization parameters $\lambda_R$ and $\lambda_S$ for the reflective region and the non-reflective background region, respectively; The gradient $\nabla R$ of the reflective region R is calculated by a convolution with the gradient operator; its L1 norm multiplied by $\lambda_R$ yields the regularization term for the reflective region, $\lambda_R \lVert \nabla R \rVert_1$; The gradient operator is convolved with the non-reflective background region S to calculate its gradient $\nabla S$, from which the second derivative $\nabla^2 S$ is calculated; the square of the L2 norm of the second derivative multiplied by $\lambda_S$ yields the regularization term for the non-reflective background region, $\lambda_S \lVert \nabla^2 S \rVert_2^2$; Using gradient descent as the optimization algorithm, the reflective region R and the non-reflective background region S are updated in the direction opposite to the gradient of the objective function, gradually reducing its value until the maximum number of iterations is reached or the change in the objective function is less than $1\times e^{-5}$; this yields the non-reflective background region of the target image; The non-reflective background region is filtered and its contrast is enhanced using histogram equalization. The result is then fused with the reflection image output from the decoder to obtain the desired reflection image of the input image.
6. The method for constructing a spatial intelligent reflection generation network based on material perception according to claim 1, characterized in that, The optimization steps for model parameters are as follows: Perform geometric consistency verification on the desired reflection image: The process involves: acquiring simulated reflection images of the input images from the original image set; obtaining geometric information related to spatial location and depth from the simulated reflection images; comparing the geometric information in the desired reflection image with that in the simulated reflection image; checking whether the position of the reflecting object in the desired reflection image conforms to the geometric laws of light reflection and whether the depth relationship corresponds; and scoring based on the results of the geometric consistency verification of the desired reflection image, using it as a Class A indicator. Next, frequency domain consistency verification is performed: frequency domain analysis is conducted on the expected reflection image and the simulated reflection image to extract the spectral features of both, and the similarity index between their spectra is calculated as a Class B indicator; Weights are allocated proportionally to the Class A and Class B indicators, and the comprehensive score of the expected reflection image is calculated. The overall score is compared with a preset pass threshold. If the overall score is greater than or equal to the pass threshold, the reflection image is deemed to pass. If the overall score is less than the pass threshold, the reflection image is deemed to fail. The similarity index between the simulated reflection image and the expected reflection image is calculated. The difference in the similarity index is added as an additional loss term to the loss function of the spatial intelligent reflection generation network. 
The network is retrained until the reflection image generated by the network meets the requirements.