Prior knowledge guided deep unfolding network based light field image dereflection method and device

CN122243786A · Pending · Publication Date: 2026-06-19 · BEIJING UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING UNIV OF TECH
Filing Date
2026-01-27
Publication Date
2026-06-19


Abstract

A method and apparatus for dereflecting light field images using a prior-knowledge-guided deep unfolding network, the method comprising: (1) inputting a mixed light field image with reflection interference; (2) using a light field disparity estimation module to predict the disparity map of the center view; (3) entering the deep unfolding iteration stage, feeding the current features into the denoising module, and using a convolutional neural network to update the auxiliary variables so as to extract local priors; (4) using the intra-view non-local similarity module to extract single-view texture self-similarity features, and using the disparity map to guide the cross-view non-local similarity module in feature alignment and aggregation so as to extract cross-view geometric consistency features; (5) feeding the denoised auxiliary variables and the outputs of the intra-view and cross-view non-local similarity modules into the reconstruction module, and updating the transmission layer and reflection layer components by gradient descent; (6) repeating (3)-(5) until the preset number of iterations is reached, then reconstructing the final transmission layer features through the decoder and outputting the light field image with reflections removed.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, and more particularly to a prior-knowledge-guided deep unfolding network method for dereflecting light field images and a corresponding apparatus.

Background Technology

[0002] With the widespread adoption of digital imaging technology, when imaging a target scene through semi-transparent media such as glass windows, display cases, or water surfaces, reflected light from the media surface often overlaps with transmitted light from the target scene. This ubiquitous reflection phenomenon not only reduces the image's visibility and aesthetics, but more importantly, as a complex additive interference, it severely damages the image's statistical properties, leading to a significant degradation in the performance of subsequent computer vision tasks such as object detection and semantic segmentation. Therefore, accurately separating the transmission and reflection layers from the mixed image is of significant application value for improving imaging quality in complex environments and ensuring the reliability of vision systems.

[0003] In the field of image reflection removal, early traditional methods relied primarily on hand-designed prior assumptions to constrain the solution space, such as gradient sparsity priors, differences in relative smoothness, or ghosting effects. However, methods based on such specific physical assumptions often fail to cover the complex and varied reflection patterns of the real world, especially where the textures of the transmission and reflection layers are similar or where strong reflections exist, which severely limits their generalization ability. In recent years, deep learning methods, with their powerful feature representation and nonlinear fitting capabilities, have achieved breakthroughs in objective metrics by building end-to-end convolutional neural networks (CNNs) that directly learn the mapping from the mixed image to the transmission layer. However, these purely data-driven methods typically behave as "black box" models: they lack explicit physical interpretability and often rely on stacking network depth to improve performance, which sharply increases computational complexity and makes it difficult to explain the role of each network component in terms of the physical lighting model.

[0004] To balance interpretability and performance, deep unfolding networks have emerged, mapping the iterative steps of traditional optimization algorithms onto a neural network structure. However, most existing unfolding-based dereflection methods are limited to single-image processing and neglect the rich multi-view information provided by light field imaging. Light field cameras record the direction and position of light rays in a single exposure, and the resulting multi-view sub-aperture images provide significant disparity cues and depth information, which are crucial for distinguishing transmission and reflection layers lying at different depth planes. However, existing light field reflection removal methods typically use the multi-view images directly as input, failing to fully exploit the texture self-similarity within views and the geometric consistency prior between views. Furthermore, existing deep unfolding networks are mostly designed for single images, making it difficult for them to handle the cross-view geometric misalignment caused by depth differences in light field data. This results in loss of detail and artifacts when recovering transmission layers in complex scenes.

Summary of the Invention

[0005] To overcome the shortcomings of existing technologies, the technical problem to be solved by this invention is to provide a light field image dereflection method guided by prior knowledge in a deep unfolded network. This method constructs a dereflection framework that has both the nonlinear fitting capability of deep learning and follows the constraints of physical imaging models, thereby achieving an organic unity of model-driven and data-driven approaches and solving the drawback of traditional "black box" networks lacking interpretability.

[0006] The technical solution of this invention is a prior-knowledge-guided deep unfolding network method for dereflecting light field images, comprising the following steps: (1) input a mixed light field image with reflection interference; (2) predict the center view disparity map using the light field disparity estimation module; (3) enter the deep unfolding iteration stage, feed the current features into the denoising module, and use the convolutional neural network to update the auxiliary variables to extract local priors; (4) extract single-view texture self-similarity features using the intra-view non-local similarity module, and simultaneously use the disparity map to guide the cross-view non-local similarity module in feature alignment and aggregation to extract cross-view geometric consistency features; (5) feed the denoised auxiliary variables and the outputs of the intra-view and cross-view non-local similarity modules into the reconstruction module, and update the transmission layer and reflection layer components by gradient descent; (6) repeat steps (3)-(5) until the preset number of iterations is reached, then reconstruct the final transmission layer features through the decoder and output the light field image with reflections removed.

[0007] This invention constructs a dual non-local prior: an intra-view non-local similarity module mines spatial texture self-similarity, and a cross-view non-local similarity module captures angular geometric consistency. It introduces a light field disparity estimation module to guide the interaction between the center view and the side views and thereby resolve geometric misalignment. The resulting dereflection framework combines the nonlinear fitting capability of deep learning with the constraints of the physical imaging model, unifying model-driven and data-driven approaches and overcoming the lack of interpretability of traditional "black box" networks.

[0008] A prior-knowledge-guided deep unfolding network device for dereflecting light field images is also provided, which includes: an input module configured to input a mixed light field image with reflection interference; a light field disparity estimation module configured to predict the disparity map of the center view; a denoising module configured to receive the current features on entering the deep unfolding iteration stage and to use the convolutional neural network to update the auxiliary variables to extract local priors; a feature extraction module configured to use the intra-view non-local similarity module to extract single-view texture self-similarity features, and at the same time to use the disparity map to guide the cross-view non-local similarity module in feature alignment and aggregation to extract cross-view geometric consistency features; a reconstruction module configured to receive the denoised auxiliary variables and the outputs of the intra-view and cross-view non-local similarity modules, and to update the transmission layer and reflection layer components by gradient descent; and an output module configured to repeatedly execute the denoising module, feature extraction module, and reconstruction module until a preset number of iterations is reached, after which the final transmission layer features are reconstructed by the decoder and the light field image with reflections removed is output.

Attached Figure Description

[0009] Figure 1 is a model framework diagram of the prior-knowledge-guided deep unfolding network method for dereflecting light field images according to the present invention.

[0010] Figure 2 is a flowchart of the prior-knowledge-guided deep unfolding network method for dereflecting light field images according to the present invention.

[0011] Figure 3 is a flowchart of the intra-view non-local similarity module.

[0012] Figure 4 is a flowchart of the cross-view non-local similarity module.

[0013] Figure 5 is a flowchart of the center view enhancement block.

[0014] Figure 6 is a flowchart of the side view enhancement block.

[0015] Figure 7 compares the visual quality of the method of the present invention with that of single-image, multi-image, and light field dereflection methods on a synthetic dataset.

[0016] Figure 8 shows a detailed visual comparison of the method of the present invention with the Deep Expandable Network (DExNet) method and the Deep Learning Light Field Dereflection (DMINet) method on real-world and synthetic test sets.

[0017] Figure 9 shows the average objective results of 13 comparison methods on synthetic and real datasets.

Detailed Implementation

[0018] As shown in Figure 2, this prior-knowledge-guided deep unfolding network method for dereflecting light field images includes the following steps: (1) input a mixed light field image with reflection interference; (2) predict the center view disparity map using the light field disparity estimation module; (3) enter the deep unfolding iteration stage, feed the current features into the denoising module, and use the convolutional neural network to update the auxiliary variables to extract local priors; (4) extract single-view texture self-similarity features using the intra-view non-local similarity module, and simultaneously use the disparity map to guide the cross-view non-local similarity module in feature alignment and aggregation to extract cross-view geometric consistency features; (5) feed the denoised auxiliary variables and the outputs of the intra-view and cross-view non-local similarity modules into the reconstruction module, and update the transmission layer and reflection layer components by gradient descent; (6) repeat steps (3)-(5) until the preset number of iterations is reached, then reconstruct the final transmission layer features through the decoder and output the light field image with reflections removed.

[0019] This invention constructs a dual non-local prior: an intra-view non-local similarity module mines spatial texture self-similarity, and a cross-view non-local similarity module captures angular geometric consistency. It introduces a light field disparity estimation module to guide the interaction between the center view and the side views and thereby resolve geometric misalignment. The resulting dereflection framework combines the nonlinear fitting capability of deep learning with the constraints of the physical imaging model, unifying model-driven and data-driven approaches and overcoming the lack of interpretability of traditional "black box" networks.

[0020] Preferably, in step (2), the light field disparity estimation module independently extracts high-dimensional features of each sub-aperture image using a three-layer convolutional network with shared weights. Then, in the center view reference coordinate system, different disparity values are hypothesized for each spatial location within a preset disparity range, and the feature matching cost between corresponding positions of the center view and each side view is calculated, thereby constructing a multi-dimensional matching cost representation that jointly encodes spatial location, view index, and disparity hypothesis. A 3D convolutional layer regularizes the cost volume to aggregate contextual information, and a continuous disparity map is obtained by regression using a Soft-argmin operation. The Soft-argmin operation obtains the disparity probability distribution by applying Softmax normalization to the regularized cost volume along the disparity dimension; the Softmax function maps the matching scores to normalized weights in probabilistic form, so that the weights over all disparity hypotheses at each pixel sum to 1. The mathematical expectation over all hypothesized disparity values is then calculated, yielding a continuous disparity estimate with sub-pixel accuracy.
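The Soft-argmin regression described above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function name `soft_argmin`, the `(D, H, W)` array layout, the "lower cost is a better match" convention, and the toy quadratic cost are all assumptions made for illustration, and the 3D convolutional regularization is omitted.

```python
import numpy as np

def soft_argmin(cost_volume, disparities):
    """Regress a continuous disparity map from a cost volume.

    cost_volume : (D, H, W) array; lower cost = better match (assumed convention).
    disparities : (D,) array of hypothesized disparity values.
    """
    # Softmax along the disparity axis of the negated cost,
    # so that a low matching cost receives a high probability.
    logits = -cost_volume
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=0, keepdims=True)
    # Expected disparity under the per-pixel probability distribution.
    disparity_map = np.tensordot(disparities, prob, axes=(0, 0))
    return disparity_map, prob

# Toy cost volume whose minimum lies between two hypotheses (true disparity 0.3).
disparities = np.linspace(-2.0, 2.0, 9)  # hypotheses spaced 0.5 apart
true_disparity = np.full((4, 5), 0.3)
cost = 10.0 * (disparities[:, None, None] - true_disparity[None]) ** 2
disparity_map, prob = soft_argmin(cost, disparities)
print(disparity_map[0, 0])
```

In this toy case the estimate lands between the grid values 0 and 0.5, which a hard argmin over the nine hypotheses could not return; that is the sense in which the expectation gives sub-pixel precision.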

[0021] Preferably, in step (4), the intra-view non-local similarity module uses repeating textures within the image to enhance structural information and to suppress noise and artifacts that do not conform to the statistical regularities of natural images. First, the feature map output by the previous network stage, or the initialized feature map, is received as input. The input feature tensor is decoupled along the angular dimension and decomposed into a series of independent single-view image features: the input transmission layer features are characterized by an angular resolution, a spatial resolution, and a number of channels, and after decoupling they are treated as a set of independent single-view feature slices. Each decoupled single-view feature is then fed in parallel to a non-local block unit, which restricts the search range to a local neighborhood centered on the current pixel. The unit computes a similarity scalar for each pixel pair in the neighborhood and normalizes these scalars with the Softmax operation, whose exponential mapping yields weights between 0 and 1. The feature representation of the current pixel is then enhanced by weighted aggregation of the redundant information in the neighborhood, and the output of the non-local block is added to the original input features through a residual connection. Finally, all independently processed single-view features are rearranged and aggregated along the angular dimension to restore the original light field feature structure, outputting refined transmission layer and reflection layer features enhanced by the intra-view non-local prior.
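A minimal NumPy sketch of a neighborhood-restricted non-local block with a residual connection follows. It only illustrates the mechanism described above: the `(U, V, H, W, C)` layout, the scaled dot-product similarity, the window radius, and both function names are assumptions, and the learned embeddings a trained network would apply before computing similarities are left out.

```python
import numpy as np

def local_nonlocal_block(feat, radius=2):
    """Non-local aggregation restricted to a (2*radius+1)^2 window, plus residual.

    feat : (H, W, C) features of a single view.
    """
    H, W, C = feat.shape
    padded = np.pad(feat, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    # Stack every shifted copy of the view: neighbours has shape (N, H, W, C).
    neighbours = np.stack([
        padded[radius + dy: radius + dy + H, radius + dx: radius + dx + W]
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
    ])
    # Similarity scalar per pixel pair (scaled dot product, an assumed choice).
    sim = (neighbours * feat[None]).sum(axis=-1) / np.sqrt(C)  # (N, H, W)
    sim = sim - sim.max(axis=0, keepdims=True)                 # numerical stability
    weights = np.exp(sim)
    weights = weights / weights.sum(axis=0, keepdims=True)     # Softmax over the window
    aggregated = (weights[..., None] * neighbours).sum(axis=0)
    return feat + aggregated                                   # residual connection

def intra_view_nonlocal(lf_feat, radius=2):
    """Apply the block to each view of a (U, V, H, W, C) light field feature tensor."""
    out = np.empty_like(lf_feat)
    U, V = lf_feat.shape[:2]
    for u in range(U):          # angular decoupling: one slice per view
        for v in range(V):
            out[u, v] = local_nonlocal_block(lf_feat[u, v], radius)
    return out                  # same layout as the input light field features

rng = np.random.default_rng(0)
lf_feat = rng.random((3, 3, 8, 8, 4))
enhanced = intra_view_nonlocal(lf_feat)
print(enhanced.shape)
```

The explicit loop over views stands in for the parallel processing mentioned in the text; because each view is handled independently, the slices could equally be batched.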

[0022] Preferably, in step (4), the cross-view non-local similarity module includes a center view enhancement block (CEB) and a side view enhancement block (SEB). The input transmission layer features are decomposed along the angular dimension into center view features and side view features. The decomposed center view features, side view features, and disparity are fed into the center view enhancement block, which uses the disparity to perform a geometric warp operation from the side views to the center view, aligning all side views to the center view. The block then iterates over the aligned side view features, feeding each one together with the original center view features into a non-local block network, and obtains non-local features that aggregate side view information by calculating cross-view correlations. After all side views have been processed, these weighted feature terms are aggregated and fused to generate the final enhanced center view features. The enhanced center view features are then used as key guiding information and, together with the original side view features and the disparity map, are fed into the side view enhancement block, where a reverse geometric alignment operation is performed based on the disparity. This maps the enhanced center view features back into the geometric coordinate system of each side view, generating aligned guidance features. Each aligned center view is then traversed in turn: the original side view features and the corresponding aligned center view features are input into the non-local block network, the cross-view guidance matrix is calculated, and the aggregated feature terms are obtained. This allows the clear structural and texture information restored in the center view to be accurately propagated to each side view. The enhanced center view features are output together with all side view features.
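The disparity-guided warp from a side view to the center view can be sketched as follows. This is an illustration under stated assumptions rather than the patent's formula: the correspondence rule (a center pixel at `(y, x)` is looked up in the side view at `(y, x) + disparity * angular_offset`), its sign convention, bilinear sampling with border clamping, and the names `bilinear_sample` and `warp_side_to_center` are all hypothetical.

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Sample img (H, W, C) at fractional coordinates ys, xs (each H x W)."""
    H, W = img.shape[:2]
    ys = np.clip(ys, 0, H - 1)          # clamp to the image border
    xs = np.clip(xs, 0, W - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[..., None]
    wx = (xs - x0)[..., None]
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom

def warp_side_to_center(side_feat, disparity, angular_offset):
    """Align a side view to the center view using the center-view disparity map.

    side_feat      : (H, W, C) side view features.
    disparity      : (H, W) disparity map of the center view.
    angular_offset : (du, dv), side-view angular coordinate minus center-view one.
    """
    H, W = disparity.shape
    ys, xs = np.meshgrid(np.arange(H, dtype=float),
                         np.arange(W, dtype=float), indexing="ij")
    du, dv = angular_offset
    return bilinear_sample(side_feat, ys + disparity * du, xs + disparity * dv)

# Toy side view whose single channel stores the column index.
side = np.zeros((6, 8, 1))
side[..., 0] = np.arange(8)[None, :]
warped = warp_side_to_center(side, np.full((6, 8), 0.5), (0, 1))
print(warped[2, 3, 0])
```

With a uniform disparity of 0.5 and a horizontal angular offset of one view, each center pixel reads the side view half a pixel to its right, so the ramp value at column 3 becomes 3.5.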

[0023] Preferably, in step (5), the reconstruction module includes a transmission layer reconstruction module and a reflection layer reconstruction module. The transmission layer reconstruction module uses the output of the transmission denoising module, the output of the intra-view non-local similarity module, and the output of the cross-view non-local similarity module to reconstruct and update the transmission layer.

[0024] Preferably, in step (5), the transmission layer reconstruction module performs the following steps: (5.1) The reconstruction error term is calculated by solving Equation (9) with a one-step gradient descent, giving Equation (11), whose terms include the first penalty parameter, the weight coefficient of the intra-view non-local similarity regularization term, the weight coefficient of the cross-view non-local similarity regularization term, a side view of the light field (indexed by s), and the center view of the light field (indexed by c). (5.2) Using the updated error and the denoised features, the core image reconstruction step is performed; the key to this step is finding an optimal solution that simultaneously satisfies the data fidelity constraint and the regularization prior constraint, giving Equation (12), whose terms include the second penalty parameter and the penalty parameter for the auxiliary variables.

[0025] Those skilled in the art will understand that all or part of the steps in the methods of the above embodiments can be implemented by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the methods of the above embodiments. The storage medium can be ROM/RAM, a magnetic disk, an optical disk, a memory card, etc. Accordingly, corresponding to the method of the present invention, the present invention also includes a prior-knowledge-guided deep unfolding network device for dereflecting light field images. This device is typically represented as functional modules corresponding to the steps of the method. The device includes: an input module configured to input a mixed light field image with reflection interference; a light field disparity estimation module configured to predict the disparity map of the center view; a denoising module configured to receive the current features on entering the deep unfolding iteration stage and to use the convolutional neural network to update the auxiliary variables to extract local priors; a feature extraction module configured to use the intra-view non-local similarity module to extract single-view texture self-similarity features, and at the same time to use the disparity map to guide the cross-view non-local similarity module in feature alignment and aggregation to extract cross-view geometric consistency features; a reconstruction module configured to receive the denoised auxiliary variables and the outputs of the intra-view and cross-view non-local similarity modules, and to update the transmission layer and reflection layer components by gradient descent; and an output module configured to repeatedly execute the denoising module, feature extraction module, and reconstruction module until a preset number of iterations is reached, after which the final transmission layer features are reconstructed by the decoder and the light field image with reflections removed is output.

[0026] Preferably, the light field disparity estimation module independently extracts high-dimensional features of each sub-aperture image using a three-layer convolutional network with shared weights. Then, in the center view reference coordinate system, different disparity values are hypothesized for each spatial location within a preset disparity range, and the feature matching cost between corresponding positions of the center view and each side view is calculated, thereby constructing a multi-dimensional matching cost representation that jointly encodes spatial location, view index, and disparity hypothesis. A 3D convolutional layer regularizes the cost volume to aggregate contextual information, and a continuous disparity map is obtained by regression using a Soft-argmin operation. The Soft-argmin operation obtains the disparity probability distribution by applying Softmax normalization to the regularized cost volume along the disparity dimension; the Softmax function maps the matching scores to normalized weights in probabilistic form, so that the weights over all disparity hypotheses at each pixel sum to 1. The mathematical expectation over all hypothesized disparity values is then calculated, yielding a continuous disparity estimate with sub-pixel accuracy.

[0027] Preferably, in the feature extraction module, the intra-view non-local similarity module uses repeating textures within the image to enhance structural information and to suppress noise and artifacts that do not conform to the statistical regularities of natural images. First, it receives the feature map output by the previous network stage, or the initialized feature map, as input. It then decouples the input feature tensor along the angular dimension, splitting it into a series of independent single-view image features: the input transmission layer features are characterized by an angular resolution, a spatial resolution, and a number of channels, and after decoupling they are treated as a set of independent single-view feature slices. Each decoupled single-view feature is then fed in parallel to a non-local block unit, which restricts the search range to a local neighborhood centered on the current pixel. The unit computes a similarity scalar for each pixel pair in the neighborhood and normalizes these scalars with the Softmax operation, whose exponential mapping yields weights between 0 and 1. The feature representation of the current pixel is then enhanced by weighted aggregation of the redundant information in the neighborhood, and the output of the non-local block is added to the original input features through a residual connection. All independently processed single-view features are rearranged and aggregated along the angular dimension to restore the original light field feature structure, outputting refined transmission layer and reflection layer features enhanced by the intra-view non-local prior. The cross-view non-local similarity module includes the center view enhancement block (CEB) and the side view enhancement block (SEB). The input transmission layer features are decomposed along the angular dimension into center view features and side view features. The decomposed center view features, side view features, and disparity are fed into the center view enhancement block, which uses the disparity to perform a geometric warp operation from the side views to the center view, aligning all side views to the center view. The block then iterates over the aligned side view features, feeding each one together with the original center view features into a non-local block network, and obtains non-local features that aggregate side view information by calculating cross-view correlations. After all side views have been processed, these weighted feature terms are aggregated and fused to generate the final enhanced center view features. The enhanced center view features are then used as key guiding information and, together with the original side view features and the disparity map, are fed into the side view enhancement block, where a reverse geometric alignment operation is performed based on the disparity. This maps the enhanced center view features back into the geometric coordinate system of each side view, generating aligned guidance features. Each aligned center view is then traversed in turn: the original side view features and the corresponding aligned center view features are input into the non-local block network, the cross-view guidance matrix is calculated, and the aggregated feature terms are obtained. This allows the clear structural and texture information restored in the center view to be accurately propagated to each side view. The enhanced center view features are output together with all side view features.

[0028] Preferably, the reconstruction module includes a transmission layer reconstruction module and a reflection layer reconstruction module. The transmission layer reconstruction module uses the output of the transmission denoising module, the output of the intra-view non-local similarity module, and the output of the cross-view non-local similarity module to reconstruct and update the transmission layer. The transmission layer reconstruction module performs the following steps: (5.1) The reconstruction error term is calculated by solving Equation (9) with a one-step gradient descent, giving Equation (11), whose terms include the first penalty parameter, the weight coefficient of the intra-view non-local similarity regularization term, the weight coefficient of the cross-view non-local similarity regularization term, a side view of the light field (indexed by s), and the center view of the light field (indexed by c). (5.2) Using the updated error and the denoised features, the core image reconstruction step is performed; the key to this step is finding an optimal solution that simultaneously satisfies the data fidelity constraint and the regularization prior constraint, giving Equation (12), whose terms include the second penalty parameter and the penalty parameter for the auxiliary variables.

[0029] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below in conjunction with specific implementation methods.

[0030] To address the ill-posed inverse problem of light field reflection removal, this invention constructs an optimization model that includes a data fidelity term and non-local regularization terms. Let I denote the input mixed light field, T the transmission layer, and R the reflection layer. According to the additive superposition model I = T + R, the problem of light field reflection removal is modeled as the joint optimization problem of Equation (1), a minimization over T and R, that is, finding the T and R that minimize the objective function. The objective contains the following terms. Data fidelity term: an L2-norm penalty, built from the sum of squares of all elements of the residual, constrains the sum of the transmission and reflection layers to approximate the observed image, ensuring that the decomposition conforms to the physical imaging model. Intra-view non-local similarity (IVNL) regularization term: this term exploits texture self-similarity within the spatial domain of a single-view image to enhance structural information and suppress noise and artifacts that do not conform to the statistical regularities of natural images. Cross-view non-local similarity (CVNL) regularization term: to account for the high-dimensional geometric characteristics of the light field, and based on the disparity difference between the transmission and reflection layers, this term distinguishes the two layers by constraining the geometric consistency among multiple views. Prior terms: a deep-learning-based implicit denoising prior is introduced, and auxiliary variables constrain the feature space to further improve the final reconstruction quality. The remaining parameters are the corresponding weighting coefficients.
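As a small numerical illustration of the additive model and the data fidelity term, the NumPy sketch below evaluates a squared-residual penalty and takes one gradient-descent step on the transmission layer. The `0.5` scaling, the `(U, V, H, W)` layout, the step size, and the function names are assumptions for this example; the regularization and prior terms of the full objective are not modeled here.

```python
import numpy as np

def data_fidelity(I, T, R):
    """0.5 * sum of squared residuals of I - T - R (the 0.5 factor is an assumption)."""
    residual = I - T - R
    return 0.5 * np.sum(residual ** 2)

def data_fidelity_grad_T(I, T, R):
    """Gradient of the term above with respect to the transmission layer T."""
    return T + R - I

rng = np.random.default_rng(0)
T_true = rng.random((3, 3, 8, 8))        # toy (U, V, H, W) transmission layer
R_true = 0.3 * rng.random((3, 3, 8, 8))  # toy reflection layer
I = T_true + R_true                      # additive superposition model

T = I.copy()                             # initial guess: everything is transmission
R = R_true                               # reflection held fixed for this step
before = data_fidelity(I, T, R)
T = T - 0.5 * data_fidelity_grad_T(I, T, R)  # one gradient-descent step
after = data_fidelity(I, T, R)
print(before, after)
```

The fidelity term alone cannot separate the layers, since any pair summing to I gives zero residual; that ambiguity is what the non-local and denoising priors are there to resolve.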

[0031] According to the non-local self-similarity prior of an image, an image patch can be effectively represented by a linear combination of its non-locally similar neighboring patches. To characterize the difference between the real image details and the non-local predictions, a modeling error term is introduced. The intra-view non-local regularization term can then be expressed as the reconstruction error between the transmission layer and its non-local prediction, given by Equation (2), whose terms include a side view of the light field (indexed by s), a non-local operator, and the reconstruction residual.

[0032] The IVNL modeling of the reflection layer R adopts the same form, given by Equation (3). The inherent disparity between different views causes geometric misalignment in cross-view non-local modeling. Therefore, this invention first uses the estimated disparity information to warp other views to the target view and eliminate this misalignment. The disparity map of the center view is estimated first, and the warping operations between the center view and a side view are defined by Equation (4), whose terms include the angular coordinate vectors of the side view index and the center view index, the result of warping a side view to the center view, and the result of aligning the center view to a side view.

[0033] Based on the above warping operations, the CVNL regularization term can be defined by Equation (5), in which the non-local aggregation operator is derived by calculating the similarity between the center view and the aligned side view.

[0034] Likewise, the CVNL modeling of the reflection layer follows the same process as that of the transmission layer, giving Equation (6).

[0035] To efficiently solve the above optimization problem with its complex regularization terms, this invention employs Half-Quadratic Splitting (HQS). The HQS algorithm decouples the original problem by introducing auxiliary variables, transforming it into a sequence of easily solvable sub-problems. Specifically, auxiliary variables are introduced to approximate the corresponding original variables, together with penalty parameters, so that Equation (1) is transformed into Equation (7), which contains the penalty parameters and the auxiliary variables introduced for the HQS algorithm.

[0036] To solve the above problem, each variable is updated in turn while the other variables are held fixed, so the problem decomposes into six subproblems. For the transmission layer T, the optimization problem of equation (7) can be solved iteratively through the three subproblems given by equations (8), (9), and (10); the reflection layer R is optimized in a similar manner.
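The alternating scheme of half-quadratic splitting can be illustrated on a toy scalar problem. The following is only a minimal sketch under stated assumptions, not the patent's equations (7)-(10): the objective, the L1 prior, and the names `soft_threshold` and `hqs_scalar` are hypothetical choices made for illustration. In the described network, the auxiliary-variable step is carried out by a learned denoising module and the image step by a gradient descent update, rather than by the closed-form expressions used here.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def hqs_scalar(y, lam, mu, num_iters=60):
    """Alternate the two subproblems of the split objective
        0.5 * (x - y)**2 + lam * |z| + 0.5 * mu * (x - z)**2
    for a fixed penalty mu (toy example). Returns the final (x, z)."""
    x = y
    z = y  # initialise the auxiliary variable with the observation
    for _ in range(num_iters):
        # x-subproblem with z fixed: quadratic, closed-form minimiser
        x = (y + mu * z) / (1.0 + mu)
        # z-subproblem with x fixed: proximal step on the prior
        z = soft_threshold(x, lam / mu)
    return x, z


x, z = hqs_scalar(y=3.0, lam=1.0, mu=1.0)
```

Each pass updates one variable while the other is held fixed, which is the decoupling that makes every subproblem easy to solve on its own.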

[0037] The ICDUN network architecture proposed in this invention comprises K cascaded stages; in this embodiment, K is preferably set to 4. Each stage shares network structure parameters to reduce the number of model parameters and prevent overfitting. The internal structure and data flow of each module are detailed below.

[0038] Light field disparity estimation module

[0039] The light field disparity estimation module, which forms the basis for geometric alignment, predicts the center view disparity map from the mixed light field I. First, the module independently extracts high-dimensional features from each sub-aperture image using a three-layer convolutional network with shared weights. Then, in the center view reference coordinate system, different disparity values are hypothesized for each spatial location within a preset disparity range, and the feature matching cost between corresponding positions in the center view and each side view is calculated, constructing a multi-dimensional matching cost representation (cost volume) that jointly encodes spatial location, view index, and disparity hypothesis. Next, a 3D convolutional layer is used to regularize the cost volume to aggregate contextual information, and a soft-argmin operation regresses a continuous disparity map. The soft-argmin operation obtains the disparity probability distribution by applying softmax normalization to the regularized cost volume along the disparity dimension. The softmax function maps the similarity scalars to normalized weights in probabilistic form, ensuring that the weights of all neighboring pixels participating in the aggregation sum to 1. The expected value over all disparity hypotheses is then calculated to obtain a continuous disparity estimate with sub-pixel precision. The output disparity map has the same spatial resolution as the original image and characterizes the geometric depth of the transmission layer, providing crucial alignment guidance for the subsequent CVNL module.
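The soft-argmin regression described above can be sketched for a single pixel as follows. This is an illustrative sketch, not the patent's implementation: the function name `soft_argmin`, the sample values, and the choice to apply softmax to the negated cost (so that a lower cost receives a higher probability) are assumptions.

```python
import math


def soft_argmin(costs, disparities):
    """Regress a continuous disparity from matching costs at one pixel.

    costs[i] is the regularised matching cost of hypothesis disparities[i].
    Assumption: lower cost means a better match, so softmax is applied
    to the negated cost.
    """
    neg = [-c for c in costs]
    m = max(neg)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in neg]
    total = sum(exps)
    probs = [e / total for e in exps]  # disparity probability distribution
    # expected value over all disparity hypotheses
    expected = sum(p * d for p, d in zip(probs, disparities))
    return expected, probs


disparities = [-2.0, -1.0, 0.0, 1.0, 2.0]
d_sym, p_sym = soft_argmin([5.0, 1.0, 0.2, 1.0, 5.0], disparities)
d_pos, p_pos = soft_argmin([5.0, 1.0, 0.2, 0.5, 5.0], disparities)
```

Because the result is a probability-weighted average rather than a hard argmin, it can fall between the sampled hypotheses, which is what gives sub-pixel precision and keeps the operation differentiable.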

[0040] In-view nonlocal similarity module

[0041] In step (4), the in-view nonlocal similarity module is also a result of this invention; its main contents are as follows. Natural images generally exhibit texture self-similarity within the spatial domain of a single view: a patch at one location in an image often has structurally highly similar patches at other locations in the same image. As shown in Figure 1(b), this invention uses repeated textures within the image to enhance structural information and suppress noise and artifacts that do not conform to the statistics of natural images. The process is shown in Figure 3. First, the module receives the feature map output by the previous layer, or the initialized feature map, as input. The input feature tensor is decoupled along the angular dimension, splitting it into a series of independent single-view image features. The dimension of the input transmission layer feature is determined by its angular resolution, spatial resolution, and number of channels; after decoupling, it is treated as a set of independent feature slices, one per view. Then, the decoupled single-view features are fed in parallel into the nonlocal block unit. As shown in Figure 1(d), the nonlocal block is an extension of the traditional autoregressive model and is used to capture long-range dependencies in the image. In this step, the nonlocal block does not search the full image but restricts the search range to a local neighborhood centered on the current pixel, balancing computational cost and performance. A similarity scalar is calculated for each pixel pair within the neighborhood, and the similarity weights are normalized with a softmax operation; that is, the similarity scalars are mapped through an exponential operation to normalized weights between 0 and 1. Redundant information within the neighborhood is then used to enhance the feature representation of the current pixel through weighted aggregation. The output of the nonlocal block is added to the original input features through a residual connection to preserve the original information and accelerate gradient propagation. Finally, all independently processed single-view features are rearranged and aggregated along the angular dimension to restore the initial light field feature structure, outputting refined transmission and reflection layer features enhanced by in-view nonlocal priors.
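The windowed nonlocal aggregation described above can be sketched on a one-dimensional row of pixel features. This is a simplified illustration under stated assumptions: the function name `nonlocal_block_1d`, the dot-product similarity, and the omission of any learned embedding layers are choices made here, not details given in the source.

```python
import math


def nonlocal_block_1d(feats, radius):
    """Windowed nonlocal aggregation with a residual connection.

    feats: list of feature vectors (lists of floats), one per pixel.
    radius: half-width of the local search window around each pixel.
    Assumption: similarity is a plain dot product of feature vectors.
    """
    n = len(feats)
    channels = len(feats[0])
    out = []
    for i in range(n):
        window = range(max(0, i - radius), min(n, i + radius + 1))
        # similarity scalar between pixel i and each neighbour in the window
        scores = [sum(a * b for a, b in zip(feats[i], feats[j])) for j in window]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax: each in (0, 1], sum 1
        # weighted aggregation of the neighbourhood features
        agg = [sum(w * feats[j][c] for w, j in zip(weights, window))
               for c in range(channels)]
        # residual connection: add the aggregated features to the input
        out.append([x + a for x, a in zip(feats[i], agg)])
    return out


flat = [[1.0, 2.0] for _ in range(5)]
result = nonlocal_block_1d(flat, radius=1)
```

Restricting the loop to `window` instead of all `n` positions is the local-neighborhood search that trades a small loss of context for a much lower computational cost.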

[0042] Cross-view nonlocal similarity module

[0043] In step (4), the cross-view nonlocal similarity module is also a result of this invention; its overall process is shown in Figure 4, and its main contents are as follows. The cross-view nonlocal similarity (CVNL) module mines the geometric correlations between different views of a light field image, overcoming the limitation of relying solely on single-view information to distinguish the transmission and reflection layers. As shown in Figure 1(c), the module consists mainly of two complementary sub-blocks: the center view enhancement block (CEB) and the side view enhancement block (SEB).

[0044] The input consists of the transmission layer features and the disparity map generated by the light field disparity estimation module. First, the input transmission layer features are decomposed into center view features and side view features in the angular dimension.

[0045] Subsequently, the decomposed center view features, side view features, and disparity are sent to the center view enhancement block, whose process is shown in Figure 5. To eliminate disparity-induced misalignment between views, the block uses the disparity to perform a geometric warping operation from the side views to the center, aligning all side views to the center view to obtain the aligned side view features. Because all side view features are aligned to the coordinate system of the center view, they are spatially consistent with the center view, which lays the geometric foundation for subsequent feature aggregation. The block then traverses each aligned side view feature in turn, feeding it together with the original center view features into a nonlocal block network. By calculating cross-view correlations, it obtains nonlocal features that aggregate side view information. After all side views have been processed, these weighted feature terms are aggregated and fused to generate the final enhanced center view feature. The block thus enhances the center view by extracting complementary texture details from multiple side views, thereby effectively suppressing reflection interference.
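The side-to-center warping step can be sketched for one image row as follows. This is a hedged illustration: the function name `warp_row_to_center`, the sign convention (center pixel x corresponds to side pixel x + du · d(x)), the border clamping, and the linear interpolation are all assumptions, since the source does not reproduce the warping equation.

```python
def warp_row_to_center(side_row, disparity_row, du):
    """Warp one row of a side view onto the centre-view pixel grid.

    side_row: samples of the side view along one image row.
    disparity_row: centre-view disparity, in pixels per unit angular step.
    du: angular offset of the side view from the centre view.
    Assumed convention: centre pixel x maps to side pixel x + du * d(x).
    """
    n = len(side_row)
    warped = []
    for x in range(n):
        src = x + du * disparity_row[x]
        src = min(max(src, 0.0), n - 1.0)  # clamp at the image borders
        x0 = int(src)  # floor, valid because src >= 0
        x1 = min(x0 + 1, n - 1)
        t = src - x0
        # linear interpolation between the two nearest side-view samples
        warped.append((1.0 - t) * side_row[x0] + t * side_row[x1])
    return warped


side = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # ramp shifted by one pixel
aligned = warp_row_to_center(side, [1.0] * 8, du=1)
half = warp_row_to_center(side, [0.5] * 8, du=1)
```

After warping, a scene point occupies the same pixel position in the aligned side view as in the center view, so cross-view similarity can be computed position by position. The inverse alignment used later for the side views would apply the opposite offset.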

[0046] After the enhanced center view features are obtained, they are used as key guiding information and, together with the original side view features and the disparity, are fed into the side view enhancement block (SEB), whose process is shown in Figure 6. First, an inverse geometric alignment operation is performed based on the disparity, mapping the enhanced center view features back to the geometric coordinate system of each side view to generate aligned guidance features. Then, the block traverses each aligned center view in turn, inputting the original side view features and the corresponding aligned center view features into the nonlocal block network. By calculating the cross-view guidance matrix, the block obtains aggregated feature terms. This process propagates the clear structural and texture information recovered in the center view to each side view. All side views thereby retain their individual view characteristics while remaining highly consistent with the center view in geometry, effectively resolving residual reflections and blurred textures in the side views.

[0047] Finally, the enhanced center view features are output along with all side view features. At this point, the module output contains all view transmission layer features of the light field with cross-view non-local prior constraints, and is passed to the subsequent reconstruction module for further optimization iterations.

[0048] Transmission layer noise reduction module and reflection layer noise reduction module

[0049] Both the Transmission Denoising Module (TDM) and the Reflection Denoising Module (RDM) employ a U-Net-based denoising structure. This architecture follows an encoder-decoder design and introduces skip connections, effectively recovering high-frequency details such as edges and textures. The encoder consists of four encoding blocks, and the decoder consists of four decoding blocks. Each encoding block comprises three 3×3 convolutional layers, one residual layer, and a ReLU nonlinear activation function to generate a 64-channel feature map. The decoder reconstructs the image from the four decoding blocks, each containing three convolutional layers and one residual layer. Importantly, this module is integrated into a dense recurrent framework: at each stage, it not only utilizes the hidden state from the previous time step but also fuses information from all earlier stages through long-range skip connections. This design achieves richer information flow across stages while sharing weights between time steps, thereby improving denoising performance without increasing model parameters.

[0050] Transmission and Reflection Layer Reconstruction Module

[0051] The transmission reconstruction module (TRM) uses the output of the transmission denoising module, the output of the in-view nonlocal similarity module, and the output of the cross-view nonlocal similarity module to reconstruct the updated transmission layer in two steps.

[0052] First, the reconstruction error term is computed by solving equation (9) with one step of gradient descent, as given in equation (11). The quantities in equation (11) are the first penalty parameter, the weight coefficient of the in-view nonlocal similarity regularization term, the weight coefficient of the cross-view nonlocal similarity regularization term, the side view of the light field with side-view index s, and the center view of the light field with center-view index c.

[0053] Subsequently, using the updated error term and the denoised features, the TRM performs the core image reconstruction step. The key to this step is finding an optimal solution that simultaneously satisfies the data fidelity constraint and the regularization prior constraints. The transmission layer subproblem is solved by performing one gradient descent step on equation (9), which yields the refined transmission layer of the current stage, as given in equation (12). The quantities in equation (12) are the second penalty parameter, the weight coefficients of the in-view and cross-view nonlocal similarity regularization terms, the penalty parameter for the auxiliary variables, the side view of the light field with side-view index s, and the center view of the light field with center-view index c. This update formula makes explicit how the transmission layer image is progressively recovered in the TRM. The reflection layer reconstruction module (RRM) follows a similar process.

[0054] To verify the effectiveness of the proposed ICDUN method, comparative experiments were conducted on synthetic and real-world datasets against several existing state-of-the-art methods. The comparison methods cover three main technical approaches: first, single-image-based reflection removal methods, including ERRNet, IBCLN, Kim et al., Dong et al., V-DESIRR, MCDRNet, and DExNet; second, multi-image-based reflection removal methods, including Liu et al. and Niklaus et al.; and third, light field-based reflection removal methods, including Li et al., MIBRR, LFRBSNet, and DMINet. Furthermore, to verify the model's robustness under sparse views, comparative experiments were also conducted at a 3×3 angular resolution.

[0055] Figure 9 presents the average performance of the 13 competing methods on the synthetic and real datasets. The model of this invention achieves the highest PSNR and the highest SSIM. Figures 7 and 8 present the visual results; all images shown are center views of the light field images. This invention recovers subtle texture features and sharp edge structures more accurately than the comparative methods.

[0056] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below in conjunction with specific implementation methods.

[0057] Dataset

[0058] This invention utilizes a state-of-the-art light field (LF) reflection dataset, comprising 400 synthetic scenes and 50 real-world scenes for training, and 20 synthetic scenes and 20 real-world scenes for testing. All light field scenes were captured using a Lytro camera, whose parameters (such as focal length and aperture diameter) can be easily adjusted to the desired settings via a knob. The acquired raw light field data was then decoded into a 4D light field representation using LF Toolbox v0.4. Furthermore, these scenes encompass diverse lighting conditions, rich indoor and outdoor environments, and a large number of object categories, including people, buildings, animals, and plants.

[0059] Comparison Algorithm

[0060] This invention was compared with several existing state-of-the-art methods on both synthetic and real-world datasets. These include single-image-based reflection removal methods such as ERRNet, IBCLN, Kim et al., Dong et al., V-DESIRR, MCDRNet, and DExNet; multi-image-based reflection removal methods such as Liu et al. and Niklaus et al.; and light field-based reflection removal methods such as Li et al., MIBRR, LFRBSNet, and DMINet. For the single-image methods, each view of the light field is fed individually into the single-image reflection removal network for inference, yielding the transmission layer prediction for the corresponding view. Finally, the PSNR and SSIM are averaged over all view results and used as the final evaluation result of the transmission layer estimate for the scene.
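The per-view averaging of PSNR described above can be sketched as follows. This is an illustrative sketch: the function names `psnr` and `mean_view_psnr`, the flat-list image representation, and the peak value of 1.0 are assumptions, and SSIM would be averaged over the views in the same way.

```python
import math


def psnr(pred, target, peak=1.0):
    """PSNR in dB between two equally sized images given as flat lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak * peak / mse)


def mean_view_psnr(pred_views, target_views, peak=1.0):
    """Average the per-view PSNR over all views of one light field."""
    scores = [psnr(p, t, peak) for p, t in zip(pred_views, target_views)]
    return sum(scores) / len(scores)


target_views = [[0.5] * 4, [0.5] * 4]
pred_views = [[0.6] * 4, [0.51] * 4]  # per-view MSE of 1e-2 and 1e-4
score = mean_view_psnr(pred_views, target_views)
```

Averaging the per-view scores, rather than pooling all pixels into a single MSE, gives every view equal weight in the final number.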

[0061] Experimental setup

[0062] The ICDUN implementation of this invention is based on PyTorch and trained on an RTX 4090 GPU using the Adam optimizer (β1=0.9 and β2=0.999, where β1 and β2 represent the exponential decay rates of gradient mean estimation and gradient non-central variance estimation, respectively), for a total of 300 training epochs. The angular resolution of the input light field is 5 × 5. Data augmentation techniques such as random rotation and flipping are employed during training. This invention extracts overlapping image patches with a spatial size of 32 × 32 from the image as network input. The initial learning rate during training is 2 × 10⁻⁶. −4 The batch size during training is set to 4, meaning that in each iteration optimization step, the network processes 4 sets of light field data samples in parallel.
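The extraction of overlapping 32 × 32 training patches can be sketched as follows. This is a minimal sketch under stated assumptions: the source gives the patch size but not the stride, so `stride=16` is a hypothetical value, and the function names and the list-of-rows image layout are illustrative.

```python
def patch_origins(height, width, patch=32, stride=16):
    """Top-left corners of overlapping patch x patch crops.

    A stride smaller than the patch size makes neighbouring patches
    overlap; the stride value here is an illustrative assumption.
    """
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


def extract_patches(image, patch=32, stride=16):
    """Crop patches from a 2D image stored as a list of rows."""
    height, width = len(image), len(image[0])
    return [[row[c:c + patch] for row in image[r:r + patch]]
            for r, c in patch_origins(height, width, patch, stride)]


image = [[r * 64 + c for c in range(64)] for r in range(64)]
patches = extract_patches(image)
```

For a light field, the same spatial crop would be applied to every view so that the angular structure of each training sample is preserved.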

[0063] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Any simple modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of the present invention shall still fall within the protection scope of the present invention.

Claims

1. A light field image reflection removal method based on a prior-knowledge-guided deep unfolding network, characterized in that it comprises the following steps: (1) inputting a mixed light field image with reflection interference; (2) predicting the center view disparity map using the light field disparity estimation module; (3) entering the deep unfolding iteration stage, sending the current features into the denoising module, and using a convolutional neural network to update the auxiliary variables to extract local priors; (4) extracting single-view texture self-similarity features using the intra-view non-local similarity module, while using the disparity map to guide the cross-view non-local similarity module to perform feature alignment and aggregation and to extract cross-view geometric consistency features; (5) sending the denoised auxiliary variables and the outputs of the intra-view and cross-view non-local similarity modules into the reconstruction module, and updating the transmission layer and reflection layer components by gradient descent; (6) repeating steps (3)-(5) until the preset number of iterations is reached, reconstructing the final transmission layer features through the decoder, and outputting the light field image with reflections removed.

2. The light field image reflection removal method based on a prior-knowledge-guided deep unfolding network according to claim 1, characterized in that: in step (2), the light field disparity estimation module uses a three-layer convolutional network with shared weights to independently extract high-dimensional features of each sub-aperture image; subsequently, in the center view reference coordinate system, different disparity values are hypothesized for each spatial location within a preset disparity range, and the feature matching cost between corresponding positions of the center view and each side view is calculated, thereby constructing a multi-dimensional matching cost representation that jointly encodes spatial location, view index, and disparity hypothesis; a 3D convolutional layer is used to regularize the cost volume to aggregate contextual information, and a continuous disparity map is obtained by regression using a soft-argmin operation; the soft-argmin operation obtains the disparity probability distribution by performing softmax normalization on the regularized cost volume along the disparity dimension, the softmax function mapping the similarity scalars to normalized weights in probabilistic form and ensuring that the weights of all neighboring pixels participating in the aggregation sum to 1; the mathematical expectation over all disparity hypotheses is then calculated to obtain a continuous disparity estimate with sub-pixel accuracy.

3. The light field image reflection removal method based on a prior-knowledge-guided deep unfolding network according to claim 2, characterized in that: in step (4), the intra-view non-local similarity module uses the repeated textures within the image to enhance structural information and suppress noise and artifacts that do not conform to the statistics of natural images; first, the feature map output by the previous network, or the initialized feature map, is received as input; the input feature tensor is decoupled along the angular dimension and split into a series of independent single-view image features; the dimension of the input transmission layer feature is determined by its angular resolution, spatial resolution, and number of channels, and after decoupling the feature is regarded as a set of independent feature slices; each decoupled single-view image feature is then input in parallel to the non-local block unit; the non-local block unit limits the search range to a local neighborhood centered on the current pixel; a similarity scalar is calculated for each pixel pair in the neighborhood and the similarity weights are normalized using a softmax operation, that is, the similarity scalars of the pixel pairs in the neighborhood are mapped through an exponential operation to normalized weights between 0 and 1; the redundant information in the neighborhood is then used to enhance the feature representation of the current pixel through weighted aggregation; the output of the non-local block is added to the original input features through a residual connection; all independently processed single-view features are rearranged and aggregated along the angular dimension to restore the initial light field feature structure, thereby outputting the refined transmission layer T and reflection layer R features enhanced by in-view non-local priors.

4. The light field image reflection removal method based on a prior-knowledge-guided deep unfolding network according to claim 3, characterized in that: in step (4), the cross-view non-local similarity module includes the center view enhancement block CEB and the side view enhancement block SEB; the input transmission layer feature is decomposed along the angular dimension into a center view feature T_c and side view features T_s; the decomposed center view feature, side view features, and disparity are sent to the center view enhancement block, which uses the disparity to perform a geometric warping operation from the side views to the center, aligning all side views to the center view to obtain aligned side view features; each aligned side view feature is sent together with the original center view feature into a non-local block network, and a non-local feature aggregating side view information is obtained by calculating cross-view correlation; after all side views have been processed, the weighted feature terms are aggregated and fused to generate the final enhanced center view feature; after the enhanced center view feature is obtained, it is taken as key guidance information and sent, together with the original side view features and the disparity map, into the side view enhancement block, which performs an inverse geometric alignment operation based on the disparity to map the enhanced center view feature back into the geometric coordinate system of each side view and generate aligned guidance features; each aligned center view is traversed in turn, and the original side view feature and the corresponding aligned center view are input into the non-local block network; by computing a cross-view guidance matrix, the block obtains aggregated feature terms, thereby propagating the clean structure and texture information recovered in the center view to each side view; the enhanced center view feature is output together with all side view features.

5. The light field image reflection removal method based on a prior-knowledge-guided deep unfolding network according to claim 4, characterized in that: in step (5), the reconstruction module includes a transmission layer reconstruction module and a reflection layer reconstruction module; the transmission layer reconstruction module uses the output of the transmission denoising module, the output of the intra-view non-local similarity module, and the output of the cross-view non-local similarity module to reconstruct the updated transmission layer.

6. The light field image reflection removal method based on a prior-knowledge-guided deep unfolding network according to claim 5, characterized in that: in step (5), the transmission layer reconstruction module performs the following steps: (5.1) the reconstruction error term is calculated by solving equation (9) with one step of gradient descent, as given in equation (11), whose quantities are a first penalty parameter, a weight coefficient of the intra-view non-local similarity regularization term, a weight coefficient of the cross-view non-local similarity regularization term, a side view of the light field with side-view index s, and a center view of the light field with center-view index c; (5.2) using the updated error term and the denoised features, the core image reconstruction step is performed, the key to which is finding an optimal solution that simultaneously satisfies the data fidelity constraint and the regularization prior constraint, as given in equation (12), whose additional quantities are a second penalty parameter and a penalty parameter for the auxiliary variables.

7. A light field image reflection removal device based on a prior-knowledge-guided deep unfolding network, characterized in that it comprises: an input module configured to input a mixed light field image with reflection interference; a light field disparity estimation module configured to predict the disparity map of the center view; a denoising module configured to receive the current features upon entering the deep unfolding iteration stage and to use a convolutional neural network to update the auxiliary variables to extract local priors; a feature extraction module configured to use the intra-view non-local similarity module to extract single-view texture self-similarity features, while using the disparity map to guide the cross-view non-local similarity module to perform feature alignment and aggregation and to extract cross-view geometric consistency features; a reconstruction module configured to receive the denoised auxiliary variables and the outputs of the intra-view and cross-view non-local similarity modules, and to update the transmission layer and reflection layer components by gradient descent; and an output module configured to repeatedly execute the denoising module, the feature extraction module, and the reconstruction module until a preset number of iterations is reached, to reconstruct the final transmission layer features through the decoder, and to output the light field image with reflections removed.

8. The light field image reflection removal device based on a prior-knowledge-guided deep unfolding network according to claim 7, characterized in that: the light field disparity estimation module uses a three-layer convolutional network with shared weights to independently extract high-dimensional features of each sub-aperture image; subsequently, in the center view reference coordinate system, different disparity values are hypothesized for each spatial location within a preset disparity range, and the feature matching cost between corresponding positions of the center view and each side view is calculated, thereby constructing a multi-dimensional matching cost representation that jointly encodes spatial location, view index, and disparity hypothesis; a 3D convolutional layer is used to regularize the cost volume to aggregate contextual information, and a continuous disparity map is obtained by regression using a soft-argmin operation; the soft-argmin operation obtains the disparity probability distribution by performing softmax normalization on the regularized cost volume along the disparity dimension, the softmax function mapping the similarity scalars to normalized weights in probabilistic form and ensuring that the weights of all neighboring pixels participating in the aggregation sum to 1; the mathematical expectation over all disparity hypotheses is then calculated to obtain a continuous disparity estimate with sub-pixel accuracy.

9. The light field image reflection removal device based on a prior-knowledge-guided deep unfolding network according to claim 8, characterized in that: in the feature extraction module, the intra-view non-local similarity module uses repeated textures within the image to enhance structural information and suppress noise and artifacts that do not conform to the statistics of natural images; first, it receives the feature map output by the previous network, or an initialized feature map, as input; the input feature tensor is decoupled along the angular dimension and split into a series of independent single-view image features; the dimension of the input transmission layer feature is determined by its angular resolution, spatial resolution, and number of channels, and after decoupling the feature is regarded as a set of independent feature slices; each decoupled single-view image feature is then input in parallel to a non-local block unit; the non-local block unit restricts the search range to a local neighborhood centered on the current pixel; a similarity scalar is calculated for each pixel pair in the neighborhood and the similarity weights are normalized using a softmax operation, that is, the similarity scalars of the pixel pairs in the neighborhood are mapped through an exponential operation to normalized weights between 0 and 1; the feature representation of the current pixel is then enhanced through weighted aggregation of the redundant information in the neighborhood; the output of the non-local block is added to the original input features through a residual connection; all independently processed single-view features are rearranged and aggregated along the angular dimension to restore the initial light field feature structure, thereby outputting the refined transmission layer T and reflection layer R features enhanced by in-view non-local priors; the cross-view non-local similarity module includes the center view enhancement block (CEB) and the side view enhancement block (SEB); the input transmission layer features are decomposed along the angular dimension into center view features and side view features; the decomposed center view features, side view features, and disparity are fed into the center view enhancement block, which uses the disparity to perform a geometric warping operation from the side views to the center view, aligning all side views to the center view; each aligned side view feature is traversed in turn and fed together with the original center view feature into a non-local block network; non-local features aggregating side view information are obtained by calculating cross-view correlations; after all side views have been processed, these weighted feature terms are aggregated and fused to generate the final enhanced center view feature; after the enhanced center view feature is obtained, it is used as key guiding information and sent to the side view enhancement block together with the original side view features and the disparity map; based on the disparity, an inverse geometric alignment operation maps the enhanced center view feature back to the geometric coordinate system of each side view to generate aligned guidance features; each aligned center view is traversed in turn, and the original side view features and the corresponding aligned center view are input into the non-local block network; by calculating the cross-view guidance matrix, the block obtains aggregated feature terms, so that the clear structural and texture information restored in the center view is propagated to each side view; the enhanced center view feature is output together with all side view features.

10. The light field image reflection removal device based on a prior-knowledge-guided deep unfolding network according to claim 9, characterized in that: in step (5), the reconstruction module includes a transmission layer reconstruction module and a reflection layer reconstruction module; the transmission layer reconstruction module uses the output of the transmission denoising module, the output of the intra-view non-local similarity module, and the output of the cross-view non-local similarity module to reconstruct the updated transmission layer; the transmission layer reconstruction module performs the following steps: (5.1) the reconstruction error term is calculated by solving equation (9) with one step of gradient descent, as given in equation (11), whose quantities are the first penalty parameter, the weight coefficient of the intra-view non-local similarity regularization term, the weight coefficient of the cross-view non-local similarity regularization term, the side view of the light field with side-view index s, and the center view of the light field with center-view index c; (5.2) using the updated error term and the denoised features, the core image reconstruction step is performed, the key to which is finding an optimal solution that simultaneously satisfies the data fidelity constraint and the regularization prior constraint, as given in equation (12), whose additional quantities are the second penalty parameter and the penalty parameter for the auxiliary variables.