Underwater image restoration method based on adaptive multi-scale large kernel attention module
By combining an adaptive multi-scale large kernel attention module with directional anisotropic convolution, the method addresses the limited adaptability of underwater image restoration methods across different water environments, achieving robust global illumination restoration and detail enhancement and thus improving image quality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications (China)
- Current Assignee / Owner
- GUANGDONG OCEAN UNIVERSITY
- Filing Date
- 2026-02-12
- Publication Date
- 2026-06-23
AI Technical Summary
Traditional underwater image restoration methods struggle to adapt to different aquatic environments, often resulting in artifacts, color deviations, and loss of detail. Furthermore, traditional convolutional networks are less effective at capturing long-range dependencies.
The image is decomposed into low-frequency and high-frequency components by the Haar discrete wavelet transform. In the high-frequency path, a directional anisotropic convolution module and a hybrid domain attention module are combined to perform orientation alignment and global structure modeling. In the low-frequency path, a PolyKernel backbone and an adaptive multi-scale large kernel attention module are introduced to compensate for illumination attenuation and contrast compression.
It achieves robust global illumination restoration under varying lighting conditions, improves the restoration of both local details and global structure, reduces artifacts, and keeps the restored image consistent in both spatial structure and radiometry.

Figure CN122265098A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of image processing technology, and in particular to an underwater image restoration method based on an adaptive multi-scale large kernel attention module.
Background Technology
[0002] Due to the wavelength-dependent absorption and scattering characteristics of light in water, underwater images often exhibit significant color casts, low contrast, loss of detail, and blurring. These degradation phenomena significantly weaken the robustness and reliability of subsequent visual tasks such as detection, segmentation, and 3D reconstruction. The goal of underwater image restoration is to recover the details and textures lost due to scattering and correct color distortion, thereby obtaining high-quality images and providing reliable visual input for downstream applications.
[0003] In recent years, as visual tasks have increasingly demanded global modeling capability and efficient inference, traditional convolutional algorithms, whose small receptive fields make it difficult to capture long-range dependencies, have gradually lost ground in many domains. However, by increasing the kernel size or introducing reparameterization mechanisms, large-kernel CNNs (LK-CNNs) can approximate the global modeling capability of Transformers while retaining the inherent inductive bias and efficiency of convolutional architectures. Traditional underwater image restoration methods rely mainly on physical models or hand-crafted priors. By estimating and correcting illumination, transmittance, or color distribution, these methods can compensate for scattering attenuation and color shifts to some extent, thereby improving visual quality. However, because they depend heavily on preset physical assumptions and fixed priors, they struggle to adapt to different aquatic environments and are prone to artifacts, color deviations, and detail loss in complex or unknown scenes.
Summary of the Invention
[0004] The purpose of this invention is to propose an underwater image restoration method based on an adaptive multi-scale large kernel attention module that solves one or more technical problems of the prior art, or at least provides a beneficial alternative.
[0005] An underwater image restoration method based on an adaptive multi-scale large kernel attention module includes the following steps. The acquired underwater image is decomposed into four sub-bands by the Haar discrete wavelet transform: one low-frequency component LL and three directional high-frequency components, namely the vertical high-frequency component LH, the horizontal high-frequency component HL, and the diagonal high-frequency component HH. The network uses a directional anisotropic convolution module as its core alignment mechanism. This module aligns an ultra-long strip receptive field with the inherent direction of the corresponding wavelet sub-band and is stacked deeply in the high-frequency branch, thereby explicitly reconstructing the anisotropic textures and edge structures damaged by underwater scattering.
[0006] To compensate for finer-grained high-frequency losses, a hybrid domain attention module is introduced at the highest-resolution stage of the wavelet-decomposed image; through frequency-space interaction, it further enhances microstructure and fine texture. In addition, a compound shape convolution module is placed at the bottleneck of the high-frequency branch. Using polymorphic large kernels, it captures long-range dependencies and global anisotropic structures, strengthening directional information at the global scale. Together, these components let the high-frequency branch achieve local detail enhancement, orientation alignment, and global structural modeling within a lightweight design.
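As an illustration of the decomposition step, the following is a minimal pure-Python sketch of a one-level orthonormal 2-D Haar DWT on a single-channel image stored as nested lists. It is not the authors' implementation. Which mixed band is called LH and which HL is an assumption here, chosen to match the k×1 (LH) and 1×k (HL) kernel pairing given in [0009]; wavelet libraries label these bands differently.

```python
from typing import List, Tuple

Image = List[List[float]]


def haar_dwt2(img: Image) -> Tuple[Image, Image, Image, Image]:
    """One-level orthonormal 2-D Haar DWT; returns (LL, LH, HL, HH).

    Assumed naming: LH differences across columns (vertical structures),
    HL differences across rows (horizontal structures), HH diagonally.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    bands = [[[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)]
    ll, lh, hl, hh = bands
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]          # top-left, top-right
            c, d = img[i + 1][j], img[i + 1][j + 1]  # bottom-left, bottom-right
            r, s = i // 2, j // 2
            ll[r][s] = (a + b + c + d) / 2  # local average (illumination)
            lh[r][s] = (a - b + c - d) / 2  # left-right difference: vertical edges
            hl[r][s] = (a + b - c - d) / 2  # top-bottom difference: horizontal edges
            hh[r][s] = (a - b - c + d) / 2  # diagonal difference
    return ll, lh, hl, hh


# Vertical stripes: all detail energy lands in the LH band.
stripes = [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
ll, lh, hl, hh = haar_dwt2(stripes)
print(lh, hl)  # [[1.0, 1.0]] [[0.0, 0.0]]
```

Because the transform is orthonormal, the total energy of the four bands equals that of the input, so illumination (LL) and directional detail (LH, HL, HH) can be processed separately without losing information.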
[0007] The low-frequency path employs the PolyKernel backbone as a large kernel encoder-decoder to model global brightness and contrast variations. In particular, an adaptive multi-scale large kernel attention (AM-LKA) module is introduced at the bottleneck stage to provide more flexible receptive field configuration and stronger contextual modeling capabilities, thereby achieving more robust global illumination recovery under varying lighting conditions.
[0008] After each frequency branch has been processed independently, the network merges the high- and low-frequency features back into the spatial domain with the inverse wavelet transform (IWT). A lightweight unified output refinement-adjustment head (URAH) then performs structural refinement and hue normalization on the fused result, keeping the output consistent in both spatial structure and radiometry.
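The IWT step can be sketched as the exact inverse of a one-level orthonormal 2-D Haar transform. This is a minimal pure-Python illustration, not the authors' code; it assumes the band convention LH = difference across columns, HL = difference across rows, HH = diagonal difference.

```python
from typing import List

Image = List[List[float]]


def haar_idwt2(ll: Image, lh: Image, hl: Image, hh: Image) -> Image:
    """Inverse one-level orthonormal 2-D Haar DWT (sketch).

    Each 2x2 output block is rebuilt from one coefficient of each band.
    """
    h2, w2 = len(ll), len(ll[0])
    out = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for r in range(h2):
        for s in range(w2):
            a, v, hz, dg = ll[r][s], lh[r][s], hl[r][s], hh[r][s]
            out[2 * r][2 * s] = (a + v + hz + dg) / 2          # top-left
            out[2 * r][2 * s + 1] = (a - v + hz - dg) / 2      # top-right
            out[2 * r + 1][2 * s] = (a + v - hz - dg) / 2      # bottom-left
            out[2 * r + 1][2 * s + 1] = (a - v - hz + dg) / 2  # bottom-right
    return out


# Recombining the sub-bands of a vertical-stripe pattern.
img = haar_idwt2([[1.0, 1.0]], [[1.0, 1.0]], [[0.0, 0.0]], [[0.0, 0.0]])
print(img)  # [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
```

Since the inverse is exact, any misalignment or ringing in the output comes from the processed sub-bands themselves, which is why a separate refinement head after the IWT is useful.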
[0009] Furthermore, the directional anisotropic convolution module is implemented as follows. For the high-frequency sub-bands LH, HL, and HH, the module models the corresponding vertical, horizontal, and diagonal details, respectively. To achieve direction-aware large-kernel modeling, it adopts a depthwise separable convolution whose shape matches the sub-band direction: LH uses k×1, HL uses 1×k, and HH uses k×k. The computation is
$$Y = X + \mathrm{PW}\big(\mathrm{DW}_{\mathrm{dir}}(X)\big),$$
where $\mathrm{DW}_{\mathrm{dir}}(\cdot)$ is the depthwise convolution with the orientation assigned to the sub-band, $\mathrm{PW}(\cdot)$ is a pointwise (1×1) convolution for channel mixing and feature fusion, and the residual connection $X + (\cdot)$ stabilizes optimization and maintains directional consistency. $X \in \mathbb{R}^{B\times C\times H\times W}$ is the input feature, a fourth-order tensor with batch size B, C channels, height H, and width W, and Y is the output feature. Multiple directional anisotropic convolution modules are stacked at the deepest position of the high-frequency branch, which has the largest effective receptive field and the highest density of directional high-frequency information. Stacking several layers progressively strengthens the direction-consistent response at low computational cost, thereby compensating more effectively for residual anisotropic degradation.
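The formula Y = X + PW(DW_dir(X)) can be sketched for a single sample (no batch axis) in pure Python. This is an illustration under stated assumptions, not the patented module: the depthwise kernels are fixed averaging kernels shared across channels (the real ones are learned per channel), and `oriented_kernel` and `dac_block` are hypothetical names.

```python
from typing import List

Map = List[List[float]]  # one channel, H x W


def depthwise_conv(x: Map, kernel: Map) -> Map:
    """Same-size 2-D correlation of one channel with zero padding (odd kernel sizes)."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    r, s = i + u - kh // 2, j + v - kw // 2
                    if 0 <= r < h and 0 <= s < w:
                        acc += kernel[u][v] * x[r][s]
            out[i][j] = acc
    return out


def oriented_kernel(band: str, k: int) -> Map:
    """Hypothetical fixed averaging kernel with the shape assigned to each sub-band."""
    kh, kw = {"LH": (k, 1), "HL": (1, k), "HH": (k, k)}[band]
    return [[1.0 / (kh * kw)] * kw for _ in range(kh)]


def dac_block(x: List[Map], band: str, k: int, pw: List[List[float]]) -> List[Map]:
    """Y = X + PW(DW_dir(X)) for one sample with C channels."""
    kernel = oriented_kernel(band, k)
    dw = [depthwise_conv(ch, kernel) for ch in x]  # one filter per channel (shared here)
    h, w = len(x[0]), len(x[0][0])
    y = []
    for o, weights in enumerate(pw):  # weights[c]: 1x1 weight from channel c to channel o
        y.append([[x[o][i][j] + sum(weights[c] * dw[c][i][j] for c in range(len(x)))
                   for j in range(w)] for i in range(h)])
    return y


# Horizontal stripes (rows alternate +1 / -1) fed to the HL branch with a 1 x k kernel.
stripe = [[1.0 if i % 2 == 0 else -1.0 for _ in range(6)] for i in range(6)]
out = dac_block([stripe, stripe], "HL", 5, [[1.0, 0.0], [0.0, 1.0]])
```

With the identity 1×1 mixing used here, interior pixels of the aligned HL branch return twice the input (residual plus a fully coherent strip response), which is the direction-consistent reinforcement the text describes.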
[0010] Furthermore, in the low-frequency domain, underwater degradation is dominated by large-scale illumination attenuation and contrast compression, which stem from wavelength-dependent absorption and spatially correlated scattering. To mitigate this long-range degradation more effectively, the fixed-scale large kernel attention (LKA) at the bottleneck of the PolyKernel backbone is replaced with an adaptive multi-scale large kernel attention module (AM-LKA).
[0011] The adaptive multi-scale large-kernel attention module is located in the bottleneck layer, where features have the smallest spatial resolution and the largest effective receptive field. This allows for more flexible receptive field adaptation, enabling more efficient modeling of global correlations and illumination variations. The module consists of three dilated depthwise convolutional branches with kernel sizes of 7, 11, and 21, and dilation rates of 3, 2, and 1. These different dilation settings provide heterogeneous spatial sampling densities while maintaining a comparable large receptive field. Dense sampling paths preserve local contrast and detail texture, while sparse sampling paths are used to model long-range scattering dependencies, achieving an effective balance between detail fidelity and global smoothness.
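The statement that the three branches keep a comparable receptive field while sampling at different densities can be checked with the standard spatial extent of a dilated kernel, d(k − 1) + 1 pixels per axis. The kernel sizes and dilation rates below are the ones given in [0011]; the helper name is illustrative.

```python
def effective_extent(kernel_size: int, dilation: int) -> int:
    """Pixels spanned per axis by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


for k, d in [(7, 3), (11, 2), (21, 1)]:
    # k taps per axis, one every d pixels
    print(f"k={k}, d={d}: extent {effective_extent(k, d)} px, {k} taps")
# k=7, d=3: extent 19 px, 7 taps
# k=11, d=2: extent 21 px, 11 taps
# k=21, d=1: extent 21 px, 21 taps
```

All three branches span 19 to 21 pixels per axis, but the k = 7 branch samples only every third pixel (sparse, long-range), while the k = 21 branch samples every pixel (dense, detail-preserving).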
[0012] Through a softmax gating mechanism, the adaptive multi-scale large kernel attention module dynamically allocates the weight of each branch according to spatial illumination and energy variations. The gate normalizes the branch responses, and feature fusion and refinement are performed by a lightweight 1×1 convolution and a residual connection:
$$F = \sum_{i=1}^{3} \alpha_i \, \mathrm{DW}_{k_i, d_i}(X), \qquad Y = X + \mathrm{PW}(F),$$
where $\mathrm{DW}_{k_i, d_i}(\cdot)$ is the i-th dilated depthwise convolution branch with kernel size $k_i$ and dilation rate $d_i$; $\alpha_i$ are the attention weights normalized by the softmax gate ($\sum_i \alpha_i = 1$), which adaptively balance the multi-scale responses; F is the weighted output combining all branches; $\mathrm{PW}(\cdot)$ is a pointwise convolution for channel mixing; and the residual connection $X + (\cdot)$ keeps training stable and preserves global brightness consistency. $X, Y \in \mathbb{R}^{B\times C\times H\times W}$ are the input and output feature maps, respectively.
[0013] By compensating for illumination attenuation and modeling long-range spatial correlations, the adaptive multi-scale large kernel attention module establishes a stable low-frequency representation that keeps global brightness and contrast consistent. This representation provides a reliable radiometric baseline for the subsequent frequency fusion, so that the direction-aligned high-frequency branches can reconstruct details on top of a globally balanced tone.
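The gated fusion F = Σ α_i · DW_i(X), Y = X + PW(F) can be sketched per pixel for a single channel. Assumptions: the branch outputs are precomputed, the gate logits are given (the text does not say how they are produced; a small convolution on X would be one hypothetical source), and a 1×1 convolution on one channel reduces to a scalar weight. `am_lka_fuse` is a hypothetical name.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the branch dimension."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def am_lka_fuse(x: List[float], branches: List[List[float]],
                gate_logits: List[List[float]], pw_weight: float = 1.0) -> List[float]:
    """Per-pixel Y = X + PW(sum_i alpha_i * B_i) for one flattened channel.

    branches[i][p]: output of dilated depthwise branch i at pixel p
    gate_logits[p]: one logit per branch at pixel p (hypothetical source)
    """
    y = []
    for p, xp in enumerate(x):
        alpha = softmax(gate_logits[p])  # weights sum to 1 at every pixel
        fused = sum(a * b[p] for a, b in zip(alpha, branches))
        y.append(xp + pw_weight * fused)
    return y


# Equal logits average the three branches; a dominant logit selects one branch.
y = am_lka_fuse([0.5], [[0.0], [0.3], [0.6]], [[0.0, 0.0, 0.0]])
y_sel = am_lka_fuse([0.5], [[0.0], [0.3], [0.6]], [[10.0, 0.0, 0.0]])
```

Because the softmax is evaluated per pixel in this sketch, a dark, strongly scattered region can favor the sparse long-range branch while a bright textured region favors the dense one.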
[0014] Furthermore, in underwater image restoration, even after inverse wavelet reconstruction and sub-band fusion, the frequency-to-spatial conversion may still introduce slight spatial misalignment and ringing artifacts because of backscattering and non-uniform attenuation. To address this, a single terminal module, the Unified Output Refinement-Adjustment Head (URAH), is placed after the inverse wavelet transform (IWT). URAH has a two-stage design consisting of a Structure Alignment Refinement Head (SARH) and a Global Brightness Fine-tuning module (GIMA). The first stage, SARH, operates in the spatial domain and corrects residual misalignment and suppresses ringing artifacts by stacking three depthwise residual blocks with multi-scale dilation rates. Each residual block first expands the channels with a 1×1 convolution, applies a 3×3 depthwise convolution with dilation rate $d_i$, applies a GELU activation, projects back to the original channel dimension with another 1×1 convolution, and finally recalibrates the channels with efficient channel attention (ECA):
$$y_i = x_i + \mathrm{ECA}\Big(\mathrm{PW}_2\big(\mathrm{GELU}(\mathrm{DW}^{3\times3}_{d_i}(\mathrm{PW}_1(x_i)))\big)\Big),$$
where $\mathrm{PW}_1$ and $\mathrm{PW}_2$ are pointwise convolutions for channel expansion and compression; $\mathrm{DW}^{3\times3}_{d_i}$ is the depthwise convolution with dilation rate $d_i$ that extracts multi-scale spatial context; GELU provides non-linearity; ECA provides efficient channel attention; the residual connection $x_i + (\cdot)$ ensures training stability and local consistency; and $x_i$ and $y_i$ are the input and output of the i-th residual block (i = 1, 2, 3). The outputs of the three residual blocks are fused by a 3×3 convolution, and a global residual is added:
$$\hat{x} = x + \psi\big(\mathrm{Agg}(y_1, y_2, y_3)\big),$$
where $\mathrm{Agg}(\cdot)$ gathers the three residual outputs $y_i$ from the different dilation scales, $\psi(\cdot)$ is a 3×3 convolution for spatial fusion and refinement, and the residual connection $x + (\cdot)$ combines the fused result with the original input x to give the spatially enhanced output $\hat{x}$.
On this basis, the second stage, GIMA, performs brightness-gated local contrast enhancement and global tone correction centered on the identity mapping. A luminance projection is computed according to the BT.601 standard, $Y = \mathrm{Proj}_{\mathrm{lum}}(\hat{x})$, and a local control map is generated by a standard convolution and a dilated depthwise convolution:
$$G = \sigma\big(\mathrm{DW}_{\mathrm{dil}}(\mathrm{Conv}(Y))\big),$$
where σ(·) is the Sigmoid function, and Conv(·) and $\mathrm{DW}_{\mathrm{dil}}(\cdot)$ denote the standard convolution and the dilated depthwise convolution, respectively. Local contrast is then enhanced through a weak residual correction that is modulated element-wise (⊙) by G, with its strength controlled by λ = 0.2. Constrained affine and non-linear tone parameters are predicted by global average pooling followed by a lightweight MLP, and the hyperbolic tangent (tanh) restricts each parameter to a range close to the identity mapping. The final global tone mapping uses a channel-mixing matrix to model channel blending, per-channel gain and bias terms for channel-wise adjustment, and a parameter that makes the tone curve non-linear, while clip(·) limits the output to [0, 1] for numerical stability.
As the final integration stage, URAH unifies the spatial structure and global tone after frequency-domain reconstruction: SARH acts directly on the reconstructed spatial features, eliminating phase shifts and ringing patterns that may arise during sub-band reconstruction and achieving structural consistency independent of any specific branch, while GIMA performs global brightness normalization and contrast fine-tuning at low overhead, so that the restored result looks natural and stable.
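The second stage (GIMA) can be illustrated with a small per-pixel sketch. Only the BT.601 luma weights (0.299, 0.587, 0.114) are standard; everything else is an assumption, because the patent's exact formulas are not recoverable from the text: the tanh bound scales (0.1, 0.2, 0.05, 0.2), the operation order clip((g ⊙ (M x) + b)^γ, 0, 1), and treating the non-linear tone parameter as an exponent γ are illustrative choices. In the described module the raw parameters come from global average pooling and an MLP; here they are passed in directly, and `tone_map` and `bounded` are hypothetical names.

```python
import math
from typing import List, Sequence


def bt601_luma(rgb: Sequence[float]) -> float:
    """BT.601 luma: Y = 0.299 R + 0.587 G + 0.114 B."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def bounded(identity: float, raw: float, scale: float) -> float:
    """Identity value plus a tanh-bounded offset (the scale is a hypothetical choice)."""
    return identity + scale * math.tanh(raw)


def tone_map(rgb: Sequence[float], raw_mix: List[List[float]], raw_gain: Sequence[float],
             raw_bias: Sequence[float], raw_gamma: float) -> List[float]:
    """Hypothetical identity-centered tone mapping: clip((g * (M @ x) + b) ** gamma, 0, 1)."""
    m = [[bounded(1.0 if i == j else 0.0, raw_mix[i][j], 0.1) for j in range(3)]
         for i in range(3)]                              # channel mixing, near identity
    gain = [bounded(1.0, z, 0.2) for z in raw_gain]      # per-channel gain, near 1
    bias = [bounded(0.0, z, 0.05) for z in raw_bias]     # per-channel bias, near 0
    gamma = bounded(1.0, raw_gamma, 0.2)                 # tone-curve exponent, near 1
    out = []
    for i in range(3):
        mixed = sum(m[i][j] * rgb[j] for j in range(3))
        v = max(gain[i] * mixed + bias[i], 0.0) ** gamma  # keep the base non-negative
        out.append(min(max(v, 0.0), 1.0))                 # clip to [0, 1]
    return out


zero_mix = [[0.0] * 3 for _ in range(3)]
unchanged = tone_map([0.2, 0.5, 0.8], zero_mix, [0.0] * 3, [0.0] * 3, 0.0)
```

With all raw parameters at zero the mapping reduces exactly to the identity, which reflects the "centered on identity mapping" design: the network only has to learn small, bounded corrections.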
[0015] By combining these two stages, URAH performs a final polishing of the structural integrity and tonal balance of the restored image without relying on any particular branch, providing a robust and consistent output refinement mechanism.
[0016] Furthermore, the training objective jointly optimizes the pixel-level fidelity, structural consistency, perceptual quality, and frequency consistency of the reconstructed underwater image. Given an input-target pair, with restored output $\hat{I}$ and reference image $I$, the total loss is defined as
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{SL1}} + \lambda_{\mathrm{SSIM}}\mathcal{L}_{\mathrm{SSIM}} + \lambda_{\mathrm{UCIQE}}\mathcal{L}_{\mathrm{UCIQE}} + \lambda_{\mathrm{LL}}\mathcal{L}_{\mathrm{LL}} + \lambda_{\mathrm{HF}}\mathcal{L}_{\mathrm{HF}},$$
where $\mathcal{L}_{\mathrm{SL1}}$ is the Smooth L1 loss, which constrains accurate spatial reconstruction and intensity consistency. The structure term $\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(\hat{I}, I)$ complements pixel-level supervision and further enhances structural consistency by preserving contrast and spatial correlation, which is particularly beneficial for underwater scenes with uneven illumination. To further improve color balance at the perceptual level, a differentiable loss term $\mathcal{L}_{\mathrm{UCIQE}}$ based on the underwater evaluation metric UCIQE is introduced, which explicitly constrains color saturation, chroma, and overall contrast. In line with the wavelet decomposition used in the network, sub-band supervision is added to balance learning between low and high frequencies.
[0017] The low-frequency component loss $\mathcal{L}_{\mathrm{LL}}$ maintains global hue and brightness consistency, while the high-frequency loss $\mathcal{L}_{\mathrm{HF}}$ strengthens structure and texture recovery on the three directional sub-bands (HL, LH, HH). The loss weights are set experimentally: $\lambda_{\mathrm{SSIM}} = 0.2$, $\lambda_{\mathrm{UCIQE}} = 0.01$, and $\lambda_{\mathrm{LL}} = \lambda_{\mathrm{HF}} = 0.1$. This multi-term objective lets the network optimize global color restoration and fine-grained detail reconstruction simultaneously within a unified spatial-frequency learning framework.
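A minimal sketch of the loss arithmetic follows. It assumes a unit weight on the Smooth L1 term, the common Smooth L1 threshold beta = 1 (PyTorch's default), and that the four listed coefficients map to the structure, UCIQE, low-frequency, and high-frequency terms in the order those terms are introduced. SSIM and the UCIQE, LL, and HF loss values are taken as precomputed inputs, since their exact definitions are not given in the text.

```python
from typing import Sequence


def smooth_l1(pred: Sequence[float], target: Sequence[float], beta: float = 1.0) -> float:
    """Mean Smooth L1: 0.5 * e^2 / beta if |e| < beta, else |e| - 0.5 * beta."""
    total = 0.0
    for p, t in zip(pred, target):
        e = abs(p - t)
        total += 0.5 * e * e / beta if e < beta else e - 0.5 * beta
    return total / len(pred)


def total_loss(l_sl1: float, ssim: float, l_uciqe: float, l_ll: float, l_hf: float,
               w_ssim: float = 0.2, w_uciqe: float = 0.01,
               w_ll: float = 0.1, w_hf: float = 0.1) -> float:
    """Weighted sum of the five terms; the structure term is 1 - SSIM (assumed mapping)."""
    return l_sl1 + w_ssim * (1.0 - ssim) + w_uciqe * l_uciqe + w_ll * l_ll + w_hf * l_hf


# Example: 0.1 + 0.2 * 0.1 + 0.01 * 1.0 + 0.1 * 0.5 + 0.1 * 0.5 = 0.23
loss = total_loss(0.1, 0.9, 1.0, 0.5, 0.5)
```

The small UCIQE weight keeps the perceptual color term from overriding pixel fidelity, while the equal sub-band weights give low- and high-frequency supervision the same influence.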
[0018] The beneficial effects of this invention are as follows. It provides a compact, theoretically grounded underwater image restoration framework that jointly exploits frequency decomposition and orientation-aware large-kernel modeling. By explicitly separating the directional high-frequency components from the global low-frequency structure, the architecture achieves restoration properties that are difficult to obtain with conventional purely spatial-domain networks. A direction-aligned anisotropic convolution strategy exploits the directional semantics inherent in wavelet sub-bands to recover fine structures that are easily weakened by underwater scattering. The enhanced large-kernel design in the low-frequency path strengthens global consistency, while the URAH head alleviates artifacts and stabilizes the transition from the frequency domain to the spatial domain. Extensive experiments on benchmark datasets verify the effectiveness and generality of the method, with consistent improvements in both structural fidelity and perceptual quality.
Attached Figure Description
[0019] The above and other features of the present invention will become more apparent from the detailed description of the embodiments in conjunction with the accompanying drawings, in which the same reference numerals denote the same or similar elements. The drawings described below are merely some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
Figure 1 is a flowchart illustrating the principle of the underwater image restoration method based on an adaptive multi-scale large kernel attention module;
Figure 2 is a flowchart of the overall process of underwater image restoration based on an adaptive multi-scale large kernel attention module;
Figure 3 shows visualization results of the method in different directional sub-bands;
Figure 4 compares, on underwater images, the visual results of the method with those of other underwater image restoration methods;
Figure 5 shows ablation results for the AM-LKA module, the DAC module, and the two-stage URAH head of the method.
Detailed Implementation
[0020] The following will provide a clear and complete description of the concept, specific structure, and technical effects of the present invention in conjunction with the embodiments and accompanying drawings, so as to fully understand the purpose, solution, and effects of the present invention. It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other.
[0021] In the description of this invention, "several" means one or more and "a plurality of" means two or more; "greater than," "less than," and "exceeding" exclude the stated number, while "above," "below," and "within" include it. The terms "first" and "second" are used only to distinguish technical features and should not be construed as indicating or implying relative importance, the number of the indicated technical features, or their order.
[0022] As shown in Figure 1 and Figure 2, an underwater image restoration method based on an adaptive multi-scale large kernel attention module combines explicit frequency decomposition with polymorphic large-kernel convolution to achieve efficient, direction-aware underwater image restoration. The input underwater image is decomposed by the Haar discrete wavelet transform (DWT) into four sub-bands: one low-frequency component (LL) and three directional high-frequency components (LH, HL, HH), corresponding to the vertical, horizontal, and diagonal directions, respectively. This explicit frequency-domain separation lets the framework handle illumination-related and structure-related degradation in a decoupled manner. For high-frequency restoration, the network uses Directional Anisotropic Convolution (DAC) as its core alignment mechanism: by aligning an ultra-long strip receptive field with the inherent direction of the corresponding wavelet sub-band and stacking it deeply in the high-frequency branch, DAC explicitly reconstructs the anisotropic textures and edge structures damaged by underwater scattering. To compensate for finer-grained high-frequency losses, a Hybrid Domain Attention (HDA) module is introduced at the highest-resolution stage to further enhance microstructure and fine texture through frequency-space interaction. In addition, a Compound Shape Convolution (CSC) module at the bottleneck of the high-frequency branch uses polymorphic large kernels to capture long-range dependencies and global anisotropic structures, strengthening directional information at the global scale. Together, these components let the high-frequency branch achieve local detail enhancement, orientation alignment, and global structural modeling within a lightweight design.
The low-frequency path employs a PolyKernel backbone as a large kernel encoder-decoder to model global brightness and contrast variations. Specifically, an Adaptive Multi-Scale Large Kernel Attention (AM-LKA) module is introduced at the bottleneck stage to provide more flexible receptive field configuration and stronger contextual modeling capabilities, thereby achieving more robust global illumination recovery under varying lighting conditions.
[0023] After each frequency branch has been processed independently, the network merges the high- and low-frequency features back into the spatial domain with the inverse wavelet transform (IWT). A lightweight unified output refinement-adjustment head (URAH) then performs structural refinement and hue normalization on the fused result, keeping the output consistent in both spatial structure and radiometry.
[0024] The directional anisotropic convolution module (DAC) is implemented as follows. For the high-frequency sub-bands LH, HL, and HH, the module models the corresponding vertical, horizontal, and diagonal details, respectively. To achieve direction-aware large-kernel modeling, DAC adopts a depthwise separable convolution whose shape matches the sub-band direction: LH uses k×1, HL uses 1×k, and HH uses k×k. The computation is
$$Y = X + \mathrm{PW}\big(\mathrm{DW}_{\mathrm{dir}}(X)\big),$$
where $\mathrm{DW}_{\mathrm{dir}}(\cdot)$ is the depthwise convolution with the orientation assigned to the sub-band, $\mathrm{PW}(\cdot)$ is a pointwise (1×1) convolution for channel mixing and feature fusion, and the residual connection $X + (\cdot)$ stabilizes optimization and maintains directional consistency. $X \in \mathbb{R}^{B\times C\times H\times W}$ is the input feature, a fourth-order tensor with batch size B, C channels, height H, and width W, and Y is the output feature. Multiple DAC modules are stacked at the deepest position of the high-frequency branch, which has the largest effective receptive field and the highest density of directional high-frequency information. Stacking several layers progressively strengthens the direction-consistent response at low computational cost, thereby compensating more effectively for residual anisotropic degradation.
[0025] Preferably, as shown in Figure 3, ultra-long strip convolution kernels show different directional selectivity depending on whether they are aligned or misaligned with the wavelet sub-band direction. Because all convolution kernels were fixed and non-learnable in this experiment, the response differences stem entirely from the geometry of the receptive-field orientation. On synthetic stripe images with a single principal direction, aligned convolution produces a strong, continuous stripe response, whereas misaligned convolution produces only weak, fragmented edge patterns, clearly revealing the influence of receptive-field orientation. On real underwater images, across different image types and blur levels, aligned convolution produces concentrated, directionally coherent activations, whereas misaligned convolution yields scattered, discontinuous, blurred responses. These observations indicate that a direction-consistent receptive field captures the high-frequency directional cues that survive underwater degradation more reliably.
[0026] From a frequency-domain perspective, underwater imaging attenuates high-frequency textures non-uniformly across directions. The LH, HL, and HH sub-bands correspond to the three principal axes along which high-frequency energy is most concentrated, and the attenuation of high-frequency components is particularly pronounced in these directions. Aligning the receptive-field direction of the convolution with the principal direction of each sub-band strengthens the effective energy response in that direction while suppressing interference from other directions. Therefore, the directional anisotropic convolution module DAC does not merely amplify activations; it achieves directional recovery of anisotropic high-frequency details through directional consistency in the frequency domain.
[0027] Thanks to this directional consistency, the directional anisotropic convolution module (DAC) uses frequency-domain information more effectively and retains more direction-dependent high-frequency detail at the same computational cost. Unlike small kernels or large isotropic kernels, DAC captures direction-sensitive frequency-domain responses, which is crucial for recovering anisotropic textures degraded by scattering. At the same time, the global context captured by large-kernel convolution suppresses local noise amplification, ensuring stable propagation of directional features in the spatial domain. By combining direction-aligned receptive fields with wavelet-domain directional priors, DAC achieves accurate anisotropic detail reconstruction at very low overhead. Unless otherwise specified, k = 21 is used, a setting that achieves a stable accuracy-efficiency balance across multiple datasets: smaller kernels cannot cover long-range directional structures, while excessively large kernels may introduce slight oversmoothing.
[0028] Figure 3 visualizes the responses of aligned and misaligned large strip convolutions to sub-bands of different orientations. The rows correspond to synthetic stripes, clear underwater images, moderately blurred underwater images, and heavily blurred underwater images, respectively. HL = horizontal high-frequency detail sub-band; LH = vertical high-frequency detail sub-band. "Aligned" means the kernel orientation matches the sub-band orientation (HL → 1×k, LH → k×1); "Misaligned" means the kernel orientations are swapped (HL → k×1, LH → 1×k). Aligned responses are more directional and concentrated, while misaligned responses show directional confusion and energy dispersion. The diagonal sub-band (HH) is not shown because DAC uses an orientation-insensitive square kernel for it, so no comparable alignment difference can be produced.
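The aligned-versus-misaligned experiment with fixed, non-learnable kernels can be reproduced in miniature. In this sketch the kernels are assumed to be plain averaging (box) kernels and the input is a synthetic horizontal-stripe map standing in for an HL sub-band; the sizes (16×16, k = 7) are illustrative, not the patent's settings.

```python
from typing import List

Map = List[List[float]]


def box_filter(x: Map, kh: int, kw: int) -> Map:
    """Fixed (non-learnable) kh x kw averaging kernel, zero padding, same-size output."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(i - kh // 2, i + kh // 2 + 1):
                for v in range(j - kw // 2, j + kw // 2 + 1):
                    if 0 <= u < h and 0 <= v < w:
                        acc += x[u][v]
            out[i][j] = acc / (kh * kw)
    return out


def mean_abs(x: Map) -> float:
    """Average response magnitude over the map."""
    return sum(abs(v) for row in x for v in row) / (len(x) * len(x[0]))


# Synthetic horizontal stripes (rows alternate +1 / -1), standing in for an HL map.
size, k = 16, 7
stripes = [[1.0 if i % 2 == 0 else -1.0 for _ in range(size)] for i in range(size)]
aligned = mean_abs(box_filter(stripes, 1, k))     # HL -> 1 x k (aligned)
misaligned = mean_abs(box_filter(stripes, k, 1))  # HL -> k x 1 (swapped)
print(f"aligned={aligned:.3f} misaligned={misaligned:.3f}")
# aligned=0.893 misaligned=0.107
```

The aligned strip averages along the stripe and keeps its full amplitude, whereas the swapped strip averages across alternating rows and cancels most of the signal, which mirrors the strong-versus-fragmented responses described above.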
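To make the cost argument concrete, the sketch below counts depthwise weights (bias ignored) for strip and square kernels at k = 21. The channel count C = 32 is a hypothetical value. Swapping a k×1 kernel for a 1×k kernel (aligned versus misaligned) leaves the count unchanged, while the k×k kernel used for HH costs k times more per channel.

```python
def depthwise_params(channels: int, kh: int, kw: int) -> int:
    """Weights in a depthwise convolution: one kh x kw filter per channel."""
    return channels * kh * kw


C, k = 32, 21  # C is hypothetical; k = 21 as stated in the text
print(depthwise_params(C, k, 1), depthwise_params(C, 1, k), depthwise_params(C, k, k))
# 672 672 14112
```

Per output pixel the multiply-accumulate count scales the same way, so an aligned strip kernel gains directional reach of 21 pixels for the price of 21 weights per channel.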
[0029] Furthermore, in underwater image restoration, even after inverse wavelet reconstruction and sub-band fusion, the frequency-to-spatial conversion may still introduce slight spatial misalignment and ringing artifacts because of backscattering and non-uniform attenuation. To address this, a single terminal module, the Unified Output Refinement-Adjustment Head (URAH), is placed after the inverse wavelet transform (IWT). URAH has a two-stage design consisting of a Structure Alignment Refinement Head (SARH) and a Global Brightness Fine-tuning module (GIMA). The first stage, SARH, operates in the spatial domain and corrects residual misalignment and suppresses ringing artifacts by stacking three depthwise residual blocks with multi-scale dilation rates. Each residual block first expands the channels with a 1×1 convolution, applies a 3×3 depthwise convolution with dilation rate $d_i$, applies a GELU activation, projects back to the original channel dimension with another 1×1 convolution, and finally recalibrates the channels with efficient channel attention (ECA):
$$y_i = x_i + \mathrm{ECA}\Big(\mathrm{PW}_2\big(\mathrm{GELU}(\mathrm{DW}^{3\times3}_{d_i}(\mathrm{PW}_1(x_i)))\big)\Big),$$
where $\mathrm{PW}_1$ and $\mathrm{PW}_2$ are pointwise convolutions for channel expansion and compression; $\mathrm{DW}^{3\times3}_{d_i}$ is the depthwise convolution with dilation rate $d_i$ that extracts multi-scale spatial context; GELU provides non-linearity; ECA provides efficient channel attention; the residual connection $x_i + (\cdot)$ ensures training stability and local consistency; and $x_i$ and $y_i$ are the input and output of the i-th residual block (i = 1, 2, 3). The outputs of the three residual blocks are fused by a 3×3 convolution, and a global residual is added:
$$\hat{x} = x + \psi\big(\mathrm{Agg}(y_1, y_2, y_3)\big),$$
where $\mathrm{Agg}(\cdot)$ gathers the three residual outputs $y_i$ from the different dilation scales, $\psi(\cdot)$ is a 3×3 convolution for spatial fusion and refinement, and the residual connection $x + (\cdot)$ combines the fused result with the original input x to give the spatially enhanced output $\hat{x}$.
On this basis, the second stage, GIMA, performs brightness-gated local contrast enhancement and global tone correction centered on the identity mapping. A luminance projection is computed according to the BT.601 standard, $Y = \mathrm{Proj}_{\mathrm{lum}}(\hat{x})$, and a local control map is generated by a standard convolution and a dilated depthwise convolution:
$$G = \sigma\big(\mathrm{DW}_{\mathrm{dil}}(\mathrm{Conv}(Y))\big),$$
where σ(·) is the Sigmoid function, and Conv(·) and $\mathrm{DW}_{\mathrm{dil}}(\cdot)$ denote the standard convolution and the dilated depthwise convolution, respectively. Local contrast is then enhanced through a weak residual correction that is modulated element-wise (⊙) by G, with its strength controlled by λ = 0.2. Constrained affine and non-linear tone parameters are predicted by global average pooling followed by a lightweight MLP, and the hyperbolic tangent (tanh) restricts each parameter to a range close to the identity mapping. The final global tone mapping uses a channel-mixing matrix to model channel blending, per-channel gain and bias terms for channel-wise adjustment, and a parameter that makes the tone curve non-linear, while clip(·) limits the output to [0, 1] for numerical stability.
As the final integration stage, URAH unifies the spatial structure and global tone after frequency-domain reconstruction: SARH acts directly on the reconstructed spatial features, eliminating phase shifts and ringing patterns that may arise during sub-band reconstruction and achieving structural consistency independent of any specific branch, while GIMA performs global brightness normalization and contrast fine-tuning at low overhead, so that the restored result looks natural and stable.
[0030] The combination of these two features enables the Refine-Adjustment Head URAH to perform a final polishing of the structural integrity and tonal balance of the restored image without relying on any particular branch, providing a robust and consistent output refinement mechanism.
[0031] Furthermore, the training objective simultaneously optimizes the pixel-level fidelity, structural consistency, perceptual quality, and frequency consistency of the reconstructed underwater image. Given an input-target pair, the total loss is defined as a weighted sum of the following terms. The pixel term is the Smooth L1 loss, which constrains accurate spatial reconstruction and intensity consistency. The structure term, defined as (1 − SSIM), complements pixel-level supervision and further enhances structural consistency by preserving contrast and spatial correlation, which is particularly beneficial for underwater scenes with uneven illumination. To further improve color balance at the perceptual level, a differentiable loss term based on the underwater evaluation metric UCIQE is introduced to explicitly constrain color saturation, chroma, and overall contrast. In combination with the wavelet decomposition used in the network, subband supervision is added to balance the learning process between low and high frequencies.
[0032] The low-frequency component loss maintains global tone and brightness consistency, while the high-frequency loss enhances the recovery of structure and texture on the three directional subbands (HL, LH, HH). The weighting coefficients of the losses are set experimentally to 0.2, 0.01, and 0.1, with the value 0.1 shared by two coefficients. This multi-term training objective enables the network to optimize global color restoration and fine-grained detail reconstruction simultaneously within a unified space-frequency learning framework.
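The source gives the weighting coefficients without their symbols, so the mapping used below (Smooth L1 unweighted, 0.2 for the SSIM term, 0.01 for the UCIQE term, 0.1 for each of the low- and high-frequency terms) is an assumption, as are the Smooth L1 threshold `beta = 1.0` and the averaging of the high-frequency loss over the three subbands. This Python sketch shows how such a weighted objective could be assembled from per-term values; the SSIM and UCIQE values are passed in precomputed.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss over two equal-length flat sequences."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic near zero, linear beyond beta.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def ssim_term(ssim_value):
    """Structure term: 1 - SSIM."""
    return 1.0 - ssim_value


def hf_loss(pred_bands, target_bands):
    """Assumed high-frequency loss: Smooth L1 averaged over HL, LH, HH."""
    keys = ("HL", "LH", "HH")
    return sum(smooth_l1(pred_bands[k], target_bands[k]) for k in keys) / len(keys)


# Assumed assignment of the published coefficients to the loss terms.
WEIGHTS = {"pixel": 1.0, "ssim": 0.2, "uciqe": 0.01, "lf": 0.1, "hf": 0.1}


def total_loss(terms, weights=WEIGHTS):
    """Weighted sum of per-term loss values keyed like WEIGHTS."""
    return sum(weights[name] * terms[name] for name in weights)
```

For example, `smooth_l1([0.0, 2.0], [0.0, 0.0])` evaluates to 0.75: the zero difference contributes nothing, and the difference of 2 lies in the linear regime, contributing 2 − 0.5 = 1.5, which is then averaged over two elements.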
[0033] Preferably, training and evaluation are performed on the UIEB, EUVP, and LSUI datasets, strictly following the PolyKernel setup and using the same dataset versions and the same training/test partitions to ensure fairness and direct comparability. The training set contains 800 paired samples from UIEB, 2000 pairs from EUVP, and 2000 pairs from LSUI; evaluation uses the same test sets as PolyKernel, consisting of 90 UIEB images, 200 EUVP images, and 200 LSUI images. In addition, the unpaired RUIE test set is used to evaluate cross-domain generalization.
[0034] For quantitative evaluation, four metrics consistent with PolyKernel are used: PSNR, SSIM, LPIPS, and UCIQE. PSNR and SSIM measure pixel fidelity and structural similarity; LPIPS assesses perceptual quality in a deep feature space; and UCIQE is a no-reference underwater image quality metric that evaluates color richness, saturation, and contrast. Together, these complementary metrics provide a comprehensive assessment of reconstruction quality.
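Of the four metrics, PSNR is simple enough to state directly: PSNR = 10·log10(MAX² / MSE), where MAX is the peak intensity. The following Python sketch assumes intensities scaled to [0, 1] and flattened into a list; the exact evaluation conventions used with PolyKernel (color space, cropping, data range) are not specified in the text. SSIM, LPIPS, and UCIQE require considerably more machinery and are not sketched here.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity sequences."""
    if not pred or len(pred) != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_val ** 2 / mse)
```

For example, a uniform error of 0.1 on a [0, 1] scale gives MSE = 0.01 and hence 20 dB.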
[0035] As shown in Figure 4, the underwater image restoration method based on the adaptive multi-scale large kernel attention module (Ours) is visually compared with two representative traditional methods (ADPCC and WWPF), two high-performance deep learning methods (Semi-UIR and URSCT), and the PolyKernel baseline, alongside the corresponding ground-truth reference images. The proposed method (Ours) produces sharper edges and finer textures while maintaining good contrast without oversaturation; its colors are also more natural, and details in dark regions are better preserved. These visual results further validate the effectiveness of the method in perceptual quality and structural restoration.
[0036] Preferably, ablation experiments show that adaptive large-kernel modeling, the orientation alignment mechanism, and the two-stage refinement module act synergistically within a unified spatial and frequency feature representation, thereby supporting stronger overall restoration quality and robustness.
[0037] Ablation experiments were conducted under standardized training and evaluation settings, with results averaged over UIEB, EUVP, and LSUI, as shown in Figure 5. The baseline model is a U-Net combined with the Haar discrete wavelet transform (DWT) and the inverse wavelet transform (IWT): the input is first decomposed into low-frequency and high-frequency subbands; the high-frequency (HF) branch retains only HDA and CSC; the low-frequency (LF) path uses a simplified backbone without the adaptive multi-scale large kernel attention (AM-LKA) module at the bottleneck; and the URAH head is not used. The results show that introducing AM-LKA at the bottleneck of the low-frequency path expands the effective receptive field and strengthens global low-frequency modeling, yielding consistent improvements across all evaluation metrics. Further enabling the subband-aligned directional depthwise large-kernel convolution DAC (1×k / k×1 / k×k) in the high-frequency branch sharpens edges and reconstructs anisotropic textures, demonstrating the effectiveness of the proposed alignment mechanism for high-frequency reconstruction. Together, AM-LKA and the directional anisotropic convolution DAC form a more robust space-frequency backbone than either module alone, thereby improving overall performance.
[0038] Introducing SARH after the inverse wavelet transform (IWT) effectively suppresses the spatial misalignment and slight ringing artifacts left over from the frequency-to-space conversion, producing more coherent gradients and clearer contours. GIMA stabilizes brightness statistics through subtle, identity-centered adjustments at extremely low overhead. When combined as the two-stage URAH, they produce the most stable and visually consistent results: SARH refines spatial correspondence while GIMA gently fine-tunes the global tone, jointly improving spatial-radiometric consistency and overall robustness.
[0039] Although the invention has been described in considerable detail with reference to several embodiments, it is not intended to be limited to any such details or embodiments; rather, it should be construed to effectively cover the intended scope of the invention. Furthermore, the invention has been described above in terms of embodiments foreseen by the inventors in order to provide a useful description, and insubstantial modifications of the invention that are not presently foreseen may nonetheless represent equivalents thereof.
Claims
1. An underwater image restoration method based on an adaptive multi-scale large kernel attention module, characterized in that the method comprises the following steps: the acquired underwater image is decomposed into four subbands using the Haar discrete wavelet transform, the four subbands comprising a low-frequency component LL and three directional high-frequency components, namely the vertical high-frequency component LH, the horizontal high-frequency component HL, and the diagonal high-frequency component HH; a directional anisotropic convolution module, serving as the core alignment mechanism, aligns an ultra-long strip-shaped receptive field with the inherent direction of the corresponding wavelet subband and is stacked in depth in the high-frequency branch; at the highest-resolution stage of the Haar wavelet decomposition, a hybrid domain attention module is introduced to further enhance microstructure and fine texture through frequency-space interaction; a composite shape convolution module is added at the bottleneck of the high-frequency branch to capture long-range dependencies and global anisotropic structures with large kernels of multiple shapes, thereby enhancing directional information at the global scale; in the low-frequency path, the PolyKernel backbone serves as a large-kernel encoder-decoder, and an adaptive multi-scale large kernel attention module is introduced at the bottleneck stage; after each frequency branch is processed independently, the network reintegrates the high- and low-frequency features into the spatial domain through the inverse wavelet transform; and the fused result is then structurally refined and tone-normalized by a lightweight unified output refinement-adjustment head.
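A single-level 2D Haar decomposition can be written directly on 2×2 pixel blocks. The Python sketch below uses the orthonormal Haar filters (a factor of 1/2 in 2D) and shows that the inverse transform recovers the input exactly. Assigning the left-right difference to LH (vertical structures, matched by the k×1 kernel of claim 2) and the top-bottom difference to HL follows the claims' description, but subband naming differs between wavelet libraries, so this mapping is an assumption.

```python
def haar_dwt2(img):
    """Single-level orthonormal 2D Haar DWT of an image with even height and width."""
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    bands = {name: [[0.0] * (w // 2) for _ in range(h // 2)]
             for name in ("LL", "LH", "HL", "HH")}
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r, s = i // 2, j // 2
            bands["LL"][r][s] = (a + b + c + d) / 2  # low-frequency approximation
            bands["LH"][r][s] = (a - b + c - d) / 2  # left-right difference: vertical structures
            bands["HL"][r][s] = (a + b - c - d) / 2  # top-bottom difference: horizontal structures
            bands["HH"][r][s] = (a - b - c + d) / 2  # diagonal structures
    return bands


def haar_idwt2(bands):
    """Inverse of haar_dwt2: rebuild each 2x2 block from its four coefficients."""
    ll, lh, hl, hh = (bands[name] for name in ("LL", "LH", "HL", "HH"))
    h2, w2 = len(ll), len(ll[0])
    img = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for r in range(h2):
        for s in range(w2):
            A, V, H, D = ll[r][s], lh[r][s], hl[r][s], hh[r][s]
            img[2 * r][2 * s] = (A + V + H + D) / 2
            img[2 * r][2 * s + 1] = (A - V + H - D) / 2
            img[2 * r + 1][2 * s] = (A + V - H - D) / 2
            img[2 * r + 1][2 * s + 1] = (A - V - H + D) / 2
    return img
```

Under this convention, the high-frequency energy of a vertical step edge such as [[0, 1], [0, 1]] lands entirely in LH, while HL and HH stay zero.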
2. The underwater image restoration method based on an adaptive multi-scale large kernel attention module according to claim 1, characterized in that the directional anisotropic convolution module is implemented as follows: for the high-frequency subbands LH, HL, and HH, the directional anisotropic convolution module models the corresponding vertical, horizontal, and diagonal details, respectively; to achieve direction-aware large-kernel modeling, it adopts depthwise separable convolutions consistent with the subband direction, namely k×1 for LH, 1×k for HL, and k×k for HH; in the corresponding calculation, one operator denotes the orientation-specific depthwise convolution, another performs channel mixing and feature fusion, and the residual connection X + (·) is used for stable optimization and to maintain directional consistency, where X, a fourth-order tensor, is the input feature and Y is the output feature.
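Because a depthwise convolution acts on each channel separately, the directional residual convolution of claim 2 can be illustrated on a single channel. In this Python sketch the constant kernel value stands in for learned weights, and the channel-mixing operator is omitted because only one channel is shown; the function names are hypothetical.

```python
def kernel_shape(subband, k):
    """(height, width) of the directional depthwise kernel for a subband."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    shapes = {"LH": (k, 1), "HL": (1, k), "HH": (k, k)}
    return shapes[subband]


def conv2d_same(x, kernel):
    """2D cross-correlation with zero padding; output has the input's size."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    ii, jj = i + u - ph, j + v - pw
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[u][v] * x[ii][jj]
            out[i][j] = acc
    return out


def dac_channel(x, subband, k, weight=1.0):
    """Residual directional depthwise convolution on one channel: x + conv(x).

    A constant kernel stands in for learned weights; the pointwise
    channel-mixing step is omitted because only one channel is shown.
    """
    kh, kw = kernel_shape(subband, k)
    kernel = [[weight] * kw for _ in range(kh)]
    y = conv2d_same(x, kernel)
    return [[x[i][j] + y[i][j] for j in range(len(x[0]))] for i in range(len(x))]
```

An impulse spreads only along the kernel's long axis: with `subband="LH"` the response stays within one column, matching the vertical structures assigned to LH. In a deep-learning framework the same operation would typically be a grouped convolution, for example PyTorch `nn.Conv2d` with `groups` equal to the channel count and `padding=(k // 2, 0)` for the k×1 case.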
3. The underwater image restoration method based on an adaptive multi-scale large kernel attention module according to claim 1, characterized in that the adaptive multi-scale large kernel attention module consists of three dilated depthwise convolution branches with kernel sizes of 7, 11, and 21 and dilation rates of 3, 2, and 1, respectively; the module dynamically allocates the weight of each branch according to spatial variations in illumination and energy through a softmax gating mechanism, which normalizes the responses of the branches; and feature fusion and refinement are performed through a lightweight 1×1 convolution and a residual connection. In the corresponding formula, the i-th convolutional branch has its own kernel size and dilation rate; the attention weights normalized by the softmax gating mechanism adaptively balance the multi-scale responses; the weighted output combines all branches; a pointwise convolution is used for channel mixing; and the residual connection X + (·) maintains stable training and preserves global brightness consistency, where X and Y are the input and output feature maps, respectively.
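For a dilated convolution with kernel size k and dilation rate d, the spatial extent covered along one axis is (k − 1)·d + 1, so the three branches of claim 3 span 19, 21, and 21 pixels: they reach comparable extents with different sampling densities. The Python sketch below computes these extents and illustrates softmax gating followed by a residual connection on flat per-position responses. The per-position gate logits, the identity stand-in for the 1×1 pointwise convolution, and the function names are assumptions for illustration.

```python
import math

# (kernel size, dilation rate) of the three branches, as stated in claim 3.
BRANCHES = [(7, 3), (11, 2), (21, 1)]


def effective_extent(k, d):
    """Pixels spanned along one axis by a k-tap kernel with dilation d."""
    return (k - 1) * d + 1


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(branch_outputs, gate_logits):
    """Per-position softmax-weighted sum of the branch responses.

    branch_outputs[i][p] is branch i's response at position p;
    gate_logits[p] holds one logit per branch for position p.
    """
    fused = []
    for p, logits in enumerate(gate_logits):
        weights = softmax(logits)
        fused.append(sum(w * branch_outputs[i][p] for i, w in enumerate(weights)))
    return fused


def am_lka_output(x, branch_outputs, gate_logits, pointwise=lambda v: v):
    """Y = X + pointwise(fused); the identity default stands in for the 1x1 conv."""
    mixed = pointwise(gated_fusion(branch_outputs, gate_logits))
    return [a + b for a, b in zip(x, mixed)]
```

With equal logits, each branch receives weight 1/3, so branch responses of 3, 6, and 9 fuse to 6 before the residual is added.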
4. The underwater image restoration method based on an adaptive multi-scale large kernel attention module according to claim 1, characterized in that the lightweight unified output refinement-adjustment head URAH has a two-stage design consisting of a structure alignment refinement head SARH and a global brightness fine-tuning module. The first-stage structure alignment refinement head SARH operates in the spatial domain and corrects residual misalignment and suppresses ringing artifacts by stacking three depthwise residual blocks with multi-scale dilation rates. Each residual block first expands the channels with a 1×1 convolution, then applies a 3×3 depthwise convolution with dilation rate... followed by a GELU activation, projects back to the original channel dimension with another 1 × 1 convolution, and finally performs efficient channel attention (ECA) recalibration. Here, the 1×1 convolutions perform pointwise channel expansion and compression; the depthwise convolution with the given dilation rate extracts multi-scale spatial context; GELU provides nonlinearity; ECA introduces efficient channel attention; and the residual connection x_i + (·) ensures training stability and local consistency, where x_i and y_i denote the input and output of the i-th residual block, respectively (i = 1, 2, 3). The outputs of the three residual blocks are fused using a 3 × 3 convolution, after which the global residual is added: the three residual outputs y_i from different dilation scales are aggregated, ψ(·) denotes the 3 × 3 convolution used for spatial fusion and refinement, and the residual connection x + (·) combines the fused and refined result with the original input x to obtain the spatially enhanced output.
On this basis, the second-stage global brightness fine-tuning module performs luminance-gated local contrast enhancement and global tone correction centered on the identity mapping: a luminance projection Y = Proj_lum(·) is computed according to the BT.601 standard, and a local control map is generated through a standard convolution and a dilated depthwise convolution, where σ(·) is the Sigmoid function; local contrast enhancement is achieved through a weak residual correction, in which a coefficient of 0.2 controls the residual strength and the gating is applied by element-wise multiplication; constrained affine and nonlinear tone parameters are predicted using global average pooling and a lightweight MLP, and each parameter is limited by tanh, the hyperbolic tangent function, to a range close to the identity mapping; in the final global tone mapping, one term models channel mixing, two further terms perform channel-wise gain and bias adjustment, respectively, and a nonlinearity term controls the tone curve, while clip(·) limits the output to [0, 1] to maintain numerical stability.
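The ECA recalibration in each SARH residual block of claim 4 can be sketched as follows, after the published ECA-Net design: channel-wise global average pooling, a 1D convolution across neighboring channels, a sigmoid gate, and per-channel rescaling. The adaptive kernel-size rule (γ = 2, b = 1) comes from ECA-Net and is an assumption here, since the text does not state the kernel size; the kernel weights passed in are placeholders for learned values.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """ECA-Net kernel size: int(|(log2 C + b) / gamma|), bumped to the next odd value if even."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca(features, weights):
    """Efficient channel attention on C channels given as flat value lists.

    weights is an odd-length 1D kernel shared across channels (learned in
    practice, placeholders here).
    """
    pooled = [sum(ch) / len(ch) for ch in features]  # global average pooling
    c_total, k = len(pooled), len(weights)
    pad = k // 2
    out = []
    for c in range(c_total):
        # 1D convolution across neighboring channels with zero padding.
        z = sum(weights[j] * pooled[c + j - pad]
                for j in range(k) if 0 <= c + j - pad < c_total)
        gate = 1.0 / (1.0 + math.exp(-z))
        out.append([gate * v for v in features[c]])
    return out
```

With all-zero weights every gate equals sigmoid(0) = 0.5, which halves each channel; learned weights let the gate differ per channel depending on the pooled statistics of its neighbors.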
5. The underwater image restoration method based on an adaptive multi-scale large kernel attention module according to claim 4, characterized in that the Refinement-Adjustment Head URAH serves as the final integration stage, unifying and coordinating the spatial structure and global tone after frequency-domain reconstruction; the Structure Alignment Refinement Head SARH acts directly on the reconstructed spatial features, eliminating phase shifts and ringing patterns that may arise during subband recombination and achieving structural consistency independent of any specific branch; the Global Brightness Fine-tuning Module GIMA performs global brightness normalization and contrast fine-tuning at low overhead; and by combining SARH and GIMA, URAH performs the final polishing of the structural integrity and tonal balance of the restored image without relying on any specific branch, providing a robust and uniform output refinement mechanism.
6. The underwater image restoration method based on an adaptive multi-scale large kernel attention module according to claim 1, characterized in that the training objective simultaneously optimizes the pixel-level fidelity, structural consistency, perceptual quality, and frequency consistency of the reconstructed underwater image; given an input-target pair, the total loss is defined as a weighted sum in which the Smooth L1 loss constrains precise spatial reconstruction and intensity consistency; the structure term, defined as (1 − SSIM), complements pixel-level supervision and further enhances structural consistency by preserving contrast and spatial correlation; a differentiable loss term based on the underwater evaluation metric UCIQE is introduced; the low-frequency component loss maintains global tone and brightness consistency; the high-frequency loss enhances structure and texture recovery on the three directional subbands (HL, LH, HH); and the weighting coefficients of the losses are set experimentally to 0.2, 0.01, and 0.1, with the value 0.1 shared by two coefficients.