An image inpainting method and system based on gated AOT-Restormer fusion
By employing a gated AOT-Restormer fusion image restoration method, combined with a structure-aware competitive control mechanism, the problems of insufficient detail restoration and poor global consistency in existing image restoration technologies are solved, achieving high-quality image restoration results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-01-28
- Publication Date
- 2026-06-19
AI Technical Summary
Existing image inpainting techniques are inadequate in terms of detail restoration and global consistency. Traditional methods struggle to restore clear textures and edges when there is extensive loss, and existing deep learning methods are not effective in image detail reconstruction.
An image inpainting method based on gated AOT-Restormer fusion is adopted. By introducing a gated AOT sub-module and a Restormer module, combined with a structure-aware competitive control mechanism, the method adaptively selects either local context modeling or global attention modeling path to optimize the fusion of local and global features.
It improves the quality of image restoration, accurately restores image details and maintains global consistency, solves the problem of blurred restoration results when there is a large amount of missing data in traditional methods, and significantly improves the visual effect.
Figure CN122243815A_ABST
Description
Technical Field
[0001] This invention belongs to the field of computer vision, specifically the area of image inpainting. More specifically, it relates to an image inpainting method and system based on gated AOT-Restormer fusion.
Background Technology
[0002] Image inpainting is an important research area in computer vision, aiming to recover a complete image from partially missing or damaged images. Traditional image inpainting methods rely heavily on interpolation of local information and texture replication, typically only addressing small-scale image loss or damage. When dealing with larger areas of loss, these methods often fail to recover details, resulting in blurry and unnatural restorations. Furthermore, existing image inpainting techniques primarily focus on extracting local features while neglecting the overall semantic and structural information of the image, thus leading to poor global consistency in restoration results.
[0003] With the development of deep learning, especially the emergence of Generative Adversarial Networks (GANs), image inpainting technology has made significant progress. Deep neural network-based inpainting methods can automatically learn high-level image features and generate more natural and realistic results. However, these methods still face challenges in detail restoration and global consistency. Most existing methods rely on simple convolutional operations to extract features. While they can recover most of the image content, they struggle to reconstruct high-frequency details, so textures and edges in the inpainted image are blurred or inconsistent with the original image.
Summary of the Invention
[0004] Objective: To address the issues of insufficient detail restoration and poor global consistency in existing image inpainting techniques, this invention provides an image inpainting method based on gated AOT-Restormer fusion. By introducing a gated AOT submodule and a Restormer module, optimizations are performed on local and global features respectively, thereby effectively improving the quality of image inpainting. The gated AOT-Restormer fusion module, through a structure-aware competitive control mechanism, adaptively selects either local context modeling or global attention modeling paths in different spatial regions, thereby strengthening the AOT branch in regions with complex textures and strengthening the Restormer branch in regions with continuous structures.
[0005] Technical solution: To achieve the above objective, the technical solution adopted by this invention is as follows: An image inpainting method based on gated AOT-Restormer fusion includes the following steps:
Step 1: Collect an image dataset containing diverse scenes and objects, obtain the original images and the corresponding masked images, and construct a paired dataset;
Step 2: Construct a gated AOT-Restormer fusion GAN, which includes a generator and a discriminator; the structure-aware competitive gating fusion mechanism of the AOT-Restormer fusion module is introduced into the generator; the discriminator is a U-Net discriminator, which includes an encoder and a decoder, and a resolution-adaptive attention mechanism and multi-scale feature fusion are introduced into it;
Step 3: Extract multi-scale features from the dataset using the encoder, process the features using the AOT-Restormer fusion module, and fuse local dilated-convolution features and global self-attention features through the structure-aware competitive control mechanism to obtain fused features; progressively upsample the fused features through the decoder to reconstruct the image; perform adversarial training with the U-Net discriminator under an end-to-end training strategy, jointly optimizing the adversarial loss, feature matching loss and perceptual loss to obtain the optimized objective function and thus the trained gated AOT-Restormer fusion GAN;
Step 4: Input the image to be repaired and the corresponding mask image into the trained gated AOT-Restormer fusion GAN for repair, and output the repair result.
[0006] Preferably, the AOT-Restormer fusion module extracts local and global features through dilated convolution and self-attention mechanisms, respectively, and adopts a structure-aware competitive control mechanism to adaptively fuse the local and global features to obtain fused features. The AOT-Restormer fusion module includes an AOT submodule and a Restormer submodule. The AOT submodule includes a multi-scale feature extraction unit and a feature fusion unit. The multi-scale feature extraction unit is used to capture local image details and global contextual information features. The feature fusion unit integrates the multi-scale features extracted from each branch through channel concatenation and convolution to generate contextual features with rich semantic representation.
[0007] Preferably, the AOT-Restormer fusion module employs a structure-aware competitive gating control mechanism to adaptively fuse the output features of the AOT submodule and the Restormer submodule, specifically including: The structural feature map representing edge continuity and texture complexity is extracted from the input features by the structure-aware branch, and combined with the mask-guided features, it is input into the control network along with the multi-scale context features output by the AOT submodule and the global attention features output by the Restormer submodule. The control network generates competitive control weights corresponding to the AOT branch and the Restormer branch respectively through multi-layer convolution and Softmax normalization, so that the two form a complementary constraint relationship in the same spatial location. Based on the competitive adjustment weights, the output features of the AOT submodule and the output features of the Restormer submodule are fused element-wise to generate a fused feature that takes into account both local texture consistency and global structural coherence.
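[0008] Preferably, the fused feature satisfies:
$$F_{\mathrm{fuse}} = w_{\mathrm{AOT}} \odot F_{\mathrm{AOT}} + w_{\mathrm{Res}} \odot F_{\mathrm{Res}}$$
[0009] where $F_{\mathrm{fuse}}$ denotes the fused feature, $F_{\mathrm{AOT}}$ the multi-scale contextual features output by the AOT submodule, $F_{\mathrm{Res}}$ the attention-enhanced features output by the Restormer submodule, and $w_{\mathrm{AOT}}$ and $w_{\mathrm{Res}}$ the corresponding control weights, which satisfy:
$$[w_{\mathrm{AOT}}, w_{\mathrm{Res}}] = \mathrm{Softmax}\big(\mathcal{G}(F_{\mathrm{AOT}}, F_{\mathrm{Res}}, S, M)\big), \qquad w_{\mathrm{AOT}} + w_{\mathrm{Res}} = 1$$
[0010] where $\mathcal{G}(\cdot)$ is the control network, $S$ denotes the structure-aware features, and $M$ denotes the mask-guidance features. The encoder extracts features through three levels of downsampling convolution; the decoder gradually restores the image resolution through upsampling convolution and finally outputs the repair result.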
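The competitive fusion described above can be illustrated with a minimal sketch. It assumes the control network has already produced one raw score (logit) per branch at every pixel; the names `gated_fusion`, `logit_aot` and `logit_res` are hypothetical, and the learned control network itself is not modeled.

```python
import math

def softmax2(a, b):
    """Numerically stable two-way softmax; the two weights sum to 1."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    total = ea + eb
    return ea / total, eb / total

def gated_fusion(f_aot, f_res, logit_aot, logit_res):
    """Fuse two H x W feature maps with per-pixel competitive weights.

    logit_aot / logit_res stand in for the control network's raw outputs
    (hypothetical inputs; in the method they come from a learned network).
    """
    fused = []
    for i in range(len(f_aot)):
        row = []
        for j in range(len(f_aot[0])):
            w_a, w_r = softmax2(logit_aot[i][j], logit_res[i][j])
            # Element-wise weighted sum: raising one weight lowers the other.
            row.append(w_a * f_aot[i][j] + w_r * f_res[i][j])
        fused.append(row)
    return fused

# Toy 2 x 2 example: the AOT branch outputs 1.0 everywhere, Restormer 3.0.
f_aot = [[1.0, 1.0], [1.0, 1.0]]
f_res = [[3.0, 3.0], [3.0, 3.0]]
logit_aot = [[0.0, 5.0], [-5.0, 0.0]]   # high value favours the AOT branch
logit_res = [[0.0, 0.0], [0.0, 0.0]]
fused = gated_fusion(f_aot, f_res, logit_aot, logit_res)
```

Where the two logits are equal the result is the plain average of the branches; where one logit dominates, the fused value approaches that branch's feature, which is the complementary constraint the text describes.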
[0011] Preferably, the optimization objective function is as follows:
[0012]
[0013]
[0014] In these formulas, the symbols denote, in order: the total loss; the initial baseline weights of the loss terms; the mask area ratio; the structural complexity; the average structural complexity of the training set; the current repair confidence; the global adversarial loss; the weighting coefficient of the texture contrast loss; the weighting coefficient of the semantic contrast loss; the texture contrast loss; the semantic contrast loss; and the pixel-level adversarial loss.
[0015] Preferably, the loss function combines contrast loss, adversarial loss and L1 reconstruction loss: the contrast loss extracts multi-layer features through the discriminator and measures the similarity between the repaired result and the real image at the texture and semantic levels; the adversarial loss improves the realism of the repaired area through a global discriminator and a pixel-level discriminator; the L1 reconstruction loss constrains pixel-level consistency between the repaired image and the real image.
[0016] Preferably, the contrast loss comprises a texture contrast loss and a semantic contrast loss. For the texture contrast loss, the L1 distance between the repaired-image features and the real-image features is computed at the intermediate feature layers of the discriminator and compared with the corresponding distance for the damaged-image features. The formula is as follows:
[0017] In this formula, the symbols denote, in order: the texture contrast loss; the total number of discriminator feature layers used to compute it; the i-th-layer feature map of the repaired image; the i-th-layer feature map of the real image; the i-th-layer feature map of the damaged image; and a smoothing coefficient. For the semantic contrast loss, similarity is computed at the final feature layer of the discriminator through normalization and cross-entropy loss, as shown in the formula:
[0018] In this formula, the symbols denote, in order: the semantic contrast loss; the repaired-image features output by the final feature layer of the discriminator; and the real-image features output by that layer. MLP is a dimensionality-reduction network, and CE is the cross-entropy loss function. Texture separability and semantic separability are then computed with the following formulas:
[0019] where x is the original real image, and the other symbols denote the corrupted image, the output restored image, the U-Net discriminator, and the feature map output by the i-th intermediate feature layer of the discriminator; N is the number of discriminator feature layers used to compute the texture contrast; and the norm is the L1 norm, i.e. the sum of the absolute values of the elements of the feature map:
[0020] Here the two vectors are the embedding of the real image and the embedding of the repaired image in the discriminator's semantic space, and they are compared by cosine similarity. The weight coefficient of the texture contrast loss and the weight coefficient of the semantic contrast loss are then obtained by Softmax normalization:
[0021]
[0022] where the two inputs to the Softmax are the texture separability and the semantic separability, respectively. The total contrast loss is:
$$L_{\mathrm{con}} = \lambda_{\mathrm{tex}} L_{\mathrm{tex}} + \lambda_{\mathrm{sem}} L_{\mathrm{sem}}$$
[0023] where $L_{\mathrm{con}}$ denotes the total contrast loss, $\lambda_{\mathrm{tex}}$ and $\lambda_{\mathrm{sem}}$ are the weight coefficients of the texture contrast loss $L_{\mathrm{tex}}$ and the semantic contrast loss $L_{\mathrm{sem}}$, respectively. The adversarial loss comprises a global adversarial loss and a pixel-level adversarial loss. For the global adversarial loss, the discriminator outputs a global real/fake label:
[0024] In this formula, the symbols denote, in order: the global adversarial loss; the generator; the discriminator; the expectation over the real-image data distribution; the expectation over the generated-image data distribution; the discriminator's output for the real image; and the discriminator's output for the generated image. For the pixel-level adversarial loss, training follows a segmentation-and-confusion adversarial scheme in which the loss is computed separately in the masked and unmasked regions. The L1 reconstruction loss constrains pixel-level consistency between the repaired image and the real image.
[0025] In this formula, the symbols denote, in order: the pixel-level adversarial loss; the repaired image; and the real image. The weighting coefficient of each term, which depends on the complexity of the masked region and on the repair confidence, is as follows:
[0026]
[0027]
[0028] In these formulas, the symbols denote, in order: the initial baseline weights; the mask area ratio; the structural complexity; the average structural complexity of the training set; and the current repair confidence. The overall optimization objective is the following weighted sum of loss functions:
[0029] where the symbols denote the total loss and the weighting coefficients of the individual loss terms.
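As a rough illustration of how the Softmax-normalized weights combine the two contrast losses, the sketch below assumes that the Softmax is applied directly to the texture and semantic separability scores. The source formula is not reproduced in the text, so this exact form, and the names `contrast_weights` and `total_contrast_loss`, are assumptions.

```python
import math

def contrast_weights(sep_tex, sep_sem):
    """Softmax over the two separability scores (assumed form)."""
    m = max(sep_tex, sep_sem)
    e_tex, e_sem = math.exp(sep_tex - m), math.exp(sep_sem - m)
    total = e_tex + e_sem
    return e_tex / total, e_sem / total

def total_contrast_loss(loss_tex, loss_sem, sep_tex, sep_sem):
    """Weighted sum of the texture and semantic contrast losses."""
    w_tex, w_sem = contrast_weights(sep_tex, sep_sem)
    return w_tex * loss_tex + w_sem * loss_sem

# Hypothetical values: texture separability is higher than semantic.
w_tex, w_sem = contrast_weights(0.8, 0.2)
loss = total_contrast_loss(0.5, 1.5, 0.8, 0.2)
```

Because the weights always sum to 1, the total contrast loss stays between the two individual losses and shifts toward whichever term has the larger separability score under this assumed form.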
[0030] Preferably, the method for constructing the paired dataset in Step 1 includes: combining a random mask generation algorithm with a semantically guided mask generation strategy to build a diverse mask library containing different shapes, sizes, and semantic regions; and, based on image content complexity analysis and regional saliency detection, adaptively adjusting the distribution density and coverage of the masks to form training sample pairs with multiple levels of repair difficulty.
[0031] Preferably, a multi-scale feature interaction mechanism is employed, which includes: the input features are transformed into query Q, key K, and value V through three 1×1 convolutions, respectively, where K and V are downsampled by 2×2 max pooling. The similarity matrix between Q and the transpose of K is then computed and normalized by Softmax to obtain the attention weights, which are multiplied by the downsampled V and upsampled to the original resolution. Finally, the attention features are fused with the original input features through a learnable parameter γ, which is initialized to 0 and adaptively adjusted during training, achieving feature enhancement while maintaining spatial structure consistency.
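A minimal sketch of this interaction mechanism follows, on a small H x W x C feature map stored as nested lists. The 1×1 convolutions are modeled as one weight matrix applied at every pixel, and the function names are hypothetical. One simplification is assumed: because the queries stay at full resolution, the attended output already has the input resolution, so no explicit upsampling step appears.

```python
import math

def conv1x1(x, w):
    """1x1 convolution: apply the same C_out x C_in matrix at every pixel."""
    return [[[sum(w[o][c] * px[c] for c in range(len(px)))
              for o in range(len(w))] for px in row] for row in x]

def max_pool2x2(x):
    """2x2 max pooling per channel on an H x W x C map (H and W even)."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    return [[[max(x[i + di][j + dj][k] for di in (0, 1) for dj in (0, 1))
              for k in range(c)]
             for j in range(0, w, 2)] for i in range(0, h, 2)]

def flatten(x):
    """Turn an H x W x C map into a list of H*W channel vectors."""
    return [px for row in x for px in row]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def interaction_attention(x, wq, wk, wv, gamma=0.0):
    """Attention with pooled keys/values, fused back through gamma.

    wv must map C channels to C channels so the residual sum is valid.
    """
    h, w = len(x), len(x[0])
    q = flatten(conv1x1(x, wq))                # H*W queries
    k = flatten(max_pool2x2(conv1x1(x, wk)))   # H*W/4 keys
    v = flatten(max_pool2x2(conv1x1(x, wv)))   # H*W/4 values
    out = []
    for qi, px in zip(q, flatten(x)):
        attn = softmax([sum(a * b for a, b in zip(qi, kj)) for kj in k])
        ctx = [sum(a * vj[c] for a, vj in zip(attn, v)) for c in range(len(px))]
        # gamma = 0 returns the input unchanged; training would adjust it.
        out.append([gamma * ctx[c] + px[c] for c in range(len(px))])
    return [out[i * w:(i + 1) * w] for i in range(h)]

x = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 1.0], [0.5, 0.5]]]
eye = [[1.0, 0.0], [0.0, 1.0]]   # identity weights, for illustration only
y_init = interaction_attention(x, eye, eye, eye, gamma=0.0)
y_full = interaction_attention(x, eye, eye, eye, gamma=1.0)
```

Starting γ at 0 means the block initially behaves as an identity mapping, so the attention path is blended in gradually as γ moves away from zero during training.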
[0032] Another objective of this invention is to provide an image inpainting system based on gated AOT-Restormer fusion, for implementing an image inpainting method based on gated AOT-Restormer fusion, comprising an input unit, a GAN network unit for gated AOT-Restormer fusion, and an output unit, wherein: The input unit is used to input a collected image dataset containing diverse scenes and objects, obtain its original image and the image after masking, and construct a paired dataset. It is used to input the image to be repaired and the corresponding mask image.
[0033] The gated AOT-Restormer fusion GAN network unit is used to construct the gated AOT-Restormer fusion GAN, which includes a generator and a discriminator. The structure-aware competitive gating fusion mechanism of the AOT-Restormer fusion module is introduced into the generator. The discriminator is a U-Net discriminator, which includes an encoder and a decoder, and a resolution-adaptive attention mechanism and multi-scale feature fusion are introduced into it. Multi-scale features are extracted from the dataset using the encoder, and the features are processed using the AOT-Restormer fusion module. The structure-aware competitive control mechanism fuses local dilated-convolution features with global self-attention features to obtain fused features. The fused features are then progressively upsampled by the decoder to reconstruct the image. Adversarial training is performed using the U-Net discriminator under an end-to-end training strategy. The adversarial loss, feature matching loss, and perceptual loss are jointly optimized to obtain the trained gated AOT-Restormer fusion GAN. The image to be repaired and the corresponding mask image are input into the trained gated AOT-Restormer fusion GAN for repair, resulting in a repaired image.
[0034] The output unit is used to output the repaired image.
[0035] Compared with the prior art, the present invention has the following advantages: By fusing Adaptive Dilated Convolution (AOT) and the Restormer structure through a gated fusion mechanism, this invention simultaneously optimizes the restoration of local details and global structure during image inpainting, thereby improving the quality and accuracy of the restored image. The method accurately restores image details while maintaining global consistency. It improves visual quality and addresses the shortcomings of traditional image inpainting methods when processing complex images, giving it significant application value and broad prospects.
Description of the Drawings
[0036] Figure 1 is a flowchart of the image inpainting method and system based on gated AOT-Restormer fusion provided by the present invention;
Figure 2 is a schematic diagram illustrating the principle of Generative Adversarial Networks (GANs);
Figure 3 is the model result diagram of the image inpainting method and system based on gated AOT-Restormer fusion provided by this invention;
Figure 4 is a schematic diagram of the structure-aware competitive gating AOT-Restormer fusion module;
Figure 5 is a diagram of the overall architecture of the U-Net model provided by the present invention.
Detailed Implementation
[0037] The present invention will be further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the invention. After reading this invention, any modifications of the invention in various equivalent forms by those skilled in the art will fall within the scope defined by the appended claims.
[0038] Example 1
This embodiment provides an image inpainting method based on gated AOT-Restormer fusion. An image dataset containing diverse scenes and objects is collected, and the original images and their masked counterparts are used to construct a paired dataset. The image to be inpainted and the corresponding mask image are input, and multi-scale features are extracted by an encoder. The gated AOT-Restormer fusion module processes the features, fusing local dilated-convolution features with global self-attention features through a structure-aware competitive control mechanism. The fused features are progressively upsampled by a decoder to reconstruct the image, and adversarial training is performed using a U-Net discriminator. The model is optimized using a multi-task loss function, and the inpainted image is output. As shown in Figures 1-5, the method includes the following steps:
Step 1: Collect an image dataset containing diverse scenes and objects, obtain the original images and the corresponding masked images, and construct a paired dataset.
[0039] The method for constructing paired datasets involves combining random mask generation algorithms with semantically guided mask generation strategies to build a diverse mask library containing different shapes, sizes, and semantic regions. Simultaneously, based on image content complexity analysis and region saliency detection, the distribution density and coverage of the masks are adaptively adjusted to form training sample pairs with varying levels of restoration difficulty.
[0040] Masks of different shapes, sizes and positions are randomly generated for each image in the dataset, and each mask is kept strictly aligned with its original image, thereby constructing a diverse mask-image paired dataset. In addition, images are resized to 128×128 on input to match the input size of the model.
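The pairing step can be sketched as follows. This is a simplified illustration that assumes a single random rectangle per mask and a synthetic 128×128 grayscale image; the semantically guided and saliency-adaptive masks described above are not modeled, and all function names are hypothetical.

```python
import random

def random_rect_mask(height, width, rng, min_frac=0.1, max_frac=0.5):
    """Binary mask (1 = missing pixel) containing one random rectangle."""
    mask_h = rng.randint(int(height * min_frac), int(height * max_frac))
    mask_w = rng.randint(int(width * min_frac), int(width * max_frac))
    top = rng.randint(0, height - mask_h)
    left = rng.randint(0, width - mask_w)
    return [[1 if top <= i < top + mask_h and left <= j < left + mask_w else 0
             for j in range(width)] for i in range(height)]

def apply_mask(image, mask, fill=0.0):
    """Replace masked pixels with a constant fill value."""
    return [[fill if m else p for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

def make_pair(image, rng):
    """Build one training sample: original image, mask, and masked image."""
    mask = random_rect_mask(len(image), len(image[0]), rng)
    return {"original": image, "mask": mask, "masked": apply_mask(image, mask)}

SIZE = 128  # model input size stated in the text
# Synthetic gradient image with strictly positive pixel values.
image = [[(i + j + 1) / (2 * SIZE) for j in range(SIZE)] for i in range(SIZE)]
pair = make_pair(image, random.Random(0))
mask_ratio = sum(map(sum, pair["mask"])) / (SIZE * SIZE)
```

Keeping the mask in the same sample as the original image is what guarantees the strict mask-image correspondence; `mask_ratio` is the kind of per-sample statistic that could be used to grade repair difficulty.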
[0041] Step 2: Construct a GAN that integrates gated AOT and Restormer. As shown in Figure 2, the gated AOT-Restormer fusion GAN includes a generator and a discriminator. The structure-aware competitive gating fusion mechanism of the AOT-Restormer fusion module is introduced into the generator. The discriminator is a U-Net discriminator, which includes an encoder and a decoder, and a resolution-adaptive attention mechanism and multi-scale feature fusion are introduced into it.
[0042] As shown in Figures 3 and 4, a gated AOT-Restormer fusion module is used to process features. The AOT submodule extracts local contextual information through dilated convolution operations, while the Restormer module uses a self-attention mechanism to model global feature dependencies. The two modules are dynamically fused through structure-aware competitive weight adjustment, where the weights are jointly determined by structure-aware features, mask-guided features, and dual-branch semantic features. This ensures that the model prioritizes global attention features in edge structure regions and dilated-convolution features in texture detail regions. The fused features are then progressively upsampled by the decoder to reconstruct the image. During training, a discriminator based on the U-Net architecture is introduced for adversarial learning, prompting the generator to produce more realistic restoration results.
[0043] The encoder extracts features through three levels of downsampling convolution. The AOT-Restormer fusion module consists of multiple AOT+RestormerBlock modules, which extract local and global features through dilated convolution and self-attention mechanisms, respectively, and fuse the dual-branch features using a structure-aware competitive control mechanism based on Softmax normalization. The decoder gradually restores the image resolution through upsampling convolution, and finally outputs the restored result.
[0044] The AOT-Restormer fusion module includes an AOT submodule and a Restormer submodule. The AOT submodule includes a multi-scale feature extraction unit and a feature fusion unit. The multi-scale feature extraction unit consists of parallel convolutional branches with different dilation rates, used to capture local image details and global contextual information. The feature fusion unit integrates the multi-scale features extracted from each branch through channel concatenation and convolution to generate contextual features with rich semantic representation.
[0045] The AOT submodule employs a multi-branch dilated convolutional structure, with each branch using a different dilation rate to extract multi-scale contextual information. Specific steps include: 1. Perform convolution operations with different dilation rates on the input feature x. Dilated convolution can expand the receptive field without increasing the computational cost and extract multi-scale information by using different dilation rates.
[0046] 2. Concatenate all features with different dilation rates to form a large feature tensor. Then, use a fusion convolutional layer to further process the concatenated multi-scale features to obtain an intermediate feature.
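The two steps above can be sketched in one dimension to keep the example short. The dilation rates (1, 2, 4, 8), the shared kernel, and the averaging fusion weights are assumptions for illustration; the text does not state specific values, and in the method the fusion convolution is learned.

```python
def dilated_conv1d(x, kernel, rate):
    """Same-size 1D convolution whose taps are spaced `rate` samples apart."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for t, w in enumerate(kernel):
            j = i + (t - half) * rate
            if 0 <= j < len(x):        # zero padding outside the signal
                acc += w * x[j]
        out.append(acc)
    return out

def aot_like_block(x, kernel, rates=(1, 2, 4, 8), fuse_weights=None):
    """Parallel dilated branches, concatenation, then a 1x1-style fusion."""
    branches = [dilated_conv1d(x, kernel, r) for r in rates]
    # Channel concatenation: each position now carries len(rates) values.
    stacked = list(zip(*branches))
    if fuse_weights is None:
        fuse_weights = [1.0 / len(rates)] * len(rates)
    # Fusion convolution modeled as a weighted sum over the stacked channels.
    return [sum(w * v for w, v in zip(fuse_weights, pos)) for pos in stacked]

def receptive_field(kernel_size, rate):
    """Span of input samples covered by one dilated kernel."""
    return (kernel_size - 1) * rate + 1

impulse = [0.0] * 8 + [1.0] + [0.0] * 8
response = dilated_conv1d(impulse, [1.0, 1.0, 1.0], rate=4)
fused = aot_like_block(impulse, [1.0, 1.0, 1.0])
```

With a 3-tap kernel, rate 4 spreads the taps across 9 input samples while still using only three weights, which is how the branches gather context at several scales without larger kernels.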
[0047] The Restormer module employs a multi-head self-attention mechanism and a feed-forward network (FFN) structure to capture global semantic information. Specific steps include: 1. The input features are first standardized by layer normalization, which computes the mean and standard deviation of each channel and rescales the features to a distribution with a mean of 0 and a variance of 1. This helps stabilize training, improves the model's convergence speed, and mitigates vanishing or exploding gradients. The formula is as follows:
$$\hat{x} = \frac{x - \mu}{\sigma + \epsilon}$$
[0048] where $x$ denotes the input features, $\mu$ and $\sigma$ are the mean and standard deviation of the features, respectively, and $\epsilon$ is a small constant that prevents division by zero.
[0049] 2. The multi-head self-attention mechanism captures global feature dependencies by computing the relationships between input features. It first maps the input features to query Q, key K, and value V spaces, and then computes the similarity between queries and keys to obtain the attention weight matrices, which reflect the spatial relevance of the input features. With multiple attention heads, the model learns different feature relationships in parallel in different subspaces, which enhances its expressive power. The attention calculation for each head is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
[0050] where $Q$ is the query, $K$ the key, $V$ the value, $d_k$ the dimension of the query vector, and $\sqrt{d_k}$ the scaling factor. Through the multi-head mechanism, the model can focus on different input subspaces and capture semantic information at multiple levels. Finally, the outputs of the heads are concatenated and a linear transformation is applied to obtain the final output.
[0051] 3. The feed-forward network (FFN) consists of two fully connected layers with an activation function between them. Its main function is to further process the features obtained through the self-attention mechanism and extract more abstract features. In the Restormer module, the FFN further enhances the model's expressive power through layer-by-layer nonlinear transformations. The formula is as follows:
$$\mathrm{FFN}(x) = W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2$$
[0052] where $x$ denotes the input features, $W_1$ and $W_2$ are the weight matrices, $b_1$ and $b_2$ are the bias terms, and the activation function is ReLU.
[0053] 4. To preserve the original information of the input features and aid gradient propagation, the Restormer module uses residual connections: the module's output is added to the input features. The formula is as follows:
$$F_{\mathrm{out}} = F + \mathcal{H}(F)$$
[0054] where $F$ represents the original input features and $\mathcal{H}(F)$ the features produced by the module.
[0055] 5. The residual connection preserves the original information while the new features obtained through the self-attention and FFN operations enhance the network's expressive power. The residual connection is computed as:
$$Y = \mathrm{LayerNorm}\big(F + \mathrm{FFN}(\mathrm{MSA}(F))\big)$$
[0056] where the input features are processed by multi-head self-attention (MSA) and the FFN, then added to the input features, and finally layer-normalized to obtain the final output. This keeps the flow of information stable and avoids the information loss or vanishing gradients that occur in deep networks.
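Steps 1 to 5 above can be strung together in a short sketch. It is a simplified, single-head illustration under stated assumptions: the query, key and value projections are taken to be identities, features are a list of token vectors, and the names `restormer_like_block`, `ffn` and `linear` are hypothetical rather than the patent's implementation.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Step 1: rescale a feature vector to zero mean and unit variance."""
    mu = sum(vec) / len(vec)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in vec) / len(vec))
    return [(v - mu) / (sigma + eps) for v in vec]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Step 2: scaled dot-product attention, Softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

def linear(vec, weight, bias):
    """Fully connected layer: weight is a list of rows, one per output."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def ffn(vec, w1, b1, w2, b2):
    """Step 3: two fully connected layers with ReLU in between."""
    hidden = [max(0.0, h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

def restormer_like_block(tokens, w1, b1, w2, b2):
    normed = [layer_norm(t) for t in tokens]
    attended = attention(normed, normed, normed)   # Q = K = V (assumption)
    transformed = [ffn(a, w1, b1, w2, b2) for a in attended]
    # Steps 4-5: residual sum with the input, then a final layer norm.
    return [layer_norm([x + y for x, y in zip(t, f)])
            for t, f in zip(tokens, transformed)]

eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [0.0, 0.0, 0.0]
tokens = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
out = restormer_like_block(tokens, eye, zeros, eye, zeros)
```

The `eps` term also covers the degenerate case of a constant vector, where the standard deviation is zero and the normalized output is simply all zeros.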
[0057] The Restormer submodule includes a feature preprocessing unit, a multi-head attention generation unit, an attention calculation unit, a feature enhancement unit, and a nonlinear transformation unit, wherein: The feature preprocessing unit standardizes the input features through layer normalization.
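[0058] The multi-head attention generation unit uses 1×1 convolutions to generate the query Q, key K, and value V feature matrices in parallel. With $d_k$ denoting the dimension of the query vector, the attention matrix is computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$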
[0059] The attention computation unit is based on depthwise separable convolution to achieve efficient spatial attention modeling and compute long-range dependencies between features.
[0060] The feature enhancement unit fuses the attention calculation results with the original input features through residual connections:
$$F' = F + \mathrm{Attention}(Q, K, V)$$
[0061] where $F$ represents the original input features.
[0062] The nonlinear transformation unit adopts a feed-forward network structure consisting of a 1×1 expansion convolution, a 3×3 depthwise convolution, GELU activation and a 1×1 convolution that reduces the channels back to the input dimension, and achieves deep feature enhancement through a second residual connection.
[0063] The gated AOT-Restormer fusion module processes features, fusing local dilated convolutional features with global self-attention features through a structure-aware competitive control mechanism. The AOT submodule primarily processes input features through multi-branch dilated convolutions to extract features with multi-scale contextual information. The Restormer module utilizes a multi-head self-attention mechanism and a feedforward network to model the global dependencies of input features and extract global attention features. Specific steps include: 1. For input features x, the AOT submodule extracts multi-scale features and uses convolution operations with different dilation rates to expand the receptive field and capture multi-scale information. Simultaneously, the Restormer module captures global contextual relationships and learns long-distance dependencies in the image through a multi-head self-attention mechanism.
[0064] 2. The output features of the AOT and Restormer submodules are fused and regulated by a structure-aware competitive control network. The control network takes the multi-scale context features output by the AOT submodule, the global attention features output by the Restormer submodule, the edge and texture features extracted by the structure-aware branch, and the mask guidance features as input. Through multi-layer convolution mapping and Softmax normalization, spatial adaptive control weights corresponding to the AOT and Restormer branches are generated respectively, so that the two branches form a complementary constraint relationship in the same spatial position.
[0065] 3. Based on the aforementioned spatial adaptive weight adjustment, the multi-scale contextual features output by the AOT submodule and the attention enhancement features extracted by the Restormer submodule are fused element-wise to generate a fused feature that preserves both local details and global structural consistency.
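$$F_{\mathrm{fuse}} = w_{\mathrm{AOT}} \odot F_{\mathrm{AOT}} + w_{\mathrm{Res}} \odot F_{\mathrm{Res}}$$
[0066] where $F_{\mathrm{fuse}}$ denotes the fused feature, $F_{\mathrm{AOT}}$ the multi-scale contextual features output by the AOT submodule, $F_{\mathrm{Res}}$ the attention-enhanced features output by the Restormer submodule, and $w_{\mathrm{AOT}}$ and $w_{\mathrm{Res}}$ the corresponding control weights, which satisfy:
$$[w_{\mathrm{AOT}}, w_{\mathrm{Res}}] = \mathrm{Softmax}\big(\mathcal{G}(F_{\mathrm{AOT}}, F_{\mathrm{Res}}, S, M)\big), \qquad w_{\mathrm{AOT}} + w_{\mathrm{Res}} = 1$$
[0067] where $\mathcal{G}(\cdot)$ is the control network, $S$ denotes the structure-aware features, and $M$ denotes the mask-guidance features.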
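As a worked illustration of the complementary constraint, consider a single spatial position. The numbers are hypothetical: the control network is assumed to output the scores 2 for the AOT branch and 0 for the Restormer branch, and the two branch features at that position are taken to be 1.0 and 3.0. Writing $w_{\mathrm{AOT}}$ and $w_{\mathrm{Res}}$ for the two control weights:

```latex
\[
w_{\mathrm{AOT}} = \frac{e^{2}}{e^{2} + e^{0}} \approx 0.881, \qquad
w_{\mathrm{Res}} = \frac{e^{0}}{e^{2} + e^{0}} \approx 0.119, \qquad
w_{\mathrm{AOT}} + w_{\mathrm{Res}} = 1
\]
\[
F_{\mathrm{fuse}} = 0.881 \cdot 1.0 + 0.119 \cdot 3.0 \approx 1.238
\]
```

The fused value lies close to the AOT feature because its weight dominates; any increase in one weight is matched by an equal decrease in the other.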
[0068] As shown in Figure 5, the discriminator employs a U-Net structure to accurately determine the differences between generated and real images. The U-Net structure is particularly suitable for image discrimination tasks requiring pixel-level resolution. Its encoder-decoder structure, combined with skip connections, helps preserve multi-level image feature information, thereby improving discrimination accuracy in image restoration or generation tasks. It includes an encoding path (encoder) and a decoding path (decoder). The encoding path contains multiple downsampling blocks, each employing a convolutional layer with spectral normalization and a ReLU activation function. The specific steps for the encoder are as follows: 1. The discriminator's encoder receives the input image and passes it to the convolutional layers. Each convolutional layer applies a different filter to extract local features. The convolution operation typically uses a 3×3 kernel with appropriate padding so that the spatial dimensions of the output remain the same as those of the input.
[0069] 2. After each convolutional layer, a pooling layer is applied to downsample the image. Pooling reduces spatial dimensions by taking the maximum value in a region, lowering the image resolution while retaining the most important information. Each downsampling layer reduces the image resolution by half.
[0070] Each layer in the encoder extracts features at different levels. The initial convolutional layers focus on low-level features of the image, while subsequent convolutional layers extract higher-level semantic information. As the network deepens, the features extracted by the convolutional layers become more abstract and higher-level.
[0071] The decoder portion of the U-Net architecture progressively restores resolution through deconvolutional layers and concatenates these features with the corresponding features from the encoder, enhancing both local and global discriminative capabilities. The decoder gradually restores the spatial resolution of the image through deconvolutional layers; each deconvolutional operation attempts to recover detailed features in the image while increasing the spatial size of the feature maps. A key feature of the U-Net architecture is skip connections. After each encoder layer, the output feature map is saved and concatenated in the corresponding layer of the decoder.
[0072] The decoding path contains multiple upsampling blocks, which gradually restore the feature resolution through transposed convolution.
[0073] The encoder extracts features through three levels of downsampling convolution. The decoder gradually restores the image resolution through upsampling convolution, and finally outputs the restored result.
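As a rough illustration of the three-level encoder and decoder, the sketch below traces the spatial size of the feature maps. It assumes that every downsampling level halves the resolution and every upsampling level doubles it; the text does not state the stride, so this factor of 2 is an assumption.

```python
def encoder_decoder_sizes(size, levels=3):
    """Trace feature-map sizes through `levels` halvings and `levels` doublings."""
    encoder = [size]
    for _ in range(levels):
        encoder.append(encoder[-1] // 2)   # assumed stride-2 downsampling
    decoder = [encoder[-1]]
    for _ in range(levels):
        decoder.append(decoder[-1] * 2)    # assumed 2x upsampling
    return encoder, decoder
```

Under this assumption a 256-pixel input reaches 32 pixels at the bottleneck and is restored to 256 pixels at the output, provided the input size is divisible by 8.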
[0074] The fused features are progressively upsampled using a decoder to reconstruct the image, and then adversarial training is performed using a U-Net discriminator. The specific steps are as follows: 1. The discriminator's encoder receives the input image and passes it to convolutional layers. Each convolutional layer applies a different filter to extract local features. The convolution operation typically involves a 3×3 kernel and appropriate padding so that the spatial dimensions of the output remain the same as the input.
[0075] 2. After each convolutional layer, a pooling layer is applied to downsample the image. Pooling reduces spatial dimensions by taking the maximum value in a region, lowering the image resolution while preserving the most important information. Each downsampling layer reduces the image resolution by half.
[0076] The decoder part of the U-Net structure gradually restores the resolution through deconvolution layers and concatenates the result with the corresponding features of the encoder to enhance local and global discrimination capabilities.
[0077] During the training process, multiple loss functions are optimized in parallel and cooperate with each other. After continuous adversarial training between the generator and the discriminator, the repaired image is output.
[0078] Step 3: The encoder extracts multi-scale features from the dataset, capturing visual features at different levels of abstraction. The AOT-Restormer fusion module processes these features, fusing local dilated-convolution features with global self-attention features through a structure-aware competitive control mechanism to obtain fused features. The fused features are then progressively upsampled by a decoder to reconstruct the image. Adversarial training is performed using a U-Net discriminator with an end-to-end training strategy. The adversarial loss, feature matching loss, and perceptual loss are jointly optimized to obtain the optimized objective function, yielding a trained gated AOT-Restormer fused GAN network.
[0079] The model optimization employs a composite loss function, combining adversarial loss, contrastive loss, and reconstruction loss for joint training to ensure visual plausibility while maintaining detail accuracy. The final output is a restored image with high-quality visual effects.
[0080] A loss function is constructed by combining contrastive loss and adversarial loss. The contrastive loss extracts multi-layer features through a discriminator and calculates the similarity between the repaired image and the real image at the texture and semantic levels.
[0081] Adversarial loss enhances the realism of the restored region through a global discriminator and a pixel-level discriminator. Additionally, L1 reconstruction loss constrains the pixel-level consistency between the restored image and the real image.
[0082] The loss function is constructed by combining contrastive loss and adversarial loss, as follows: The contrastive loss extracts multi-layer features through a discriminator and calculates the similarity between the repaired image and the real image at the texture and semantic levels.
[0083] The adversarial loss improves the realism of the repaired area through a global discriminator and a pixel-level discriminator.
[0084] The L1 reconstruction loss constrains pixel-level consistency between the repaired image and the real image.
[0085] The contrast loss employs texture contrast loss and semantic contrast loss, including: Using texture contrast loss, the L1 distance between the repaired feature and the real feature is calculated in the intermediate feature layer of the discriminator, and compared with the distance of the damaged feature. The formula is as follows:
[0086] where the terms denote, in order: the texture contrast loss; the total number of discriminator feature layers used to calculate the texture contrast loss; the feature map of the i-th layer; the feature map of the real image in the i-th layer; the feature map of the damaged image in the i-th layer; and the smoothing coefficient.
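Since the formula itself is not reproduced in the text, the following is only a sketch of one plausible form of the texture contrast loss. It assumes a ratio per layer: the L1 distance between repaired and real features divided by the L1 distance between repaired and damaged features, with the smoothing coefficient `eps` added to the denominator, summed over the layers. The ratio form, the choice of denominator, and the parameter names are all assumptions.

```python
def l1_distance(a, b):
    """Sum of absolute element-wise differences between two flattened feature maps."""
    return sum(abs(x - y) for x, y in zip(a, b))


def texture_contrast_loss(feats_repaired, feats_real, feats_damaged, eps=1e-6):
    """Assumed ratio-form contrast loss over N discriminator feature layers.

    Each argument is a list with one flattened feature map per layer.
    The loss is small when the repaired features are close to the real ones
    and far from the damaged ones.
    """
    total = 0.0
    for f_rep, f_real, f_dam in zip(feats_repaired, feats_real, feats_damaged):
        d_positive = l1_distance(f_rep, f_real)   # pull toward the real image
        d_negative = l1_distance(f_rep, f_dam)    # push away from the damaged image
        total += d_positive / (d_negative + eps)  # eps is the smoothing coefficient
    return total
```

The smoothing coefficient keeps the ratio finite when the repaired and damaged features coincide at some layer.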
[0087] Using semantic contrast loss, similarity is calculated in the final feature layer of the discriminator through normalization and cross-entropy loss, as shown in the formula:
[0088] where the terms denote, in order: the semantic contrast loss; the repaired image features output by the final feature layer of the discriminator; and the real image features output by the final feature layer of the discriminator. MLP denotes a dimensionality reduction network, and CE denotes the cross-entropy loss function.
[0089] The adversarial loss employs both global adversarial loss and pixel-level adversarial loss, including: Using global adversarial loss, the discriminator outputs global true / false labels:
[0090] where the terms denote, in order: the global adversarial loss; the generator; the discriminator; the expectation over the distribution of real image data; the expectation over the distribution of generated image data; the discriminator's judgment on the real image; and the discriminator's judgment on the generated image.
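The terms listed above (expectations over real and generated data, and discriminator judgments on each) match the classic minimax GAN objective. The sketch below assumes that form, with the discriminator outputting probabilities in (0, 1); the patent's exact formula is not reproduced in the text, so other variants such as a hinge loss cannot be ruled out.

```python
import math


def global_adversarial_loss(d_real, d_fake, eps=1e-12):
    """Assumed minimax form: E[log D(x)] + E[log(1 - D(G(.)))].

    d_real: discriminator probabilities for a batch of real images.
    d_fake: discriminator probabilities for a batch of generated images.
    The discriminator tries to maximize this value; the generator tries to minimize it.
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

When the discriminator cannot tell the two sets apart and outputs 0.5 everywhere, the value is 2·log(0.5), roughly -1.386.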
[0091] Using pixel-level adversarial loss, training follows a segmentation-and-confusion adversarial scheme, and the loss is calculated separately in the masked and unmasked regions. The L1 reconstruction loss constrains pixel-level consistency between the repaired image and the ground-truth image.
[0092] where the terms denote, in order: the pixel-level adversarial loss; the repaired image; and the real image.
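The region-wise bookkeeping can be sketched for the L1 reconstruction term, which is the simplest of the pixel-level quantities mentioned above. The sketch assumes a binary mask in which 1 marks a missing pixel; this convention and the function name are assumptions, and the adversarial part of the pixel-level loss is not modeled here.

```python
def region_l1(repaired, real, mask):
    """Mean absolute error computed separately inside and outside the mask.

    mask[i][j] == 1 marks a missing (to-be-repaired) pixel (assumed convention).
    Returns (masked_region_error, unmasked_region_error).
    """
    hole_sum = valid_sum = 0.0
    hole_count = valid_count = 0
    for rep_row, real_row, mask_row in zip(repaired, real, mask):
        for rep, target, m in zip(rep_row, real_row, mask_row):
            err = abs(rep - target)
            if m:
                hole_sum += err
                hole_count += 1
            else:
                valid_sum += err
                valid_count += 1
    hole = hole_sum / hole_count if hole_count else 0.0
    valid = valid_sum / valid_count if valid_count else 0.0
    return hole, valid
```

Keeping the two errors separate makes it possible to weight the masked region differently from the unmasked region, and to verify that undamaged pixels are left unchanged.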
[0093] The weighting coefficients for each factor, based on mask region complexity and repair confidence, are as follows:
[0094]
[0095]
[0096] where the terms denote, in order: the initial baseline weights; the mask area ratio; the structural complexity; the average structural complexity of the training set; and the current repair confidence level.
[0097] The model is trained with multiple loss functions that are jointly optimized. The overall optimization objective is a weighted combination of the following loss functions:
[0098] where the terms denote the total loss and the weighting coefficients for each loss term.
[0099] The objective function is then optimized as follows:
[0100] where the terms denote, in order: the total loss; the initial baseline weights for each loss term; the mask area ratio; the structural complexity; the average structural complexity of the training set; the current repair confidence level; the global adversarial loss; the weighting coefficient of the texture contrast loss; the weighting coefficient of the semantic contrast loss; the texture contrast loss; the semantic contrast loss; and the pixel-level adversarial loss.
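A minimal sketch of how such a weighted objective could be assembled is given below. It follows claim 7 in obtaining the texture and semantic contrast weights by Softmax normalization of the texture and semantic separability scores. The weighted-sum form, the parameter names, and the fixed default weights are assumptions; the dynamic adjustment by mask area ratio, structural complexity, and repair confidence is not modeled, because its formula is not reproduced in the text.

```python
import math


def contrast_weights(sep_texture, sep_semantic):
    """Softmax over the two separability scores; the weights sum to 1."""
    m = max(sep_texture, sep_semantic)     # subtract the max for numerical stability
    e_tex = math.exp(sep_texture - m)
    e_sem = math.exp(sep_semantic - m)
    total = e_tex + e_sem
    return e_tex / total, e_sem / total


def total_loss(l_adv_global, l_texture, l_semantic, l_pixel,
               sep_texture, sep_semantic,
               w_adv=1.0, w_contrast=1.0, w_pixel=1.0):
    """Assumed weighted sum of the loss terms (weights here are placeholders)."""
    w_tex, w_sem = contrast_weights(sep_texture, sep_semantic)
    l_contrast = w_tex * l_texture + w_sem * l_semantic
    return w_adv * l_adv_global + w_contrast * l_contrast + w_pixel * l_pixel
```

With equal separability scores the two contrast terms are averaged; a larger score for one term shifts the contrast weight toward it.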
[0101] A multi-scale feature interaction mechanism is employed, comprising the following steps: Input features are transformed into query Q, key K, and value V through three 1×1 convolutions, with K and V downsampled by 2×2 max pooling. The similarity matrix between Q and the transpose of K is then calculated and normalized with softmax to obtain attention weights, which are multiplied by the downsampled V and upsampled to the original resolution. Finally, the attention features are fused with the original input features through a learnable parameter γ, which is initialized to 0 and adaptively adjusted during training to enhance features while maintaining spatial structure consistency.
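The steps above can be sketched in plain Python on a small H×W×C feature map stored as nested lists. The weight matrices `w_q`, `w_k`, and `w_v` are hypothetical stand-ins for the three 1×1 convolutions, and H and W are assumed to be even. Because Q is kept at full resolution in this sketch, the attended output already has the input's spatial size, so a separate upsampling step is not modeled; that is an assumption of the sketch, not a statement about the patented implementation.

```python
import math


def conv1x1(x, weight):
    """1x1 convolution as a per-pixel linear map. x: H x W x C_in, weight: C_out x C_in."""
    return [[[sum(w * c for w, c in zip(w_row, px)) for w_row in weight]
             for px in row] for row in x]


def max_pool_2x2_channels(x):
    """Channel-wise 2x2 max pooling with stride 2 on an H x W x C map (H, W even)."""
    h, w, ch = len(x), len(x[0]), len(x[0][0])
    return [[[max(x[i][j][c], x[i][j + 1][c], x[i + 1][j][c], x[i + 1][j + 1][c])
              for c in range(ch)]
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_block(x, w_q, w_k, w_v, gamma=0.0):
    """Attention with pooled K and V, fused with the input as gamma * attention + x."""
    h, w = len(x), len(x[0])
    q = [px for row in conv1x1(x, w_q) for px in row]                         # N x Cq
    k = [px for row in max_pool_2x2_channels(conv1x1(x, w_k)) for px in row]  # M x Cq
    v = [px for row in max_pool_2x2_channels(conv1x1(x, w_v)) for px in row]  # M x C
    attended = []
    for q_i in q:
        scores = [sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]   # row of Q K^T
        attn = softmax(scores)
        attended.append([sum(a * v_j[c] for a, v_j in zip(attn, v))
                         for c in range(len(v[0]))])
    return [[[gamma * attended[i * w + j][c] + x[i][j][c]
              for c in range(len(x[i][j]))]
             for j in range(w)] for i in range(h)]
```

With γ equal to 0, as at initialization, the block returns its input unchanged, so the attention branch is phased in gradually as γ is learned.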
[0102] Step 4: Input the image to be repaired and the corresponding mask image into the trained gated AOT-Restormer fused GAN network for repair, and finally output the repair result.
[0103] In another embodiment, an image inpainting system based on gated AOT-Restormer fusion is provided to implement the image inpainting method based on gated AOT-Restormer fusion. The system comprises an input unit, a GAN network unit based on gated AOT-Restormer fusion, and an output unit, wherein: The input unit is used to input a collected image dataset containing diverse scenes and objects, obtain the original images and the masked images, and construct a paired dataset. It is also used to input the image to be repaired and the corresponding mask image.
[0104] The gated AOT-Restormer fusion GAN network unit is used to construct the gated AOT-Restormer fusion GAN network, which includes a generator and a discriminator. A structure-aware competitive gating fusion mechanism from the AOT-Restormer fusion module is introduced into the generator. The discriminator is a U-Net discriminator, which includes an encoder and a decoder. A resolution-adaptive attention mechanism and multi-scale feature fusion are introduced into the U-Net discriminator. Multi-scale features are extracted from the dataset using the encoder, and the features are processed using the AOT-Restormer fusion module. A structure-aware competitive control mechanism fuses local dilated-convolution features with global self-attention features to obtain fused features. The fused features are then progressively upsampled by the decoder to reconstruct the image. Adversarial training is performed using the U-Net discriminator with an end-to-end training strategy. The adversarial loss, feature matching loss, and perceptual loss are jointly optimized to obtain the trained gated AOT-Restormer fusion GAN network. The image to be repaired and the corresponding mask image are input into the trained gated AOT-Restormer fusion GAN network for repair, resulting in a repaired image.
[0105] The output unit is used to output the repaired image.
[0106] The system also includes: Data preprocessing module: Before training begins, input images and mask information are combined with real and damaged images to construct a training dataset. Image data undergoes standardization and normalization to ensure it meets model input requirements. Simultaneously, masks for damaged areas are generated and aligned with image features to ensure the model focuses on repairing those areas. Furthermore, input images are processed using data augmentation techniques to enhance the diversity of training data and improve model robustness.
[0107] Training Module: Through multi-level feature extraction combined with a gated AOT-Restormer fusion module, the model generates repaired images. The training process incorporates contrastive and adversarial losses. The contrastive loss calculates the similarity between the repaired image and the real image at the texture and semantic levels using multi-level features extracted by the discriminator; the adversarial loss enhances the realism of the repaired region through a global discriminator and a pixel-level discriminator. Through a multi-modal attention mechanism, the model can identify and focus on the importance of the repaired region in the image, generating more accurate repair results.
[0108] Image inpainting module: Based on a gated AOT-Restormer fusion model, this module progressively restores image details and integrates multi-scale contextual information, dynamically adjusting feature weights during the inpainting process. The AOT module extracts multi-scale features and enlarges the receptive field through dilated convolution; the Restormer module enhances the modeling of global features through a multi-head self-attention mechanism. During inpainting, the model adaptively fuses features from the AOT and Restormer modules based on the saliency of the inpainting region, ensuring inpainting quality.
[0109] Real-time feedback module: The system optimizes through reinforcement learning based on user interaction feedback, adjusting feature fusion and attention allocation during the repair process based on user ratings. The reinforcement learning part optimizes the model's weights using policy gradient methods, improving repair accuracy through continuous interaction and better aligning with user preferences. Users can adjust the repair effect in real time, and the system automatically learns and optimizes the handling of repaired and non-repaired areas.
[0110] This invention significantly improves the detail restoration of the repaired area while maintaining the naturalness of the overall image structure, and effectively avoids modification of undamaged areas. Through a structure-aware competitive gating control mechanism, this invention can maintain global consistency in structurally continuous regions and enhance detail restoration capabilities in regions with complex textures, thus significantly outperforming existing simple gating or attention fusion methods.
[0111] This invention employs an image restoration method based on gated AOT-Restormer fusion, combining multi-layer feature extraction and attention mechanisms. This enables precise restoration of damaged areas while preserving the overall structure and details of the image. This invention has wide applications in image restoration, image recovery, noise reduction, and damaged image repair, providing users with a high-quality, intelligent restoration experience.
[0112] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. An image inpainting method based on gated AOT-Restormer fusion, characterized in that it includes the following steps: Step 1: Collect an image dataset containing diverse scenes and objects, obtain the original images and the images after masking, and construct paired datasets; Step 2: Construct a gated AOT-Restormer fusion GAN network, which includes a generator and a discriminator; introduce a structure-aware competitive gating fusion mechanism of the AOT-Restormer fusion module into the generator; the discriminator adopts a U-Net discriminator, which includes an encoder and a decoder, and a resolution-adaptive attention mechanism and multi-scale feature fusion are introduced into the U-Net discriminator; Step 3: Extract multi-scale features from the dataset using the encoder, process the features using the AOT-Restormer fusion module, and fuse local dilated-convolution features and global self-attention features through a structure-aware competitive control mechanism to obtain fused features; progressively upsample the fused features through the decoder and reconstruct the image, perform adversarial training through the U-Net discriminator, adopt an end-to-end training strategy, jointly optimize the adversarial loss, feature matching loss and perceptual loss to obtain the optimized objective function, and thus obtain the trained gated AOT-Restormer fused GAN network; Step 4: Input the image to be repaired and the corresponding mask image into the trained gated AOT-Restormer fused GAN network for repair, and finally output the repair result.
2. The image inpainting method based on gated AOT-Restormer fusion according to claim 1, characterized in that: The AOT-Restormer fusion module extracts local and global features through dilated convolution and self-attention mechanisms, respectively, and adaptively fuses the local and global features using a structure-aware competitive control mechanism to obtain fused features. The AOT-Restormer fusion module includes an AOT submodule and a Restormer submodule. The AOT submodule includes a multi-scale feature extraction unit and a feature fusion unit. The multi-scale feature extraction unit is used to capture local details and global contextual information features of the image. The feature fusion unit integrates the multi-scale features extracted from each branch through channel concatenation and convolution to generate contextual features with rich semantic representation.
3. The image inpainting method based on gated AOT-Restormer fusion according to claim 2, characterized in that: The AOT-Restormer fusion module employs a structure-aware competitive gating control mechanism to adaptively fuse the output features of the AOT submodule and the Restormer submodule, specifically including: The structural feature map representing edge continuity and texture complexity is extracted from the input features by the structure-aware branch, and combined with the mask-guided features, it is input into the control network along with the multi-scale context features output by the AOT submodule and the global attention features output by the Restormer submodule. The control network generates competitive control weights corresponding to the AOT branch and the Restormer branch respectively through multi-layer convolution and Softmax normalization, so that the two form a complementary constraint relationship in the same spatial location. Based on the competitive adjustment weights, the output features of the AOT submodule and the output features of the Restormer submodule are fused element-wise to generate a fused feature that takes into account both local texture consistency and global structural coherence.
4. The image inpainting method based on gated AOT-Restormer fusion according to claim 3, characterized in that: The fused feature satisfies: where the terms denote, in order: the fused features; the multi-scale contextual features output by the AOT module; the attention-enhanced features output by the Restormer module; and the control weights corresponding to the AOT branch and the Restormer branch, and the following conditions are met: where (·) denotes the control network, whose inputs are the structure-aware features and the mask-guided features; The encoder extracts features through three levels of downsampling convolution; the decoder gradually restores the image resolution through upsampling convolution, and finally outputs the repair result.
5. The image inpainting method based on gated AOT-Restormer fusion according to claim 4, characterized in that: The objective function is optimized as follows: where the terms denote, in order: the total loss; the initial baseline weights for each loss term; the mask area ratio; the structural complexity; the average structural complexity of the training set; the current repair confidence level; the global adversarial loss; the weighting coefficient of the texture contrast loss; the weighting coefficient of the semantic contrast loss; the texture contrast loss; the semantic contrast loss; and the pixel-level adversarial loss.
6. The image inpainting method based on gated AOT-Restormer fusion according to claim 5, characterized in that: The loss function is constructed by combining contrastive loss and adversarial loss, including: The contrastive loss extracts multi-layer features through a discriminator and calculates the similarity between the repaired result and the real image at the texture and semantic levels; The adversarial loss improves the realism of the repaired area through a global discriminator and a pixel-level discriminator; The L1 reconstruction loss constrains pixel-level consistency between the repaired image and the real image.
7. The image inpainting method based on gated AOT-Restormer fusion according to claim 6, characterized in that: The contrast loss employs texture contrast loss and semantic contrast loss, including: Using texture contrast loss, the L1 distance between the repaired feature and the real feature is calculated in the intermediate feature layer of the discriminator, and compared with the distance of the damaged feature. The formula is as follows: where the terms denote, in order: the texture contrast loss; the total number of discriminator feature layers used to calculate the texture contrast loss; the feature map of the i-th layer; the feature map of the real image in the i-th layer; the feature map of the damaged image in the i-th layer; and the smoothing coefficient; Using semantic contrast loss, similarity is calculated in the final feature layer of the discriminator through normalization and cross-entropy loss, as shown in the formula: where the terms denote, in order: the semantic contrast loss; the repaired image features output by the final feature layer of the discriminator; and the real image features output by the final feature layer of the discriminator; MLP is a dimensionality reduction network, and CE is the cross-entropy loss function. Texture separability and semantic separability are used, with the formulas as follows: where x is the original real image, and the remaining terms denote, in order: the corrupted image; the output restored image; the U-Net discriminator; the feature map output by the i-th intermediate feature layer of the discriminator, with N being the number of discriminator feature layers used to calculate texture contrast; and the L1 norm, which is the sum of the absolute values of the elements of the feature map; and where the further terms denote, in order: the embedding vector of the real image in the discriminator's semantic space; the embedding vector of the repaired image in the discriminator's semantic space; and the cosine similarity; Then, the weight coefficient of the texture contrast loss and the weight coefficient of the semantic contrast loss are obtained using Softmax normalization: where the two inputs are the texture separability and the semantic separability, respectively. The total contrast loss is: where the terms denote, in order: the total contrast loss; the weight coefficient of the texture contrast loss; and the weight coefficient of the semantic contrast loss. The adversarial loss employs both global adversarial loss and pixel-level adversarial loss, including: Using global adversarial loss, the discriminator outputs global true / false labels: where the terms denote, in order: the global adversarial loss; the generator; the discriminator; the expectation over the distribution of real image data; the expectation over the distribution of generated image data; the discriminator's judgment on the real image; and the discriminator's judgment on the generated image; Using pixel-level adversarial loss, training follows a segmentation-and-confusion adversarial scheme, and the loss is calculated separately in the masked and unmasked regions. The L1 reconstruction loss constrains pixel-level consistency between the repaired image and the ground-truth image: where the terms denote, in order: the pixel-level adversarial loss; the repaired image; and the real image; The weighting coefficients for each factor, based on mask region complexity and repair confidence, are as follows: where the terms denote, in order: the initial baseline weights; the mask area ratio; the structural complexity; the average structural complexity of the training set; and the current repair confidence level; The overall optimization objective function is composed of the following weighted loss functions: where the terms denote the total loss and the weighting coefficients for each loss term.
8. The image inpainting method based on gated AOT-Restormer fusion according to claim 7, characterized in that: The methods for constructing paired datasets in step 1 include: By combining a random mask generation algorithm with a semantically guided mask generation strategy, a diverse mask library containing different shapes, sizes, and semantic regions is constructed. At the same time, based on image content complexity analysis and regional saliency detection, the distribution density and coverage of the masks are adaptively adjusted to form training sample pairs with multi-level repair difficulty.
9. The image inpainting method based on gated AOT-Restormer fusion according to claim 8, characterized in that: A multi-scale feature interaction mechanism is employed, and the process includes: The input features are transformed into query Q, key K, and value V through three 1×1 convolutions, respectively, where K and V are downsampled by 2×2 max pooling. Then, the similarity matrix between Q and the transpose of K is calculated and normalized by softmax to obtain the attention weights, which are then multiplied by the downsampled V and upsampled to the original resolution. Finally, the attention features are fused with the original input features through a learnable parameter γ, which is initialized to 0 and adaptively adjusted during training to achieve feature enhancement while maintaining spatial structure consistency.
10. An image inpainting system based on gated AOT-Restormer fusion, characterized in that it implements the image inpainting method based on gated AOT-Restormer fusion as described in claim 1 and includes an input unit, a GAN network unit based on gated AOT-Restormer fusion, and an output unit, wherein: The input unit is used to input a collected image dataset containing diverse scenes and objects, obtain the original images and masked images, and construct a paired dataset; it is also used to input the image to be repaired and the corresponding mask image. The gated AOT-Restormer fusion GAN network unit is used to construct the gated AOT-Restormer fusion GAN network, which includes a generator and a discriminator. A structure-aware competitive gating fusion mechanism from the AOT-Restormer fusion module is introduced into the generator. The discriminator is a U-Net discriminator, which includes an encoder and a decoder. A resolution-adaptive attention mechanism and multi-scale feature fusion are introduced into the U-Net discriminator. Multi-scale features are extracted from the dataset using the encoder, and the features are processed by the AOT-Restormer fusion module, which fuses local dilated-convolution features and global self-attention features through a structure-aware competitive control mechanism to obtain fused features. These fused features are then progressively upsampled by the decoder to reconstruct the image. Adversarial training is performed using the U-Net discriminator, employing an end-to-end training strategy to jointly optimize adversarial loss, feature matching loss, and perceptual loss, resulting in a trained gated AOT-Restormer fusion GAN network. The image to be repaired and its corresponding mask image are then input into the trained gated AOT-Restormer fusion GAN network for repair, yielding the repaired image. The output unit is used to output the repaired image.