Image inpainting method and device fusing local self-adaptation and global dynamic receptive field

CN122265099A · Pending · Publication Date: 2026-06-23 · INST OF AUTOMATION CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: INST OF AUTOMATION CHINESE ACAD OF SCI
Filing Date: 2026-02-25
Publication Date: 2026-06-23

Abstract

This invention provides an image inpainting method and apparatus that integrates local adaptation and global dynamic receptive field, relating to the field of image processing. The method includes: concatenating the image to be inpainted with its binary mask to obtain concatenated data; performing multi-scale downsampling on the concatenated data based on an encoder to obtain a multi-scale encoded tensor; during the downsampling process, performing local feature extraction and global semantic modeling on the intermediate encoded tensor of the multi-scale encoded tensor based on a hybrid receptive field module; and upsampling and reconstructing the multi-scale encoded tensor based on a decoder to obtain a reconstructed image of the image to be inpainted. Through the synergy of local detail capture and global semantic modeling, the method dynamically adapts to the morphology of missing regions when extracting local information for detail restoration, while effectively modeling long-distance dependencies to ensure the coherence of the global structure, thereby significantly improving the accuracy of image inpainting.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, and in particular to an image inpainting method and apparatus that integrates local adaptation and a global dynamic receptive field.

Background Technology

[0002] Image inpainting techniques aim to fill in missing areas based on known information about an image in order to restore its visual integrity.

[0003] Existing image inpainting architectures primarily rely on deep learning-based generative models, commonly including single-stage and two-stage (coarse-to-fine) generation methods. When processing local features and global semantics, these architectures often lack deep feature interaction and fusion, resulting in inconsistent texture details and overall structure in the generated inpainted regions. This easily leads to artifacts or structural distortions, ultimately resulting in low image inpainting accuracy.

Summary of the Invention

[0004] This invention provides an image inpainting method and apparatus that integrates local adaptation and a global dynamic receptive field, in order to improve the accuracy of image inpainting.

[0005] This invention provides an image inpainting method that integrates local adaptation and a global dynamic receptive field, comprising the following steps:
The image to be repaired is concatenated with the binary mask of the image to be repaired to obtain concatenated data;
The concatenated data is downsampled at multiple scales based on the encoder to obtain the multi-scale encoded tensor of the concatenated data. During the downsampling process of the encoder, local feature extraction and global semantic modeling are performed on the intermediate encoded tensor in the multi-scale encoded tensor based on the hybrid receptive field module. The hybrid receptive field module includes a cascaded adaptive large kernel convolutional unit and a multi-head attention mechanism unit. The adaptive large kernel convolutional unit is used to extract local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution, so as to adapt to the missing regions indicated by the binary mask. The multi-head attention mechanism unit is used to perform long-distance dependency modeling, based on a self-attention mechanism, on the local adaptive neighborhood features processed by the adaptive large kernel convolutional unit, so as to generate global context features;
The multi-scale encoded tensor is upsampled and reconstructed based on the decoder to obtain the reconstructed image of the image to be repaired.

[0006] According to the image inpainting method integrating local adaptation and global dynamic receptive field provided by the present invention, the step of upsampling and reconstructing the multi-scale coded tensor based on the decoder to obtain the reconstructed image of the image to be inpainted further includes: During the upsampling process of the decoder, the encoding tensor and decoding tensor of the same scale are fused based on the supervision and guidance module, and the image is reconstructed based on the fused tensor to obtain the reconstructed image of the image to be repaired. The supervision and guidance module is used to predict fusion weights based on the image to be repaired as a supervision signal, and to fuse the encoded tensor and the decoded tensor based on the fusion weights.

[0007] According to the image inpainting method integrating local adaptation and global dynamic receptive field provided by the present invention, the adaptive large kernel convolutional unit includes a large kernel convolutional module and a dynamic convolutional module arranged in parallel. The extraction of local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution includes:
The intermediate encoded tensor is convolved with static large kernels of different sizes in the large kernel convolution module to obtain static spatial features;
The intermediate encoded tensor is convolved based on the deformable convolution in the dynamic convolution module to obtain dynamic shape adaptation features;
The static spatial features and the dynamic shape adaptation features are concatenated and fused to obtain the fused features;
The fused features are weighted and combined with the intermediate encoded tensor through a gating mechanism to obtain the local adaptive neighborhood features.

[0008] According to the image inpainting method that integrates local adaptation and global dynamic receptive field provided by the present invention, the large kernel convolution module is constructed based on spatial small convolution, spatial large convolution and channel convolution.

[0009] According to the image inpainting method integrating local adaptation and global dynamic receptive field provided by the present invention, performing long-distance dependency modeling, based on a self-attention mechanism, on the local adaptive neighborhood features processed by the adaptive large kernel convolutional unit to generate global context features includes:
Learnable position embeddings are added to the local adaptive neighborhood features to obtain an input vector with positional information;
The input vector is subjected to layer normalization, and the query vector, key vector, and value vector of the input vector are generated;
The query vector, the key vector, and the value vector are split into multiple parallel attention heads;
The scaled dot-product attention is calculated in each attention head to obtain the attention output of each attention head;
The outputs of all attention heads are concatenated, and the concatenated result is nonlinearly transformed through a feedforward neural network layer to generate the global context features.

[0010] According to the image inpainting method that integrates local adaptation and global dynamic receptive field provided by the present invention, the encoder and the decoder are obtained by joint training based on the target loss function;
The target loss function is constructed based on reconstruction loss, perceptual loss, style loss, and total variation loss;
The reconstruction loss is used to characterize the difference between the reconstructed image and the original image at the pixel level;
The perceptual loss is used to characterize the difference in deep semantic features between the reconstructed image and the original image;
The style loss is used to characterize the consistency of texture distribution between the reconstructed image and the original image;
The total variation loss is used to characterize the spatial smoothness of the reconstructed image.
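As a rough illustration of two of the four loss terms, the sketch below computes a pixel-level reconstruction loss and a total variation loss on small grayscale images stored as nested lists, then combines them with a weighted sum. Several things here are assumptions rather than details from the patent: the reconstruction loss is taken to be the mean absolute error, the total variation is the anisotropic form, and the function names and weights `w_rec` and `w_tv` are hypothetical. The perceptual and style terms need a pretrained feature extractor and are left out.

```python
def l1_loss(pred, target):
    """Mean absolute pixel difference between two equally sized 2-D images."""
    total = sum(abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return total / (len(pred) * len(pred[0]))


def tv_loss(img):
    """Anisotropic total variation: summed neighbour differences per pixel."""
    h, w = len(img), len(img[0])
    vertical = sum(abs(img[y + 1][x] - img[y][x]) for y in range(h - 1) for x in range(w))
    horizontal = sum(abs(img[y][x + 1] - img[y][x]) for y in range(h) for x in range(w - 1))
    return (vertical + horizontal) / (h * w)


def target_loss(pred, target, w_rec=1.0, w_tv=0.1):
    """Weighted sum of the two terms; the weights are placeholders (assumption)."""
    return w_rec * l1_loss(pred, target) + w_tv * tv_loss(pred)


original = [[0.0, 0.0], [1.0, 1.0]]
restored = [[0.0, 0.5], [1.0, 1.0]]
print(l1_loss(restored, original))  # 0.125
print(tv_loss(restored))            # 0.5
```

The total variation term only looks at the reconstructed image, which is why it acts as a smoothness prior rather than a fidelity measure.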

[0011] The present invention also provides an image inpainting device that integrates local adaptation and a global dynamic receptive field, comprising the following modules:
The concatenation module is used to concatenate the image to be repaired with the binary mask of the image to be repaired to obtain concatenated data;
The downsampling module is used to perform multi-scale downsampling on the concatenated data based on the encoder to obtain the multi-scale encoded tensor of the concatenated data. During the downsampling process of the encoder, local feature extraction and global semantic modeling are performed on the intermediate encoded tensor in the multi-scale encoded tensor based on the hybrid receptive field module. The hybrid receptive field module includes a cascaded adaptive large kernel convolutional unit and a multi-head attention mechanism unit. The adaptive large kernel convolutional unit is used to extract local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution, so as to adapt to the missing regions indicated by the binary mask. The multi-head attention mechanism unit is used to perform long-distance dependency modeling, based on a self-attention mechanism, on the local adaptive neighborhood features processed by the adaptive large kernel convolutional unit, so as to generate global context features;
The repair module is used to upsample and reconstruct the multi-scale encoded tensor based on the decoder to obtain the reconstructed image of the image to be repaired.

[0012] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the image inpainting method that integrates local adaptation and global dynamic receptive field as described above.

[0013] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the image inpainting method that integrates local adaptation and global dynamic receptive field as described above.

[0014] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the image inpainting method that integrates local adaptation and global dynamic receptive field as described above.

[0015] This invention provides an image inpainting method and apparatus that integrates local adaptation and a global dynamic receptive field. By introducing a hybrid receptive field module during encoder downsampling, and cascading an adaptive large-kernel convolutional unit and a multi-head attention mechanism unit, comprehensive capture of image features is achieved. The adaptive large-kernel convolutional unit combines static convolution and dynamically deformable convolution, flexibly adjusting the receptive field according to the shape of the missing region indicated by the binary mask, accurately extracting local adaptive neighborhood features, and effectively preserving edge and texture details. Simultaneously, the multi-head attention mechanism unit performs long-distance dependency modeling on these features, supplementing global contextual information and ensuring the semantic coherence of the inpainted content. Through the synergy of local detail capture and global semantic modeling, the morphology of the missing region can be dynamically adapted when extracting local information for detail restoration, while effectively modeling long-distance dependencies to ensure the coherence of the global structure, thereby significantly improving the accuracy of image inpainting.

Brief Description of the Drawings

[0016] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0017] Figure 1 is a flowchart of the image inpainting method that integrates local adaptation and a global dynamic receptive field provided by the present invention.

[0018] Figure 2 is a schematic diagram of the encoder-decoder architecture provided by the present invention.

[0019] Figure 3 is a schematic diagram of the supervision and guidance module provided by the present invention.

[0020] Figure 4 is a schematic diagram of the adaptive large kernel convolutional unit (ALKC) architecture provided by the present invention.

[0021] Figure 5 is a schematic diagram of the structure of the multi-head attention mechanism unit provided by the present invention.

[0022] Figure 6 is a schematic diagram of the image inpainting device that integrates local adaptation and a global dynamic receptive field provided by the present invention.

[0023] Figure 7 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0024] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0025] Deep learning-driven image inpainting methods can be broadly classified into three categories: single-stage generation methods, two-stage generation methods, and progressive methods.

[0026] Single-stage generation methods typically use a single generator network to directly predict the content of missing regions.

[0027] Two-stage generation methods typically employ two generators to achieve progressive completion using either a "coarse-to-fine" or a "structure-texture" pipeline. In a coarse-to-fine framework, the first generator produces an initial result, and the second generator further refines it. Progressive methods usually contain multiple generators, or a single generator with an iterative mechanism, and achieve high-quality filling by gradually shrinking the missing region.

[0028] While the aforementioned methods have made some progress in image inpainting tasks, several key shortcomings remain, making it difficult to meet the high quality requirements of inpainting in complex scenes. Specifically, although the two-stage method introduces local and global receptive fields respectively, its cascaded dual-network architecture results in a lack of deep integration between local detail restoration and global semantic modeling at the feature level. This not only causes computational redundancy but also easily leads to the propagation of errors step by step. On the other hand, the single-stage method based on Transformer can effectively model long-distance dependencies through self-attention mechanisms, but it neglects the fine restoration of local high-frequency details (such as edges and textures) due to over-reliance on the global context, resulting in blurring or structural distortion in local areas of the inpainting results. In addition, existing methods generally adopt fixed receptive fields or single context modeling strategies, making it difficult to dynamically adjust the information aggregation range according to the shape, size, and content complexity of the missing region, and lacking adaptability to diverse missing patterns.

[0029] Improving the accuracy of image restoration is a crucial issue that the industry urgently needs to address.

[0030] To address the shortcomings of existing methods, this invention provides an image inpainting method that integrates local adaptation and a global dynamic receptive field. Figure 1 is a flowchart of this method. As shown in Figure 1, the method includes the following steps:
Step 110: Concatenate the image to be repaired with the binary mask of the image to be repaired to obtain concatenated data;
Step 120: Perform multi-scale downsampling on the concatenated data based on the encoder to obtain the multi-scale encoded tensor of the concatenated data. During the downsampling process of the encoder, local feature extraction and global semantic modeling are performed on the intermediate encoded tensor in the multi-scale encoded tensor based on the hybrid receptive field module. The hybrid receptive field module includes a cascaded adaptive large kernel convolutional unit and a multi-head attention mechanism unit. The adaptive large kernel convolutional unit is used to extract local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution, so as to adapt to the missing regions indicated by the binary mask. The multi-head attention mechanism unit is used to perform long-distance dependency modeling, based on a self-attention mechanism, on the local adaptive neighborhood features processed by the adaptive large kernel convolutional unit, so as to generate global context features;
Step 130: Upsample and reconstruct the multi-scale encoded tensor based on the decoder to obtain the reconstructed image of the image to be repaired.

[0031] The following example, using a computer executing the image inpainting method that integrates local adaptive and global dynamic receptive fields provided by this invention, illustrates the technical solution of this invention in detail.

[0032] In step 110, the image to be repaired needs to be acquired first. After acquiring the image to be repaired, the image to be repaired is concatenated with its corresponding binary mask to obtain concatenated data for subsequent processing.

[0033] The image to be repaired can be any digital image that contains regions that are missing, damaged, or in need of modification. For example, it could be a photo with scratches or stains, an image from which the user intentionally or unintentionally removed certain objects during editing (such as removing passersby from a photo), or an image that needs content filling and repair in digital content creation.

[0034] A binary mask is a two-dimensional matrix or tensor with the same spatial dimensions as the image to be repaired. The purpose of a binary mask is to identify which regions in the image need repair and which are known, intact regions. For example, specific pixel values can be used to distinguish them: a pixel value of 1 in the mask represents a missing pixel in the image to be repaired, while a pixel value of 0 represents a known pixel. Binary masks can be automatically generated using algorithms.

[0035] Concatenating the image to be repaired with its corresponding binary mask joins the two along the channel dimension. For example, if the image to be repaired is an RGB three-channel color image (size H×W×3), and the binary mask is a single-channel image (size H×W×1), the resulting concatenated data is a four-channel tensor (size H×W×4). This integrates the location information of the missing region with the pixel content information of the image, serving as a unified input for subsequent neural network models.
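The channel-wise concatenation can be sketched as follows, using nested lists in place of tensors. The function name `concat_image_and_mask` and the tiny 2×2 example are illustrative assumptions; the mask convention (1 for missing, 0 for known) follows paragraph [0034].

```python
def concat_image_and_mask(image, mask):
    """Append the mask value to each pixel's channels (H x W x 3 -> H x W x 4)."""
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must share the same height and width")
    return [
        [list(pixel) + [mask_value] for pixel, mask_value in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# 2 x 2 RGB image; the mask marks the top-right pixel as missing (1).
image = [[(255, 0, 0), (0, 0, 0)], [(0, 255, 0), (0, 0, 255)]]
mask = [[0, 1], [0, 0]]
data = concat_image_and_mask(image, mask)
print(len(data), len(data[0]), len(data[0][0]))  # 2 2 4
```

Each pixel now carries its RGB values plus one flag saying whether it is known, so the network sees content and location in a single input.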

[0036] In step 120, the concatenated data is downsampled at multiple scales based on the encoder to obtain the multi-scale encoded tensor of the concatenated data.

[0037] An encoder is a deep neural network structure composed of a series of convolutional layers, normalization layers, activation functions, and downsampling layers. Its core function is to continuously extract and abstract higher-dimensional, more semantic features while processing input data layer by layer, and to gradually reduce the spatial resolution of the feature maps through downsampling operations (e.g., using convolutional or pooling layers with a stride greater than 1). A multi-scale encoding tensor refers to the intermediate feature tensor output at different depths (i.e., different resolutions or scales) of the encoder. This multi-scale structure helps the network learn both detailed and global structural information of the image simultaneously.

[0038] It should be noted that during the downsampling process of the encoder, local feature extraction and global semantic modeling are performed on one or more multi-scale coding tensors (i.e., intermediate coding tensors) based on the hybrid receptive field module. This hybrid receptive field module is composed of cascaded adaptive large kernel convolutional units and multi-head attention mechanism units.

[0039] Cascading refers to the process where the intermediate encoded tensor is first processed by an adaptive large kernel convolutional unit, and its output is then fed into a multi-head attention mechanism unit for further processing.

[0040] The core function of the adaptive large-kernel convolutional unit is to extract local adaptive neighborhood features from the intermediate encoded tensor, which are used to collect local contextual information around the missing region for repair purposes. "Adaptive" refers to the ability of the adaptive large-kernel convolutional unit to dynamically adjust the scope and method of its information aggregation based on the different shapes, sizes, and complexities of the missing regions indicated by the binary mask. This is achieved by combining static convolution (e.g., using a fixed-shape but large-sized convolutional kernel) with dynamically deformable convolution. Static convolution is responsible for efficiently capturing regular local patterns, while dynamically deformable convolution learns offsets to make the sampling points of the convolutional kernel irregularly distributed, thereby accurately adapting to irregular missing boundaries and internal structures.
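The difference between static and deformable sampling can be illustrated with a single output position of a 3×3 kernel. In the sketch below, the offsets are supplied by hand; in a deformable convolution they would be predicted by a learned layer, which is not modeled here. The helper names, the zero-padding behavior outside the feature map, and the single-channel setting are all assumptions made for brevity.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at fractional (y, x); zero outside."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value


def deformable_conv_at(feature, kernel, offsets, cy, cx):
    """One output value of a 3 x 3 deformable convolution centred at (cy, cx).

    offsets[i][j] is the (dy, dx) shift added to the regular grid position of
    kernel tap (i, j); all-zero offsets reduce to an ordinary static convolution.
    """
    out = 0.0
    for i in range(3):
        for j in range(3):
            dy, dx = offsets[i][j]
            out += kernel[i][j] * bilinear(feature, cy + i - 1 + dy, cx + j - 1 + dx)
    return out


# Horizontal ramp: the value of each cell equals its column index.
feature = [[float(x) for x in range(4)] for _ in range(4)]
kernel = [[1 / 9] * 3 for _ in range(3)]
zero = [[(0.0, 0.0)] * 3 for _ in range(3)]
shift = [[(0.0, 0.5)] * 3 for _ in range(3)]
static_value = deformable_conv_at(feature, kernel, zero, 1, 1)
shifted_value = deformable_conv_at(feature, kernel, shift, 1, 1)
print(round(static_value, 6), round(shifted_value, 6))
```

Shifting every tap half a pixel to the right moves the averaged value from about 1.0 to about 1.5 on this ramp, which shows how offsets let the same kernel weights read from a differently shaped neighborhood.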

[0041] The multi-head attention mechanism unit is used to model long-distance dependencies in features already rich in local information obtained after processing by adaptive large-kernel convolutional units, thereby generating global contextual features. It is built on a self-attention mechanism, capable of calculating the weights of relationships between any two locations in the feature map, unrestricted by spatial distance. This allows the model to understand and utilize semantically relevant cues from afar to guide the restoration process. For example, when restoring a face, the features of one eye can be used to infer the appropriate style for the other occluded eye. The global contextual features generated in this way ensure that the restored content maintains a high degree of consistency with the overall image background in terms of structure, style, and semantics.
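A minimal sketch of the attention computation follows, treating each spatial position of the feature map as one token. It covers only the head split, the scaled dot-product attention, and the concatenation of head outputs. The learned query, key, and value projections, the position embeddings, the layer normalization, and the feedforward layer are omitted, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head; each argument is a list of vectors."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        output.append([sum(w * v[c] for w, v in zip(weights, values))
                       for c in range(len(values[0]))])
    return output


def multi_head_attention(queries, keys, values, num_heads):
    """Split the channel dimension into heads, attend per head, then concatenate."""
    head_dim = len(queries[0]) // num_heads
    merged = [[] for _ in queries]
    for h in range(num_heads):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        head_out = attention([q[sl] for q in queries],
                             [k[sl] for k in keys],
                             [v[sl] for v in values])
        for row, part in zip(merged, head_out):
            row.extend(part)
    return merged


# Three tokens with four channels each; Q = K = V for simplicity (assumption).
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(tokens, tokens, tokens, num_heads=2)
print(len(out), len(out[0]))  # 3 4
```

Because every token attends to every other token regardless of position, each output row is a weighted mix of all value vectors, which is what allows distant known regions to inform a missing one.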

[0042] In step 130, the multi-scale coded tensor output by the encoder, which has been processed by the hybrid receptive field module, is upsampled and reconstructed based on the decoder to finally obtain the repaired reconstructed image.

[0043] The decoder's structure is typically symmetrical to the encoder's, consisting of a series of upsampling layers (such as transposed convolution or pixel rearrangement, i.e. pixel shuffle), convolutional layers, normalization layers, and activation functions. Its role is to progressively reconstruct, scale by scale, the abstract semantic features extracted by the encoder into an image with high-resolution pixel details. The final output of the upsampling reconstruction process is the reconstructed image, which is the target result of this method. Its size is the same as that of the original image to be repaired, and the missing regions marked by the binary mask have been filled with appropriate content.

[0044] The image inpainting method of the present invention, which integrates local adaptation and a global dynamic receptive field, is evaluated using five metrics: three pixel-level metrics, namely Error, Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM), and two deep-feature metrics, namely FID (Fréchet Inception Distance) and LPIPS (Learned Perceptual Image Patch Similarity). The compared methods are MEDFE (Mutual Encoder-Decoder with Feature Equalization), RFR (Recurrent Feature Reasoning), CTSDG (Conditional Texture and Structure Dual Generation), LGNet (Local-Global Network), MAT (Mask-Aware Transformer), MISF (Multi-level Interactive Siamese Filtering), TSGDAM (Texture and Structure-Guided Dual-Attention Mechanism), and the method of this invention (Ours). Mask denotes the mask ratio. The results are shown in Table 1.

Table 1. Evaluation Results of Indicators

On the Paris dataset, the image inpainting method fusing local adaptation and global dynamic receptive field demonstrated superior inpainting performance on all five metrics across different missing-region proportions, while the quantitative results for MEDFE and CTSDG were relatively poor. In Table 1, MAT attains a slightly lower FID than this method for the masked region. This may be because the explicit mask-aware design in MAT is more effective at filling large missing regions.
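Of the metrics listed, PSNR has a simple closed form: it is ten times the base-10 logarithm of the squared peak value divided by the mean squared error. The sketch below applies the standard definition to grayscale images stored as nested lists. The function name and the 8-bit peak value of 255 are assumptions, and the example images are made up rather than taken from the patent's evaluation.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    count = len(reference) * len(reference[0])
    mse = sum((a - b) ** 2
              for ref_row, res_row in zip(reference, restored)
              for a, b in zip(ref_row, res_row)) / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[100, 100], [100, 100]]
restored = [[100, 110], [100, 100]]
print(round(psnr(reference, restored), 2))  # about 34.15
```

Higher PSNR means a smaller pixel-level error, whereas FID and LPIPS are distances, so lower values are better for those two.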

[0045] This invention provides an image inpainting method that integrates local adaptive and global dynamic receptive fields. By introducing a hybrid receptive field module during encoder downsampling, this module cascades adaptive large-kernel convolutional units and multi-head attention mechanisms, achieving comprehensive capture of image features. The adaptive large-kernel convolutional unit combines static and dynamically deformable convolutions, flexibly adjusting the receptive field according to the shape of the missing region indicated by the binary mask. This accurately extracts local adaptive neighborhood features, effectively preserving edge and texture details. Simultaneously, the multi-head attention mechanism performs long-distance dependency modeling on these features, supplementing global contextual information and ensuring the semantic coherence of the inpainted content. Through the synergy of local detail capture and global semantic modeling, the method dynamically adapts to the shape of the missing region when extracting local information for detail restoration, while effectively modeling long-distance dependencies to ensure the coherence of the global structure, thereby significantly improving the accuracy of image inpainting.

[0046] In one embodiment, the step of upsampling and reconstructing the multi-scale coded tensor based on the decoder to obtain the reconstructed image of the image to be repaired further includes: During the upsampling process of the decoder, the encoding tensor and decoding tensor of the same scale are fused based on the supervision and guidance module, and the image is reconstructed based on the fused tensor to obtain the reconstructed image of the image to be repaired. The supervision and guidance module is used to predict fusion weights based on the image to be repaired as a supervision signal, and to fuse the encoded tensor and the decoded tensor based on the fusion weights.

[0047] During the upsampling process, the decoder will fuse the encoder tensor on the encoder side and the decoder tensor on the decoder side at the same scale based on the supervision and guidance module, and then perform subsequent reconstruction based on the fused tensor.

[0048] In encoder-decoder architectures, to prevent the vanishing gradient problem during deep network training and to allow the decoder to utilize detailed features extracted by the encoder in shallow layers, skip connections are introduced. These connections directly pass the encoder's feature maps and fuse them into the corresponding scale layers of the decoder. However, this simple fusion method treats all features equally, while in reality, for inpainting tasks, the importance of features from the encoder (containing real information about known regions) and features from the decoder (containing inferred inpainting information from semantics) is dynamically changing.

[0049] Therefore, a supervised guidance module was introduced. The supervised guidance module is a learnable intelligent fusion mechanism. For each upsampling stage of the decoder, it receives two inputs: one is the encoded tensor, which is at the same scale as the current stage, and is received from the encoder through skip connections; the other is the decoded tensor, which is at the current scale and is obtained after upsampling in the previous stage of the decoder.

[0050] The supervised guidance module predicts optimal fusion weights based on the original, intact real image corresponding to the image to be restored, which can be referred to as the supervision signal. Specifically, during the model training phase, this module processes the input encoding and decoding tensors in parallel, predicting preliminary versions of their respective reconstructed images through small convolutional networks. These two preliminary reconstruction results are then compared with the real, undamaged image, and backpropagated through an additional loss function (such as a guidance loss), thereby training the network to learn to generate appropriate fusion weights for the encoding and decoding tensors. These weights are spatially variable, meaning that at different locations in the image, the module can decide whether to trust more of the known information from the encoder or rely more on the generated information from the decoder.

[0051] After obtaining the fusion weights, the supervision and guidance module performs a weighted summation of the encoding and decoding tensors based on these weights, generating a fusion tensor that is more information-rich and accurate. This fusion tensor is then used for subsequent convolutional operations in the current stage of the decoder and for upsampling in the next stage.
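The weighted fusion step might look like the following sketch for single-channel feature maps. The inputs `enc_logits` and `dec_logits` stand in for the outputs of the weight-predicting convolutions, which are not modeled here. Normalizing by the sum of the two weights, so that the result is a weighted average, is an assumption, as are all names.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(encoded, decoded, enc_logits, dec_logits, eps=1e-8):
    """Per-position weighted average of two 2-D feature maps.

    A sigmoid turns each logit into a weight in (0, 1); the two weights are
    then normalized so they sum to (almost exactly) one at every position.
    """
    fused = []
    for e_row, d_row, a_row, b_row in zip(encoded, decoded, enc_logits, dec_logits):
        row = []
        for e, d, a, b in zip(e_row, d_row, a_row, b_row):
            w_e, w_d = sigmoid(a), sigmoid(b)
            row.append((w_e * e + w_d * d) / (w_e + w_d + eps))
        fused.append(row)
    return fused


encoded = [[1.0, 1.0]]      # skip-connection feature from the encoder
decoded = [[3.0, 3.0]]      # upsampled feature from the decoder
enc_logits = [[0.0, 4.0]]   # second position strongly trusts the encoder
dec_logits = [[0.0, -4.0]]
fused = fuse(encoded, decoded, enc_logits, dec_logits)
print([round(v, 3) for v in fused[0]])
```

At the first position the two sources are trusted equally and the result sits midway between them; at the second position the encoder weight dominates and the result stays close to the encoder value, which is the spatially varying behavior the text describes.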

[0052] Optionally, the encoder-decoder architecture can be constructed as shown in Figure 2, which is a schematic diagram of the encoder-decoder architecture provided by this invention. The architecture takes the concatenation of the image to be repaired and its binary mask as input and outputs a reconstructed image. It comprises four downsampling stages and four upsampling stages. To learn rich representations and capture the various kinds of information needed to fill the missing regions, a static-to-dynamic, local-to-global architecture design is adopted. Specifically, a hybrid receptive field module is constructed by cascading an adaptive large kernel convolutional unit (ALKC) and a multi-head attention mechanism unit (MHAT) and introducing skip connections, and this module is applied at multiple scales. Furthermore, a supervision and guidance module (SGM) is introduced to achieve high-fidelity filling through adaptive feature aggregation between the encoder and decoder stages.

[0053] In the encoder, C64S2 represents a convolutional layer with 64 output channels and a stride of 2, used for downsampling. C128S2 and C256S2 are subsequent convolutional layers that further increase the number of channels to 128 and 256 respectively, while reducing the feature map size. ReLU is the activation function. In the decoder, D128S2 and D64S2 correspond to the encoder layers, progressively reducing the number of channels and increasing the feature map size.
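To make the layer naming concrete, the sketch below traces how channel count and spatial size would change through the layers named above for a 256×256 input with four channels. The kernel sizes and padding are assumptions chosen so that each stride-2 layer exactly halves or doubles the resolution; the patent text gives only channel counts and strides, and any layers it does not name are not modeled.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def trace_shapes(height, width, in_channels=4):
    """Follow (name, channels, H, W) through the layers named in the text."""
    shapes = [("input", in_channels, height, width)]
    for name, channels in (("C64S2", 64), ("C128S2", 128), ("C256S2", 256)):
        height, width = conv_out(height), conv_out(width)
        shapes.append((name, channels, height, width))
    for name, channels in (("D128S2", 128), ("D64S2", 64)):
        height, width = deconv_out(height), deconv_out(width)
        shapes.append((name, channels, height, width))
    return shapes


shapes = trace_shapes(256, 256)
for name, channels, h, w in shapes:
    print(f"{name:>7}: {channels:>3} x {h} x {w}")
```

Under these assumptions the encoder goes from 256×256 to 32×32 while widening to 256 channels, and the two named decoder layers bring the map back to 128×128 with 64 channels.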

[0054] The structure of the supervision and guidance module can be as shown in Figure 3, a structural diagram of the supervision and guidance module provided by this invention.

[0055] The supervised guidance module can adaptively predict fusion weights. This prediction process is trained under the supervision of the ground truth, which is only used during the training phase.

[0056] Specifically, given two input features, the encoding feature $F_e$ and the decoding feature $F_d$, the preliminary reconstructed images $I_e$ and $I_d$ are obtained through two convolutional layers. Subsequently, the weights $W_e$ and $W_d$ are obtained through two additional convolutional layers and the sigmoid function. Finally, the features are combined by weighted summation to obtain the fused feature $F$: $F = W_e \odot F_e + W_d \odot F_d$, where $\odot$ represents element-wise multiplication.

In one embodiment, the adaptive large-kernel convolutional unit includes a large-kernel convolutional module and a dynamic convolutional module arranged in parallel. Extracting the local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution includes: convolving the intermediate encoded tensor with static large kernels of different sizes in the large kernel convolution module to obtain static spatial features; convolving the intermediate encoded tensor based on the deformable convolution in the dynamic convolution module to obtain dynamic shape adaptation features; concatenating and fusing the static spatial features and the dynamic shape adaptation features to obtain fused features; and weighting and combining the fused features with the intermediate encoded tensor through a gating mechanism to obtain the local adaptive neighborhood features.

[0057] Static spatial features are obtained by performing convolution operations on intermediate encoded tensors using static large kernels of different sizes within the large kernel convolution module. The role of the large kernel convolution module is to capture a large range of contextual information using a large, fixed-shape receptive field. To maintain a large receptive field while controlling computational cost, a large kernel convolution operation can be decomposed into a combination of several less computationally expensive convolution operations. For example, a K×K large kernel convolution can be decomposed into a depthwise separable convolution (including a spatial convolution and a pointwise convolution). Multiple branches can be set up in parallel within this module, each using a large kernel of a different size (e.g., equivalent to 5×5, 7×7, 11×11, etc. after decomposition), thereby capturing multi-scale static contextual information. The outputs of these branches collectively constitute the static spatial features.

[0058] Simultaneously, deformable convolution in the dynamic convolution module is used to convolve the intermediate encoded tensor to obtain dynamically shape-adaptive features. The core of the dynamic convolution module is deformable convolution. Unlike standard convolution, which samples at fixed grid points, deformable convolution additionally learns an offset field. This offset field acts on the sampling points of the convolution kernel, allowing it to dynamically change shape according to the input features. This enables the sampling points of the convolution kernel to cluster near the boundaries of the missing region or on meaningful textures in image inpainting tasks, thus accurately capturing neighborhood information of irregular shapes required for inpainting.

[0059] Static spatial features and dynamic shape adaptation features are concatenated and fused to obtain a fused feature. The concatenation operation usually involves connecting the two feature tensors along the channel dimension, and then using a 1×1 convolutional layer to fuse information and adjust the number of channels, thereby obtaining a powerful fused feature that combines a large static context with dynamic local adaptability.
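
The concatenation followed by a 1×1 convolution can be sketched as below. The channel counts and the random weights are illustrative assumptions, and `concat_and_fuse` is a hypothetical helper name; a 1×1 convolution is written here as a per-pixel linear map over the channel axis.

```python
import numpy as np

def concat_and_fuse(static_feats, dynamic_feats, weight):
    """Concatenate along channels, then mix channels with a 1x1 convolution.

    static_feats: (Cs, H, W); dynamic_feats: (Cd, H, W)
    weight: (C_out, Cs + Cd), the 1x1 convolution kernel.
    """
    stacked = np.concatenate([static_feats, dynamic_feats], axis=0)
    # A 1x1 convolution applies the same linear map at every pixel.
    return np.einsum("oc,chw->ohw", weight, stacked)

rng = np.random.default_rng(0)
static_feats = rng.standard_normal((12, 8, 8))   # e.g. large-kernel branches
dynamic_feats = rng.standard_normal((6, 8, 8))   # e.g. deformable branches
weight = rng.standard_normal((4, 18))
fused = concat_and_fuse(static_feats, dynamic_feats, weight)
```

The output keeps the spatial size of the inputs, and only the number of channels is adjusted (here from 18 to 4).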

[0060] The fused features are weighted and combined with the intermediate encoding tensor of the original input using a gating mechanism to obtain the final output, namely the locally adaptive neighborhood features. The gating mechanism is an adaptive residual connection method. Specifically, a gating branch containing convolution and a sigmoid activation function can be applied to the intermediate encoding tensor of the original input to generate a weight mask (gating weight) of the same size as the feature map with values between 0 and 1. The final output is the product of the fused features and the gating weight, plus the product of the original input features and (1 - gating weight). The gating mechanism allows the network to dynamically decide at each spatial location whether to retain more of the original features or adopt more of the newly extracted fused features, thereby achieving adaptive feature updates.
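
The gated combination described above (fused features times the gating weight, plus original features times one minus the gating weight) can be sketched as follows. The gating branch itself is not modeled; `gate_logits` is a hypothetical stand-in for its pre-sigmoid output.

```python
import numpy as np

def gated_residual(x, fused, gate_logits):
    """Blend original features x and fused features with a per-pixel gate."""
    gate = 1.0 / (1.0 + np.exp(-gate_logits))   # values in (0, 1)
    return gate * fused + (1.0 - gate) * x

x = np.zeros((2, 4, 4))          # original input features
fused = np.ones((2, 4, 4))       # newly extracted fused features
# Gate open in the central 2x2 region (e.g. a hole), closed elsewhere.
gate_logits = np.full((1, 4, 4), -30.0)
gate_logits[:, 1:3, 1:3] = 30.0
out = gated_residual(x, fused, gate_logits)
```

In this toy case the central region adopts the fused features while the surrounding region keeps the original ones, which is the adaptive update behavior the paragraph describes.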

[0061] Optionally, the architecture of the adaptive large kernel convolutional unit (ALKC) can be as shown in Figure 4, a schematic diagram of the ALKC architecture provided by this invention.

[0062] The adaptive large kernel convolutional unit comprises multiple parallel large kernel convolutional layers, labeled LCov-K5d2, LCov-K7d2, LCov-K14d3, and LCov-K21d3, where K represents the kernel size and d represents the dilation rate. For example, K21d3 indicates a kernel size of 21×21 and a dilation rate of 3. The deformable convolutional unit contains multiple parallel deformable convolutional layers, labeled DefCov-d2 and DefCov-d3. The output features of all parallel LCov and DefCov branches are collected: the static multi-scale features are concatenated with the dynamic shape-adaptive features along the channel dimension and then fused to generate the fused features. The Interpolation module performs the gated residual connection.

[0063] Specifically, a standard large-kernel convolution can be decomposed into three parts: a spatial small convolution (local window), a spatial large convolution (long-range window), and a channel convolution. For a kernel size $K$ and a dilation rate $d$, the large kernel convolution applied to an input $X$ can be decomposed in detail as follows. The spatial small convolution is: $X_1 = \mathrm{dwcov}_{\mathrm{small}}(X)$; the spatial large convolution is: $X_2 = \mathrm{dwcov}_{\mathrm{large},\,d}(X_1)$; the channel convolution is: $X_3 = \mathrm{cov}_{1 \times 1}(X_2)$; where dwcov represents depthwise convolution, and cov represents standard convolution.

[0064] Unlike the LKC method, which uses a static convolution window, deformable convolution introduces a dynamic strategy to determine the convolution window. In standard 2D convolution, the output $y$ at each location $p_0$ can be obtained through the following formula: $y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n)$; where $\mathcal{R}$ is the convolution window containing $|\mathcal{R}|$ positions, $p_n$ is a relative position offset within the window, $w$ is the convolution kernel, and $x$ is the input. To achieve deformable convolution, the regular region is dynamically expanded by introducing an offset $\Delta p_n$ at each position, and the formula is modified to: $y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$. This offset causes the convolution window to take on an irregular shape. Technically, the offsets are predicted by applying a standard convolution to the input features.
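
The difference between sampling on a regular window and sampling at offset positions can be illustrated for a single output location. This is a minimal single-channel sketch: the offsets are supplied by hand rather than predicted by a convolution, fractional positions are read with bilinear interpolation (a common choice that the specification does not state), and the helper names are hypothetical.

```python
import numpy as np

def bilinear(x, r, c):
    """Bilinearly sample 2D array x at fractional (row, col); zero outside."""
    h, w = x.shape
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < h and 0 <= cc < w:
                weight = (1 - abs(r - rr)) * (1 - abs(c - cc))
                value += weight * x[rr, cc]
    return value

def deformable_conv_at(x, kernel, offsets, r, c):
    """Sum of kernel taps times input sampled at window position plus offset.

    offsets: (9, 2) per-tap (d_row, d_col); all zeros gives the regular window.
    """
    grid = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]  # 3x3 window
    total = 0.0
    for n, (dr, dc) in enumerate(grid):
        total += kernel[dr + 1, dc + 1] * bilinear(
            x, r + dr + offsets[n, 0], c + dc + offsets[n, 1])
    return total

x = np.arange(25, dtype=float).reshape(5, 5)   # x[r, c] = 5r + c
kernel = np.full((3, 3), 1.0 / 9.0)            # averaging kernel
zero = np.zeros((9, 2))
shift = np.tile([0.0, 0.5], (9, 1))            # move every tap half a pixel right
standard = deformable_conv_at(x, kernel, zero, 2, 2)
shifted = deformable_conv_at(x, kernel, shift, 2, 2)
```

With zero offsets the result equals the ordinary 3×3 average at the center; with every tap shifted half a pixel to the right, the same kernel reads a different neighborhood and the result changes accordingly.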

[0065] The features learned by the multiple parallel large-kernel convolution (LCov) and deformable convolution (DefCov) operations (using different kernel sizes and dilation rates) are concatenated and fused. The fused features $X_f$ have the same shape as the original features $X$ and are used in the subsequent aggregation operation.

[0066] The model employs gated residual connections to adaptively fuse the learned residual features with the input features, enabling it to update the features inside holes under spatially varying patterns while preserving the fine details of the features outside them. A gating module containing convolution, normalization, and the sigmoid function is applied to the input features $X$ to generate gating weights $G$ of the same shape. Interpolation is then performed between the fused features $X_f$ and $X$ through $G$: $Y = G \odot X_f + (1 - G) \odot X$; where $\odot$ represents element-wise multiplication.

[0067] In one embodiment, the large kernel convolution module is constructed based on spatial small convolution, spatial large convolution, and channel convolution.

[0068] To effectively reduce the number of parameters and computational complexity while maintaining a large receptive field, large kernel convolution modules can be constructed based on spatial small convolution, spatial large convolution, and channel convolution. This structure is an effective decomposition of the standard large kernel convolution.

[0069] Specifically, a standard large-kernel (e.g., K×K) convolution operation can be broken down into the following three steps performed sequentially: Spatially small convolutions: First, a depthwise separable convolution with a small kernel size (e.g., d×d, where d is much smaller than K) is applied to the input feature map to capture local spatial relationships. This corresponds to operations within a local window.

[0070] Spatially Large Convolution: Next, another depthwise separable convolution is applied to the output feature map from the previous step, but with a larger kernel size (e.g., (K / d) × (K / d)) and an appropriate dilation rate. This step is used to propagate information over a larger spatial range, achieving the effect of a long-range window.

[0071] Channel convolution: Finally, a standard 1×1 convolution (also known as pointwise convolution) is applied to mix information along the channel dimension.

[0072] Through the above decomposition, a large-kernel convolutional layer can be equivalently simulated using three convolutional layers with relatively low computational cost. This approach greatly reduces the number of parameters and computational cost of the model, making it possible to apply very large receptive fields in the network without incurring excessive computational burden.
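
The reduction in parameters can be checked with simple arithmetic. The sketch below assumes 64 input and output channels, K = 21 and d = 3, uses the example kernel sizes given in the text (d×d for the small convolution and (K/d)×(K/d) for the large one), and ignores biases; these numbers are illustrative only.

```python
def standard_params(channels, k):
    """Weights of a dense k x k convolution with equal in/out channels."""
    return k * k * channels * channels

def decomposed_params(channels, k_small, k_large):
    """Weights of depthwise small + depthwise large + 1x1 convolution."""
    depthwise_small = k_small * k_small * channels
    depthwise_large = k_large * k_large * channels
    pointwise = channels * channels
    return depthwise_small + depthwise_large + pointwise

channels, K, d = 64, 21, 3
full = standard_params(channels, K)
split = decomposed_params(channels, k_small=d, k_large=K // d)
print(full, split, round(full / split, 1))
```

Under these assumptions the dense 21×21 layer needs about 1.8 million weights, while the three-step decomposition needs fewer than eight thousand.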

[0073] In one embodiment, the step of modeling long-distance dependencies of the local adaptive neighborhood features processed by the adaptive large-kernel convolutional unit based on a self-attention mechanism to generate global context features includes: Learnable position embedding codes are added to the local adaptive neighborhood features to obtain an input vector with positional information; The input vector is subjected to layer normalization to generate the query vector, key vector, and value vector of the input vector; The query vector, the key vector, and the value vector are split into multiple parallel attention heads; Calculate the scaled dot product attention in each attention head to obtain the attention output of each attention head; The outputs of all attention heads are concatenated, and the concatenated data is nonlinearly transformed through a feedforward neural network layer to generate the global context features.

[0074] The core objective of the multi-head attention mechanism unit is to model long-distance dependencies on the local adaptive neighborhood features processed by the adaptive large-kernel convolutional unit based on self-attention, thereby generating global contextual features. Its specific implementation process may include the following steps: A learnable positional embedding encoding is added to the input local adaptive neighborhood feature tensor to obtain an input vector with positional information. Since the self-attention mechanism is permutation-equivariant (i.e., reordering the input elements merely reorders the outputs, so the mechanism itself carries no notion of position), positional information must be explicitly introduced. The positional embedding encoding is a learnable parameter with the same dimension as the input features; it is added to the input features, enabling the model to distinguish features from different locations in the image.

[0075] The input vector containing location information is subjected to layer normalization, and then different linear projection layers are used to generate query, key, and value vectors, respectively. Layer normalization helps stabilize the training process. Query, key, and value are the three core elements of the self-attention mechanism, and they are all derived from the same input vector.

[0076] The query vector, key vector, and value vector are each split into multiple parallel attention heads. The original high-dimensional Q / K / V vectors are further divided into M low-dimensional sub-vectors along the channel dimension, with each sub-vector group forming a head. This allows different attention heads to learn to focus on different aspects of the input information (e.g., one head might focus on texture, while another might focus on structure), enabling the model to learn information jointly from different representation subspaces.

[0077] Scaled dot product attention is computed independently in each attention head to obtain the attention output of each attention head. The outputs of all attention heads are concatenated and then nonlinearly transformed through a feedforward neural network layer to generate global context features. The concatenation operation recombines the output vectors of the M heads into a high-dimensional vector. This is then typically passed through a linear projection layer, followed by a feedforward network (usually consisting of two linear layers and a nonlinear activation function) containing residual connections and layer normalization for further feature extraction and nonlinear transformation.
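
The steps above (position embedding, layer normalization, projection to query, key, and value, splitting into heads, scaled dot-product attention, and concatenation) can be sketched in NumPy. The tensor sizes, the random projection matrices, and the function names are assumptions for illustration; the feedforward network and the residual connections are omitted to keep the sketch short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(tokens, pos_embed, wq, wk, wv, wo, num_heads):
    """tokens, pos_embed: (N, D); wq, wk, wv, wo: (D, D)."""
    n, d = tokens.shape
    head_dim = d // num_heads
    z = layer_norm(tokens + pos_embed)        # add positions, then normalize

    def split_heads(m):
        # Project, then split channels into heads: (heads, N, head_dim).
        return (z @ m).reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(wq), split_heads(wk), split_heads(wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (heads, N, N)
    attn = softmax(scores)
    heads = attn @ v                                        # (heads, N, head_dim)
    merged = heads.transpose(1, 0, 2).reshape(n, d)         # concatenate heads
    return merged @ wo, attn                                # output projection

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 4, 4))        # (C, H, W) feature map
tokens = feat.reshape(16, -1).T               # (H*W, C): one token per pixel
pos_embed = rng.standard_normal(tokens.shape) * 0.02
wq, wk, wv, wo = (rng.standard_normal((16, 16)) * 0.1 for _ in range(4))
out, attn = multi_head_attention(tokens, pos_embed, wq, wk, wv, wo, num_heads=4)
```

Each row of each head's attention map sums to one, so every output token is a weighted mixture of all positions in the feature map, which is how distant known regions can inform a missing region.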

[0078] The structure of the multi-head attention mechanism unit can be as shown in Figure 5, a schematic diagram of the multi-head attention mechanism unit provided by this invention. In the figure, Position Embedding is the position embedding layer; Layer Normalization is the layer normalization layer; and Multi-Head Attention is the multi-head attention mechanism.

[0079] For the input feature tensor $X$, a learnable positional embedding layer is added to encode the position of each feature vector, thereby obtaining a new tensor $Z_0$. Subsequently, $Z_0$ is fed into N stacked submodules. For the $l$-th submodule of MHAT, the detailed processing can be represented as follows: $Z'_l = \mathrm{MHA}(\mathrm{LN}(Z_{l-1})) + Z_{l-1}$; $Z_l = \mathrm{FFN}(\mathrm{LN}(Z'_l)) + Z'_l$; where $\mathrm{LN}$ represents layer normalization, $\mathrm{MHA}$ represents multi-head attention, and $\mathrm{FFN}$ represents a feedforward network.

[0080] In one embodiment, the encoder and the decoder are obtained through joint training based on a target loss function; the target loss function is constructed based on reconstruction loss, perceptual loss, style loss, and total variation loss; the reconstruction loss is used to characterize the difference between the reconstructed image and the original image at the pixel level; the perceptual loss is used to characterize the difference in deep semantic features between the reconstructed image and the original image; the style loss is used to characterize the consistency of texture distribution between the reconstructed image and the original image; and the total variation loss is used to characterize the spatial smoothness of the reconstructed image.

[0081] It should be noted that the encoder and decoder are jointly trained based on a composite objective loss function. This objective loss function is composed of multiple weighted components, which may include reconstruction loss, perceptual loss, style loss, and total variation loss.

[0082] The target loss function is a mathematical expression that guides the learning of a neural network model; the model's parameters are optimized by minimizing the value of this function. A composite loss function is used to constrain the quality of the repair results from multiple dimensions.

[0083] Reconstruction loss: This term characterizes the pixel-level difference between the reconstructed image and the original, undamaged ground truth image. The most commonly used reconstruction loss is the L1 loss (mean absolute error). For finer control, different weights can be applied to pixels in known regions (mask 0) and pixels in unknown (reconstructed) regions (mask 1), as we are usually more concerned with the accuracy of the reconstructed region. This loss term ensures that the reconstructed result is as close as possible to the truth at the pixel level.
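
A region-weighted L1 loss of the kind described above can be sketched as follows. The weights `hole_weight` and `valid_weight` are hypothetical values chosen for illustration; the specification does not give them.

```python
import numpy as np

def reconstruction_loss(pred, target, mask, hole_weight=6.0, valid_weight=1.0):
    """Weighted L1 loss; mask is 1 in missing regions and 0 in known regions."""
    abs_err = np.abs(pred - target)
    hole = (abs_err * mask).mean()          # error inside the missing region
    valid = (abs_err * (1 - mask)).mean()   # error in the known region
    return hole_weight * hole + valid_weight * valid

pred = np.zeros((1, 4, 4))
target = np.ones((1, 4, 4))
mask = np.zeros((1, 4, 4))
mask[:, 1:3, 1:3] = 1.0                     # a 2x2 hole
loss = reconstruction_loss(pred, target, mask)
```

Here every pixel has an absolute error of 1, so the 4 hole pixels contribute 6 × 4/16 and the 12 known pixels contribute 1 × 12/16, giving a total of 2.25.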

[0084] Perceptual loss: Used to characterize the difference in deep semantic features between the reconstructed image and the original image. Instead of directly comparing pixel values, this loss inputs both images into a pre-trained deep neural network and then compares the differences (e.g., L1 distance) between the feature maps output by their intermediate layers. Perceptual loss better measures the perceptual similarity of images, helping to generate visually more natural and plausible restoration results, and effectively avoiding the blurring problems that L1 loss can easily cause.

[0085] Style loss is used to characterize the consistency of texture distribution between the reconstructed image and the real image. It is typically achieved by calculating the difference between the Gram matrices of deep feature maps. The Gram matrix captures the correlations between features, thus representing the texture style of the image. Style loss helps generate restored areas with realistic texture details, rather than smooth color blocks.
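
The Gram-matrix comparison can be sketched as below. In practice the feature maps would come from a pre-trained network; random arrays are used here as stand-ins, and the normalization by C·H·W and the L1 distance between Gram matrices are assumptions rather than choices stated in the specification.

```python
import numpy as np

def gram_matrix(feat):
    """Channel-by-channel correlation of a (C, H, W) feature map."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_loss(feat_pred, feat_target):
    return np.abs(gram_matrix(feat_pred) - gram_matrix(feat_target)).mean()

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 6, 6))
# Shuffle pixel positions identically in every channel.
perm = rng.permutation(36)
shuffled = feat.reshape(4, 36)[:, perm].reshape(4, 6, 6)
same_texture = style_loss(feat, shuffled)
different = style_loss(feat, 2.0 * feat)
```

Because the Gram matrix sums over all spatial positions, rearranging the positions leaves it unchanged, whereas changing the feature statistics does not. This is why the loss measures texture distribution rather than pixel-level alignment.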

[0086] Total variation loss: Used to characterize the spatial smoothness of the reconstructed image. It is a regularization term that encourages the generation of smoother, less noisy images by penalizing excessive gradient variations between adjacent pixels. This helps eliminate isolated artifacts that may appear in the restoration result.
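
The penalty on differences between adjacent pixels can be sketched as follows, assuming an anisotropic (L1) form averaged over the image; the specification does not state which variant is used.

```python
import numpy as np

def total_variation(img):
    """Mean absolute difference between vertical and horizontal neighbours."""
    dh = np.abs(img[:, 1:, :] - img[:, :-1, :]).mean()
    dw = np.abs(img[:, :, 1:] - img[:, :, :-1]).mean()
    return dh + dw

flat = np.full((3, 8, 8), 0.5)                   # perfectly smooth image
checker = np.indices((8, 8)).sum(axis=0) % 2     # 0/1 checkerboard pattern
noisy = np.broadcast_to(checker, (3, 8, 8)).astype(float)
tv_flat = total_variation(flat)
tv_noisy = total_variation(noisy)
```

A constant image gives a loss of zero, while the checkerboard, in which every pair of neighbours differs by 1, gives the largest value for images in [0, 1].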

[0087] The sum of the above loss functions, weighted according to certain weights, constitutes the complete target loss function. During model training, the backpropagation algorithm is used to calculate the gradient of the loss function with respect to the model parameters, and the optimizer is used to update the parameters to continuously reduce the total loss until the model converges.

[0088] The image inpainting apparatus that integrates local adaptive and global dynamic receptive fields provided by the present invention is described below. The apparatus described below and the image inpainting method described above may be referred to in correspondence with each other.

[0089] As shown in Figure 6, the device includes: a splicing module 610, used to concatenate the image to be repaired with the binary mask of the image to be repaired to obtain concatenated data; a downsampling module 620, used to perform multi-scale downsampling on the concatenated data based on the encoder to obtain the multi-scale encoded tensor of the concatenated data, wherein, during the downsampling process of the encoder, local feature extraction and global semantic modeling are performed on the intermediate encoded tensor in the multi-scale encoded tensor based on the hybrid receptive field module; the hybrid receptive field module includes a cascaded adaptive large kernel convolutional unit and a multi-head attention mechanism unit; the adaptive large kernel convolutional unit is used to extract local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution to adapt to the missing regions indicated by the binary mask; the multi-head attention mechanism unit is used to perform long-distance dependency modeling on the local adaptive neighborhood features processed by the adaptive large kernel convolutional unit based on a self-attention mechanism to generate global context features; and a repair module 630, used to upsample and reconstruct the multi-scale encoded tensor based on the decoder to obtain the reconstructed image of the image to be repaired.

[0090] The image inpainting device provided by this invention integrates local adaptive and global dynamic receptive fields. By introducing a hybrid receptive field module during encoder downsampling, and cascading an adaptive large-kernel convolutional unit and a multi-head attention mechanism unit, it achieves comprehensive capture of image features. The adaptive large-kernel convolutional unit combines static convolution and dynamic deformable convolution, flexibly adjusting the receptive field according to the shape of the missing region indicated by the binary mask, accurately extracting local adaptive neighborhood features, and effectively preserving edge and texture details. Simultaneously, the multi-head attention mechanism unit performs long-distance dependency modeling on these features, supplementing global contextual information and ensuring the semantic coherence of the inpainted content. Through the synergy of local detail capture and global semantic modeling, the device dynamically adapts to the shape of the missing region when extracting local information for detail restoration, while effectively modeling long-distance dependencies to ensure the coherence of the global structure, thereby significantly improving the accuracy of image inpainting.

[0091] In one embodiment, the repair module 630 is specifically used for: The step of upsampling and reconstructing the multi-scale coded tensor based on the decoder to obtain the reconstructed image of the image to be repaired further includes: During the upsampling process of the decoder, the encoding tensor and decoding tensor of the same scale are fused based on the supervision and guidance module, and the image is reconstructed based on the fused tensor to obtain the reconstructed image of the image to be repaired. The supervision and guidance module is used to predict fusion weights based on the image to be repaired as a supervision signal, and to fuse the encoded tensor and the decoded tensor based on the fusion weights.

[0092] In one embodiment, the downsampling module 620 is specifically used for: The adaptive large-kernel convolutional unit includes a parallel large-kernel convolutional module and a dynamic convolutional module; the extraction of local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution includes: The intermediate encoding tensor is convolved with static large kernels of different sizes in the large kernel convolution module to obtain static spatial features; The intermediate encoding tensor is convolved based on the deformable convolution in the dynamic convolution module to obtain dynamic shape adaptation features. The static spatial features and the dynamic shape adaptation features are spliced and fused to obtain the fused features; The fused features are weighted and combined with the intermediate encoding tensor through a gating mechanism to obtain the local adaptive neighborhood features.

[0093] In one embodiment, the downsampling module 620 is further configured to: The large kernel convolution module is constructed based on spatial small convolution, spatial large convolution, and channel convolution.

[0094] In one embodiment, the downsampling module 620 is further configured to: The method involves modeling long-distance dependencies of the local adaptive neighborhood features processed by the adaptive large-kernel convolutional unit based on a self-attention mechanism to generate global context features, including: Learnable position embedding codes are added to the local adaptive neighborhood features to obtain an input vector with positional information; The input vector is subjected to layer normalization to generate the query vector, key vector, and value vector of the input vector; The query vector, the key vector, and the value vector are split into multiple parallel attention heads; Calculate the scaled dot product attention in each attention head to obtain the attention output of each attention head; The outputs of all attention heads are concatenated, and the concatenated data is nonlinearly transformed through a feedforward neural network layer to generate the global context features.

[0095] In one embodiment, the repair module 630 is further configured such that: the encoder and the decoder are obtained through joint training based on the target loss function; the target loss function is constructed based on reconstruction loss, perceptual loss, style loss, and total variation loss; the reconstruction loss is used to characterize the difference between the reconstructed image and the original image at the pixel level; the perceptual loss is used to characterize the difference in deep semantic features between the reconstructed image and the original image; the style loss is used to characterize the consistency of texture distribution between the reconstructed image and the original image; and the total variation loss is used to characterize the spatial smoothness of the reconstructed image.

[0096] Figure 7 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 7, the electronic device may include a processor 710, a communications interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communications interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 can call logical instructions in the memory 730 to execute an image inpainting method that fuses local adaptation and global dynamic receptive field. This method includes: concatenating the image to be inpainted with a binary mask of the image to be inpainted to obtain concatenated data; performing multi-scale downsampling on the concatenated data based on the encoder to obtain the multi-scale encoded tensor of the concatenated data, wherein, during the downsampling process of the encoder, local feature extraction and global semantic modeling are performed on the intermediate encoded tensor in the multi-scale encoded tensor based on the hybrid receptive field module; the hybrid receptive field module includes a cascaded adaptive large kernel convolutional unit and a multi-head attention mechanism unit; the adaptive large kernel convolutional unit is used to extract local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution to adapt to the missing regions indicated by the binary mask; the multi-head attention mechanism unit is used to perform long-distance dependency modeling on the local adaptive neighborhood features processed by the adaptive large kernel convolutional unit based on a self-attention mechanism to generate global context features; and upsampling and reconstructing the multi-scale encoded tensor based on the decoder to obtain the reconstructed image of the image to be inpainted.

[0097] Furthermore, the logical instructions in the aforementioned memory 730 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0098] In another aspect, the present invention also provides a computer program product including a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the image inpainting method integrating local adaptation and global dynamic receptive field provided by the above methods, the method including: concatenating the image to be repaired with the binary mask of the image to be repaired to obtain concatenated data; performing multi-scale downsampling on the concatenated data based on the encoder to obtain the multi-scale encoded tensor of the concatenated data, wherein, during the downsampling process of the encoder, local feature extraction and global semantic modeling are performed on the intermediate encoded tensor in the multi-scale encoded tensor based on the hybrid receptive field module; the hybrid receptive field module includes a cascaded adaptive large kernel convolutional unit and a multi-head attention mechanism unit; the adaptive large kernel convolutional unit is used to extract local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution to adapt to the missing regions indicated by the binary mask; the multi-head attention mechanism unit is used to perform long-distance dependency modeling on the local adaptive neighborhood features processed by the adaptive large kernel convolutional unit based on a self-attention mechanism to generate global context features; and upsampling and reconstructing the multi-scale encoded tensor based on the decoder to obtain the reconstructed image of the image to be repaired.

[0099] In another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements an image inpainting method that integrates local adaptation and global dynamic receptive field provided by the above methods, the method comprising: concatenating the image to be inpainted with a binary mask of the image to be inpainted to obtain concatenated data; The multi-scale encoded tensor of the spliced data is obtained by downsampling the spliced data at multiple scales based on the encoder. During the downsampling process of the encoder, local feature extraction and global semantic modeling are performed on the intermediate coding tensor in the multi-scale coding tensor based on the hybrid receptive field module. The hybrid receptive field module includes a cascaded adaptive large kernel convolutional unit and a multi-head attention mechanism unit. The adaptive large kernel convolutional unit is used to extract local adaptive neighborhood features of the intermediate coding tensor based on static convolution and dynamic deformable convolution to adapt to the missing regions indicated by the binary mask. The multi-head attention mechanism unit is used to perform long-distance dependency modeling on the local adaptive neighborhood features processed by the adaptive large kernel convolutional unit based on a self-attention mechanism to generate global context features. The multi-scale coded tensor is upsampled and reconstructed based on the decoder to obtain the reconstructed image of the image to be repaired.

[0100] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0101] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0102] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An image inpainting method integrating local adaptation and global dynamic receptive field, characterized in that it comprises: concatenating the image to be inpainted with a binary mask of the image to be inpainted to obtain concatenated data; performing multi-scale downsampling on the concatenated data based on an encoder to obtain a multi-scale encoded tensor of the concatenated data, wherein, during the downsampling process of the encoder, local feature extraction and global semantic modeling are performed on an intermediate encoded tensor of the multi-scale encoded tensor based on a hybrid receptive field module, the hybrid receptive field module comprising an adaptive large-kernel convolutional unit and a multi-head attention mechanism unit connected in cascade, the adaptive large-kernel convolutional unit being used to extract local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution so as to adapt to the missing regions indicated by the binary mask, and the multi-head attention mechanism unit being used to perform long-distance dependency modeling, based on a self-attention mechanism, on the local adaptive neighborhood features output by the adaptive large-kernel convolutional unit to generate global context features; and upsampling and reconstructing the multi-scale encoded tensor based on a decoder to obtain a reconstructed image of the image to be inpainted.
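The notion of a multi-scale encoded tensor in claim 1 can be pictured with a deliberately simplified sketch. Here 2x2 average pooling stands in for the learned downsampling stages of the encoder, which the claim does not specify; the function names and the single-channel nested-list layout are assumptions for illustration only.

```python
def avg_pool_2x2(x):
    """Halve height and width of a 2-D list by averaging each 2x2 block."""
    h, w = len(x) // 2, len(x[0]) // 2
    return [[(x[2 * i][2 * j] + x[2 * i][2 * j + 1]
              + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)]
            for i in range(h)]


def multi_scale_encode(x, num_scales):
    """Return the input followed by successively downsampled copies.

    Average pooling is a stand-in (assumption) for learned encoder stages.
    """
    scales = [x]
    for _ in range(num_scales - 1):
        x = avg_pool_2x2(x)
        scales.append(x)
    return scales


# A 4x4 single-channel map with values 0..15 in row-major order.
feature_map = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pyramid = multi_scale_encode(feature_map, num_scales=3)
print([(len(level), len(level[0])) for level in pyramid])  # prints: [(4, 4), (2, 2), (1, 1)]
```

In the claimed method the hybrid receptive field module would operate on one or more of the intermediate levels of such a pyramid, and the decoder would consume the levels in reverse order.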

2. The image inpainting method integrating local adaptation and global dynamic receptive field according to claim 1, characterized in that upsampling and reconstructing the multi-scale encoded tensor based on the decoder to obtain the reconstructed image of the image to be inpainted further comprises: during the upsampling process of the decoder, fusing the encoded tensor and the decoded tensor of the same scale based on a supervision and guidance module, and reconstructing the image based on the fused tensor to obtain the reconstructed image of the image to be inpainted; wherein the supervision and guidance module is used to predict fusion weights using the image to be inpainted as a supervision signal, and to fuse the encoded tensor and the decoded tensor based on the fusion weights.
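One plausible reading of the weighted fusion in claim 2 is a per-element convex combination of the encoded and decoded tensors. The sketch below assumes that form; the sigmoid predictor and its parameters `w_enc`, `w_dec`, and `bias` are hypothetical stand-ins for the learned supervision and guidance module, whose internal structure and training are not given in the claim.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_fusion_weights(enc, dec, w_enc=1.0, w_dec=-1.0, bias=0.0):
    """Hypothetical weight predictor: one weight in (0, 1) per element."""
    return [sigmoid(w_enc * e + w_dec * d + bias) for e, d in zip(enc, dec)]


def fuse(enc, dec, weights):
    """Weight w goes to the encoder feature and 1 - w to the decoder feature."""
    return [w * e + (1.0 - w) * d for e, d, w in zip(enc, dec, weights)]


enc = [0.5, -1.0, 2.0]   # flattened same-scale encoder features (illustrative values)
dec = [0.5, 1.0, 0.0]    # flattened same-scale decoder features (illustrative values)
weights = predict_fusion_weights(enc, dec)
fused = fuse(enc, dec, weights)
print(fused)
```

Because each weight lies strictly between 0 and 1, every fused value stays between the corresponding encoder and decoder values, so neither source can be amplified by the fusion itself.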

3. The image inpainting method integrating local adaptation and global dynamic receptive field according to claim 1, characterized in that the adaptive large-kernel convolutional unit comprises a large-kernel convolution module and a dynamic convolution module arranged in parallel, and extracting the local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution comprises: convolving the intermediate encoded tensor with static large kernels of different sizes in the large-kernel convolution module to obtain static spatial features; convolving the intermediate encoded tensor with the deformable convolution in the dynamic convolution module to obtain dynamic shape-adaptive features; concatenating and fusing the static spatial features and the dynamic shape-adaptive features to obtain fused features; and weighting and combining the fused features with the intermediate encoded tensor through a gating mechanism to obtain the local adaptive neighborhood features.
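The data flow of claim 3 can be sketched in one dimension to keep the example short. Everything concrete here is an assumption for illustration: the 1-D signal, the kernel values, the fixed offsets (in a real deformable convolution they are predicted from the input), the per-branch mixing weights `mix` standing in for the fusion layer, and the scalar sigmoid gate.

```python
import math


def static_conv1d(x, kernel):
    """Same-size 1-D cross-correlation with zero padding (odd kernel length)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out


def sample_linear(x, pos):
    """Linearly interpolate x at a fractional position; outside the signal reads as 0."""
    left = math.floor(pos)
    frac = pos - left

    def at(j):
        return x[j] if 0 <= j < len(x) else 0.0

    return (1.0 - frac) * at(left) + frac * at(left + 1)


def deformable_conv1d(x, kernel, offsets):
    """Tap k at output position i samples x at i + (k - half) + offsets[i][k]."""
    half = len(kernel) // 2
    return [sum(w * sample_linear(x, i + k - half + offsets[i][k])
                for k, w in enumerate(kernel))
            for i in range(len(x))]


def adaptive_large_kernel_unit(x, static_kernels, deform_kernel, offsets, mix, gate_logit):
    # Parallel branches: static kernels of different sizes plus one deformable branch.
    branches = [static_conv1d(x, k) for k in static_kernels]
    branches.append(deformable_conv1d(x, deform_kernel, offsets))
    # Treat the branches as concatenated channels and fuse them with per-branch weights.
    fused = [sum(m * b[i] for m, b in zip(mix, branches)) for i in range(len(x))]
    # Scalar gate (assumed form): blend the fused features with the input tensor.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * f + (1.0 - g) * xi for f, xi in zip(fused, x)]


x = [1.0, 2.0, 4.0, 8.0]
offsets = [[0.0, 0.5, 0.0] for _ in x]  # shift only the centre tap by half a sample
features = adaptive_large_kernel_unit(
    x,
    static_kernels=[[1 / 3] * 3, [0.2] * 5],
    deform_kernel=[0.25, 0.5, 0.25],
    offsets=offsets,
    mix=[0.3, 0.3, 0.4],
    gate_logit=0.0,
)
print(len(features))  # prints: 4
```

With all offsets set to zero the deformable branch reduces to an ordinary static convolution, which is a convenient sanity check when experimenting with the sketch.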

4. The image inpainting method integrating local adaptation and global dynamic receptive field according to claim 3, characterized in that the large-kernel convolution module is constructed based on a small spatial convolution, a large spatial convolution, and a channel convolution.

5. The image inpainting method integrating local adaptation and global dynamic receptive field according to claim 1, characterized in that performing long-distance dependency modeling, based on the self-attention mechanism, on the local adaptive neighborhood features output by the adaptive large-kernel convolutional unit to generate the global context features comprises: adding learnable position embeddings to the local adaptive neighborhood features to obtain an input vector carrying positional information; applying layer normalization to the input vector and generating a query vector, a key vector, and a value vector from the normalized input vector; splitting the query vector, the key vector, and the value vector into multiple parallel attention heads; computing scaled dot-product attention in each attention head to obtain the attention output of each attention head; and concatenating the outputs of all attention heads and applying a nonlinear transformation to the concatenated result through a feedforward neural network layer to generate the global context features.
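The sequence of steps in claim 5 can be followed in a small pure-Python sketch. The projection matrices `w_q`, `w_k`, `w_v`, the two feedforward matrices, the ReLU activation, and the use of identity weights in the example are all assumptions; the claim names the steps but not these details, and residual connections are omitted because the claim does not recite them.

```python
import math


def matmul(a, b):
    """Plain matrix product of two 2-D lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def attention_head(q, k, v):
    """Scaled dot-product attention for one head: softmax(q k^T / sqrt(d)) v."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))])
    return out


def split_heads(m, n_heads):
    d = len(m[0]) // n_heads
    return [[row[h * d:(h + 1) * d] for row in m] for h in range(n_heads)]


def global_context(features, pos_embed, w_q, w_k, w_v, w_ff1, w_ff2, n_heads):
    # 1. Add position embeddings (learnable in the claim, fixed values here).
    x = [[f + p for f, p in zip(fr, pr)] for fr, pr in zip(features, pos_embed)]
    # 2. Layer normalization, then query / key / value projections.
    x = [layer_norm(token) for token in x]
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # 3-4. Split into heads and run scaled dot-product attention in each.
    heads = [attention_head(qh, kh, vh)
             for qh, kh, vh in zip(split_heads(q, n_heads),
                                   split_heads(k, n_heads),
                                   split_heads(v, n_heads))]
    # 5. Concatenate head outputs per token, then a feedforward layer with ReLU (assumed).
    concat = [sum((h[t] for h in heads), []) for t in range(len(x))]
    hidden = [[max(0.0, z) for z in row] for row in matmul(concat, w_ff1)]
    return matmul(hidden, w_ff2)


d_model, n_heads = 4, 2
identity = [[1.0 if i == j else 0.0 for j in range(d_model)] for i in range(d_model)]
features = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0], [2.0, 2.0, 0.0, 1.0]]
pos_embed = [[0.1 * t] * d_model for t in range(len(features))]
context = global_context(features, pos_embed, identity, identity, identity,
                         identity, identity, n_heads)
print(len(context), len(context[0]))  # prints: 3 4
```

Each row of `features` plays the role of one spatial position of the local adaptive neighborhood features, flattened into a token sequence; every output token is a mixture over all positions, which is what gives the unit its global receptive field.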

6. The image inpainting method integrating local adaptation and global dynamic receptive field according to claim 1, characterized in that the encoder and the decoder are obtained through joint training based on a target loss function; the target loss function is constructed based on a reconstruction loss, a perceptual loss, a style loss, and a total variation loss; the reconstruction loss characterizes the pixel-level difference between the reconstructed image and the original image; the perceptual loss characterizes the difference in deep semantic features between the reconstructed image and the original image; the style loss characterizes the consistency of texture distribution between the reconstructed image and the original image; and the total variation loss characterizes the spatial smoothness of the reconstructed image.
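A weighted sum is one common way to combine the four terms of claim 6, and the sketch below assumes it. The L1 distance for the reconstruction and perceptual terms, the Gram-matrix form of the style term, the precomputed feature maps (which in practice would come from some pretrained network), and all weight values are assumptions chosen for illustration; the claim states only what each term characterizes.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally shaped 2-D lists."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def gram_matrix(feats):
    """feats is C x N (channels x flattened positions); returns a normalized C x C Gram matrix."""
    c, n = len(feats), len(feats[0])
    return [[sum(p * q for p, q in zip(fi, fj)) / (c * n) for fj in feats] for fi in feats]


def total_variation(img):
    """Sum of absolute differences between neighbouring pixels, divided by the pixel count."""
    h, w = len(img), len(img[0])
    horiz = sum(abs(img[i][j + 1] - img[i][j]) for i in range(h) for j in range(w - 1))
    vert = sum(abs(img[i + 1][j] - img[i][j]) for i in range(h - 1) for j in range(w))
    return (horiz + vert) / (h * w)


def target_loss(pred, target, feat_pred, feat_target, weights):
    """Weighted sum of the four terms; returns the total and the individual terms."""
    terms = {
        "reconstruction": mean_abs_diff(pred, target),          # pixel-level difference
        "perceptual": mean_abs_diff(feat_pred, feat_target),    # deep-feature difference
        "style": mean_abs_diff(gram_matrix(feat_pred), gram_matrix(feat_target)),
        "tv": total_variation(pred),                            # smoothness of the output
    }
    return sum(weights[name] * value for name, value in terms.items()), terms


# Illustrative single-channel images and 2-channel feature maps (hypothetical values).
pred = [[0.0, 1.0], [0.0, 1.0]]
target = [[0.0, 1.0], [1.0, 1.0]]
feat_pred = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
feat_target = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
weights = {"reconstruction": 1.0, "perceptual": 0.1, "style": 10.0, "tv": 0.01}  # hypothetical
total, terms = target_loss(pred, target, feat_pred, feat_target, weights)
print(terms["reconstruction"], terms["tv"])  # prints: 0.25 0.5
```

Returning the individual terms alongside the total makes it easier to see which component dominates when the hypothetical weights are tuned.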

7. An image inpainting device integrating local adaptation and global dynamic receptive field, characterized in that it comprises: a concatenation module, used to concatenate the image to be inpainted with a binary mask of the image to be inpainted to obtain concatenated data; a downsampling module, used to perform multi-scale downsampling on the concatenated data based on an encoder to obtain a multi-scale encoded tensor of the concatenated data, wherein, during the downsampling process of the encoder, local feature extraction and global semantic modeling are performed on an intermediate encoded tensor of the multi-scale encoded tensor based on a hybrid receptive field module, the hybrid receptive field module comprising an adaptive large-kernel convolutional unit and a multi-head attention mechanism unit connected in cascade, the adaptive large-kernel convolutional unit being used to extract local adaptive neighborhood features of the intermediate encoded tensor based on static convolution and dynamic deformable convolution so as to adapt to the missing regions indicated by the binary mask, and the multi-head attention mechanism unit being used to perform long-distance dependency modeling, based on a self-attention mechanism, on the local adaptive neighborhood features output by the adaptive large-kernel convolutional unit to generate global context features; and a repair module, used to upsample and reconstruct the multi-scale encoded tensor based on a decoder to obtain a reconstructed image of the image to be inpainted.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the image inpainting method integrating local adaptation and global dynamic receptive field according to any one of claims 1 to 6 is implemented.

9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the image inpainting method integrating local adaptation and global dynamic receptive field according to any one of claims 1 to 6 is implemented.

10. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, the image inpainting method integrating local adaptation and global dynamic receptive field according to any one of claims 1 to 6 is implemented.