Image inpainting method and system based on context consistent multi-scale feature fusion

By adopting an image inpainting method based on context-consistent multi-scale feature fusion, the limitations of Transformer and Mamba architectures in terms of computational complexity and global modeling are solved, achieving efficient and stable image inpainting results and improving the quality and consistency of image inpainting.

CN122243825A · Pending · Publication Date: 2026-06-19 · YANTAI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: YANTAI UNIV
Filing Date: 2026-05-20
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

The existing Transformer architecture has high computational complexity when processing long sequence data, resulting in high computational costs. Furthermore, the Mamba architecture has limitations in parallel processing and global context modeling, making it difficult to generate stable, high-quality image restoration results and failing to meet global consistency requirements.

Method used

An image inpainting method based on context-consistent multi-scale feature fusion is adopted. By automatically encoding through masking to generate intermediate feature maps and attention scores, and combining multi-scale feature decoders and an improved adaptive dual-feature fusion mechanism, hierarchical multi-scale feature enhancement is performed to achieve accurate fusion of local texture and global semantics.

Benefits of technology

It effectively reduces the computational cost of the model, improves the quality and stability of the restoration, and can generate high-quality image restoration results in different mask ratios and cross-dataset scenarios, while maintaining the semantic coherence and contextual consistency of the image.

✦ Generated by Eureka AI based on patent content.

[Figure CN122243825A_ABST]

Abstract

This invention belongs to the field of image processing technology and specifically relates to an image inpainting method and system based on context-consistent multi-scale feature fusion. The method includes: constructing a mask image for visual representation learning, extracting contextual prior features, injecting positional information, and performing multi-scale upsampling to obtain multi-scale prior representations; fusing the mask image with the multi-scale prior representation to obtain a first output feature, applying hierarchical multi-scale feature enhancement to that feature together with the average attention weights, and stacking this enhancement for a set number of levels to obtain backbone features; fusing each backbone feature with the multi-scale prior representation to obtain a new backbone feature; concatenating the new backbone features, performing channel alignment and dimensionality reduction to obtain refined features, and then applying hierarchical multi-scale feature enhancement to obtain an output feature; and using the output feature as the final inpainted image. Through context-consistent multi-scale feature fusion, the invention improves image inpainting performance.

Description

Technical Field

[0001] This invention belongs to the field of image processing technology, specifically relating to an image restoration method and system based on context-consistent multi-scale feature fusion.

Background Technology

[0002] The core objective of image inpainting is to reconstruct missing parts of an image by generating semantically coherent and visually harmonious content, making it blend seamlessly with the surrounding area. This technology has become an important image editing tool, widely used for object removal and region restoration. Its technological development has evolved from traditional methods to advanced deep learning solutions. Among these, the Transformer architecture stands out due to its powerful attention mechanism and flexible model architecture, demonstrating significant advantages in complex scenarios such as multimodal learning, and has become one of the mainstream technological directions in the current field of image inpainting.

[0003] Despite the excellent performance of the Transformer architecture, its core self-attention mechanism has a fundamental limitation: its computational complexity grows quadratically with the length of the input sequence. When processing long sequences, computational resource consumption rises sharply, which severely limits its application in real-world scenarios that require large-scale or high-resolution data processing. To overcome this efficiency bottleneck, the emerging Mamba architecture builds on a State Space Model (SSM) whose computational complexity is linear in the sequence length, significantly improving efficiency. However, Mamba's autoregressive design inherently limits its parallel processing and global context modeling capabilities: it lacks sufficient receptive field coverage and a coherent understanding of the overall image structure. It therefore struggles to generate stable, high-quality results in visual inpainting tasks and fails to meet the technical requirement for high global consistency.
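The scaling argument above can be made concrete with a small arithmetic sketch. The patch size of 16 and the helper names are assumptions chosen for illustration, and the counts are idealized operation counts rather than measured costs of any real model.

```python
def tokens_for_image(side: int, patch: int = 16) -> int:
    """Number of tokens when a square image is cut into patch x patch tiles (assumed tokenization)."""
    return (side // patch) ** 2


def attention_pairwise_ops(n_tokens: int) -> int:
    """Full self-attention compares every token with every token: n^2 interactions."""
    return n_tokens * n_tokens


def scan_ops(n_tokens: int) -> int:
    """A state-space scan updates its hidden state once per token: n steps."""
    return n_tokens


for side in (256, 512, 1024):
    n = tokens_for_image(side)
    print(side, n, attention_pairwise_ops(n), scan_ops(n))
```

Under these assumptions, doubling the image side multiplies the token count by 4, the attention interaction count by 16, and the scan step count by 4.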

[0004] Therefore, there is a need for image inpainting methods and systems based on context-consistent multi-scale feature fusion.

Summary of the Invention

[0005] The purpose of this invention is to provide an image inpainting method and system based on context-consistent multi-scale feature fusion, so as to solve the shortcomings and other problems existing in the prior art.

[0006] To achieve the above objectives, the technical solution of the present invention is as follows. The image inpainting method based on context-consistent multi-scale feature fusion includes the following steps:
Step S1. Obtain a mask image from the original image and the corresponding mask map, extract features from the mask image to obtain an intermediate feature map, and obtain the average attention weights at the same time.
Step S2. After fusing position information into the intermediate feature map, perform stepwise upsampling to obtain the first multi-scale prior representation; perform stepwise upsampling on the first multi-scale prior representation to obtain the second multi-scale prior representation, and so on, until the Nth multi-scale prior representation is obtained.
Step S3. Fuse the mask image with the Nth multi-scale prior representation to obtain the first output feature, and perform hierarchical multi-scale feature enhancement on the first output feature and the average attention weights to obtain the first backbone feature. The hierarchical multi-scale feature enhancement is repeated according to a set number of levels and proceeds as follows: the first output feature and the average attention weights are taken as input; the average attention weights are reshaped, and patches are extracted through an unfolding operation; grouped transposed convolution is then performed, and the result is processed by a gated convolutional feedforward network to obtain the final local enhancement feature; the final local enhancement feature and the average attention weights are then processed by the prior fusion state space model and by a gated convolutional feedforward network.
Step S4. Fuse the first backbone feature with the (N-1)th multi-scale prior representation to obtain the first intermediate feature, take the first intermediate feature as the new first output feature, and execute S3 to obtain the second backbone feature; continue in this manner until the Nth backbone feature is obtained.
Step S5. Concatenate the Nth backbone feature with the (N-1)th backbone feature and refine the result to obtain the first result feature; concatenate the first result feature with the (N-2)th backbone feature and refine the result to obtain the second result feature; continue in this manner until the (N-1)th result feature is obtained, which serves as the final inpainted image.
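The stacking order in steps S3 to S5 can be easier to follow as a symbolic trace. The sketch below only records which features are combined at each step; the names `fuse`, `enhance`, `refine`, and `concat` are hypothetical labels for the operations described in the steps, not functions defined by the patent.

```python
def inpainting_schedule(n_levels: int) -> list[str]:
    """Return a symbolic trace of steps S3-S5 for N = n_levels prior scales."""
    trace = []
    backbones = []
    # S3: the first output feature fuses the mask image with the Nth prior.
    output = f"fuse(mask_image, prior[{n_levels}])"
    for level in range(1, n_levels + 1):
        backbone = f"backbone[{level}]"
        trace.append(f"{backbone} = enhance({output}, S)")
        backbones.append(backbone)
        # S4: fuse the newest backbone with the next coarser-indexed prior.
        prior_index = n_levels - level
        if prior_index >= 1:
            output = f"fuse({backbone}, prior[{prior_index}])"
    # S5: walk back through the backbones, concatenating and refining.
    result = backbones[-1]
    for step, level in enumerate(range(n_levels - 1, 0, -1), start=1):
        trace.append(f"result[{step}] = refine(concat({result}, backbone[{level}]))")
        result = f"result[{step}]"
    return trace


for line in inpainting_schedule(4):
    print(line)
```

With N = 4, the trace contains four enhancement steps followed by three refinement steps, and the last entry is the (N-1)th result feature.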

[0007] Furthermore, in step S3, the first output feature and the average attention weight are used as input to reshape the average attention weight. At the same time, patches are extracted through an unfolding operation. Then, grouped transpose convolution is performed and processed by a gated convolutional feedforward network to obtain the final local enhanced features. The process is as follows: The first output feature and the average attention weight are used as input. Local features of each window are calculated within a non-overlapping rectangular window. The average attention weight is reshaped and patches are extracted from each spatial location using an unfolding operation. Grouped transpose convolution is performed between the reshaped average attention weight and the patch tensor. The final output is the result of connecting all attention enhancement features along the channel dimension and adding the transpose convolution, which yields the final local enhancement feature.

[0008] Furthermore, the process described in step S3, which involves prior fusion of the final local enhancement features and average attention weights in the state space and then processing them through a gated convolutional feedforward network, is as follows: A prior fusion state space model is established and a parallel branch consisting of convolution and activation functions is constructed. The final local enhancement features and average attention weights are used as inputs. The concatenation and fusion results are combined with the final local enhancement features through weighted residual connections. The outputs of the two branches are concatenated and passed through a linear projection layer and then processed by a gated convolutional feedforward network to finally obtain the first backbone features.

[0009] Furthermore, the process of fusing the mask image with the fourth-scale prior representation in step S3 is as follows: The mask image and the fourth multi-scale prior representation are connected along the channel dimension and passed through a lightweight convolutional network to generate a weight map. At the same time, the concatenated features are processed by convolution to generate a fused feature map. The fused feature map is combined with the previous features or backbone features through weighted residual connections to obtain the first output feature.

[0010] Furthermore, the feature refinement in step S5 includes alignment and dimensionality reduction processing and hierarchical multi-scale feature enhancement processing; the alignment and dimensionality reduction processing receives the concatenated features and performs convolution operations to complete channel alignment and dimensionality reduction, and after the feature dimensions are normalized, it is combined with the average attention weights for hierarchical multi-scale feature enhancement.

[0011] Furthermore, the gated convolutional feedforward network processing in step S3 is as follows: First, the two features are concatenated and transformed in terms of channel dimension. Then, the local spatial information is mixed and interacted through depthwise separable convolution. Subsequently, the fused features are adaptively filtered using a gating mechanism, and then restored to the original channel dimension through dimension reduction convolution to complete feature processing.

[0012] Furthermore, in step S2, the position information is fused as follows: position embedding is applied to the intermediate feature map, and the output of the position embedding is concatenated with the intermediate feature map.

[0013] An image inpainting system based on context-consistent multi-scale feature fusion includes:
A mask autoencoder module, which obtains a mask image from the original image and the corresponding mask map, extracts features from the mask image to obtain an intermediate feature map, and obtains the average attention weights.
A multi-scale feature decoder module, which fuses position information into the intermediate feature map and then performs stepwise upsampling to obtain the first multi-scale prior representation; the first multi-scale prior representation is upsampled stepwise to obtain the second multi-scale prior representation, and so on, until the Nth multi-scale prior representation is obtained.
An improved fusion and hierarchical multi-scale enhancement module, which fuses the mask image with the Nth multi-scale prior representation to obtain the first output feature and performs hierarchical multi-scale feature enhancement on the first output feature and the average attention weights to obtain the first backbone feature. The hierarchical multi-scale feature enhancement is repeated according to a set number of levels and proceeds as follows: the first output feature and the average attention weights are taken as input; the average attention weights are reshaped, and patches are extracted through an unfolding operation; grouped transposed convolution is then performed, and the result is processed by a gated convolutional feedforward network to obtain the final local enhancement feature; the final local enhancement feature and the average attention weights are then processed by the prior fusion state space model and by a gated convolutional feedforward network.
A stacked feature enhancement module, which fuses the first backbone feature with the (N-1)th multi-scale prior representation to obtain the first intermediate feature and takes the first intermediate feature as the new first output feature; the improved fusion and hierarchical multi-scale enhancement module is then executed to obtain the second backbone feature, and so on, until the Nth backbone feature is obtained.
An image restoration and reconstruction module, which concatenates the Nth backbone feature with the (N-1)th backbone feature and refines the result to obtain the first result feature; the first result feature is concatenated with the (N-2)th backbone feature and refined to obtain the second result feature; this process is repeated until the (N-1)th result feature is obtained, which is used as the final restored image.

[0014] Furthermore, the final local enhancement feature module is constructed in the improved fusion and hierarchical multi-scale enhancement module. The first output feature and the average attention weight are used as input. The local features of each window are calculated within a non-overlapping rectangular window. The average attention weight is reshaped, and a patch is extracted from each spatial location using an unfolding operation. A grouped transposed convolution is performed between the reshaped average attention weight and the patch tensor. The final output is the result of connecting all attention enhancement features along the channel dimension and adding the transposed convolution.

[0015] Furthermore, the first backbone feature module is constructed in the improved fusion and hierarchical multi-scale enhancement module. A prior fusion state space model and a parallel branch composed of convolution and activation functions are established. The final local enhancement features and average attention weights are used as inputs. The splicing fusion result is combined with the final local enhancement features through weighted residual connections. The outputs of the two branches are connected and passed through a linear projection layer, and then processed by a gated convolutional feedforward network.

[0016] Compared with the prior art, the technical solution provided by this invention has the following advantages: 1. This invention introduces masking auto-encoding to generate intermediate feature maps and attention scores as context guidance, which not only effectively avoids the randomness of the diffusion model, but also ensures the semantic coherence and contextual consistency between the repaired region and the original region.

[0017] This invention injects and fuses spatial location information into intermediate feature maps. This process not only preserves the core information of prior features but also provides rich multi-scale spatial priors for subsequent modules. The multi-scale feature decoder provides multi-scale spatial priors and improves scale adaptability by progressively upsampling through transposed convolutional layers.

[0018] 2. This invention transforms single-scale prior features into multi-scale features, and then combines this with improved adaptive dual-feature fusion dynamic gating processing to adaptively balance the contributions of multi-source features, thereby achieving accurate fusion of local texture and global semantics and improving the repair performance in complex occlusion scenarios. Moreover, after the intermediate feature mapping is transformed into multi-scale prior features, it provides dimensional adaptation contextual guidance for hierarchical multi-scale feature enhancement processing. At the same time, by guiding the fusion of local contextual attention and the output features of Mamba prior fusion, the model's ability to perceive occluded regions of images is further enhanced.

[0019] By improving the adaptive gating fusion mechanism of adaptive dual-feature fusion and the heterogeneous feature capture and fine-tuning of hierarchical multi-scale feature enhancement, combined with the cyclic injection and stacking enhancement of multi-level prior features, efficient integration of local details and global priors is achieved, ultimately generating fusion features with strong representational capabilities to improve the repair effect.

[0020] 3. This invention enhances the expressive power of local structures by using a grouping and transposition mechanism guided by attention weights to achieve dynamic weighted aggregation of spatial context information.

[0021] Through hierarchical multi-scale feature enhancement, this invention combines the linear-complexity modeling of Mamba with the context-aware capabilities of the Transformer. While maintaining restoration quality, it significantly reduces the computational cost of the model and thus achieves efficient inference. It demonstrates excellent stability and generalization across different mask ratios and cross-dataset scenarios, and it accurately preserves key identity and structural information of the image, further improving restoration quality.

Attached Figure Description

[0022] Figure 1 is a comparison diagram before and after repair obtained by the present invention, wherein (a) is a schematic diagram before repair of the first group of embodiments of the present invention; (b) is a schematic diagram after repair of the first group of embodiments of the present invention; (c) is a schematic diagram before repair of the second group of embodiments of the present invention; and (d) is a schematic diagram after repair of the second group of embodiments of the present invention.

Detailed Implementation

[0023] To further understand the content of this invention, the invention will be described in detail with reference to the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0024] Referring to Figure 1, this embodiment provides an image inpainting method based on context-consistent multi-scale feature fusion, including the following steps: S1. Obtain the mask image from the original image and the corresponding mask map, extract features from the mask image to obtain the intermediate feature map, and obtain the average attention weights at the same time.

[0025] Specifically, given a set of original images, each mask image is obtained by merging an original image with its corresponding mask map. The mask map consists of pixels with values 1 and 0: 0 represents the masked area (usually the part that needs to be processed or ignored), and 1 represents the non-masked area (the valid region of the original image). Note that the size of the original image must be strictly consistent with that of the mask map to ensure precise pixel alignment during merging.
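As a minimal sketch of this merging step, assuming the merge is an element-wise product (the paragraph does not name the operation explicitly), the mask image can be formed as follows. The function name `apply_mask` and the toy arrays are illustrative only.

```python
import numpy as np


def apply_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out masked pixels. image: (H, W, C); mask: (H, W) with 1 = keep, 0 = masked."""
    if image.shape[:2] != mask.shape:
        # The sizes must match exactly so that pixels stay aligned.
        raise ValueError("image and mask map must have the same height and width")
    return image * mask[..., None]


image = np.arange(2 * 3 * 3, dtype=np.float32).reshape(2, 3, 3) + 1.0
mask = np.array([[1, 0, 1],
                 [0, 1, 1]], dtype=np.float32)
masked = apply_mask(image, mask)
print(masked[0, 1], masked[0, 0])
```

Pixels where the mask map is 0 become zero in every channel, while pixels where it is 1 keep their original values.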

[0026] On this basis, the original image is converted into the corresponding mask image and processed by a Masked Autoencoder (MAE), and visual representation learning is carried out on the mask image as follows. The core idea of masked autoencoding is to randomly mask most of the input image, with the masked input here constructed from the merged mask image; the model is then trained to reconstruct the missing pixels, thereby learning an efficient visual representation.

[0027] On this basis, leveraging the MAE's strong contextual reasoning capabilities, its intermediate feature map $PF \in \mathbb{R}^{H \times W \times C}$ (H is the feature height, i.e., the number of feature units in the vertical direction; W is the feature width, i.e., the number of feature units in the horizontal direction; C is the number of feature channels) and its average attention weight $S \in \mathbb{R}^{n \times n}$ (n is the number of tokens in the MAE input sequence) are used as the context prior. The intermediate feature map and the average attention weight are computed as follows:

$$PF = \mathcal{L}(X), \qquad S = \frac{1}{N_l N_h} \sum_{i=1}^{N_l} \sum_{j=1}^{N_h} A_{i,j}$$

where $X$ is the input to the MAE, i.e., the mask image or features derived from it; $\mathcal{L}(\cdot)$ is the forward mapping of the intermediate modules of the MAE encoder; $N_l$ is the total number of layers in the MAE encoder; $N_h$ is the number of heads in a single attention layer; and $A_{i,j}$ is the attention weight matrix of the j-th head in the i-th layer.
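The averaging of attention weights over layers and heads can be sketched in a few lines. The array layout (layers, heads, n, n) and the random attention maps are assumptions for illustration; a real MAE encoder would supply the per-head attention matrices.

```python
import numpy as np


def average_attention(attn: np.ndarray) -> np.ndarray:
    """Average attention maps over layers and heads.

    attn: (layers, heads, n, n), each (n, n) slice row-normalized.
    Returns an (n, n) matrix.
    """
    return attn.mean(axis=(0, 1))


rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 5, 5))            # 3 layers, 4 heads, 5 tokens (toy sizes)
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)          # softmax over the key axis
S = average_attention(attn)
print(S.shape, S.sum(axis=-1))
```

Because each input map is row-normalized, the averaged map is row-normalized as well, so it can still be read as a distribution over tokens.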

[0028] This invention introduces mask auto-encoding to generate intermediate feature maps and attention scores as context guidance, which not only effectively avoids the randomness of the diffusion model, but also ensures the semantic coherence and contextual consistency between the repaired region and the original region.

[0029] S2. After fusing location information with intermediate feature mapping, perform stepwise upsampling to obtain the first multi-scale prior representation. Then, perform stepwise upsampling on the first multi-scale prior representation to obtain the second multi-scale prior representation, and so on, to obtain the Nth multi-scale prior representation.

[0030] Specifically, the mask image is fed into the mask autoencoder, and its intermediate feature map (PF) and average attention weight (S) are used as contextual priors. To enhance the spatial awareness of the features, position information is fused into the intermediate feature map as follows: position embedding is applied to PF to inject spatial position information into the features, and the output of the position embedding is then concatenated with the original PF. This retains the core information of the original features while supplementing position information, thereby enhancing the spatial perception ability of the features.
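A minimal sketch of this fusion, assuming the position embedding is an additive table of the same shape as PF (the text does not specify the embedding type), might look like this. The names `fuse_position` and `pos` are hypothetical.

```python
import numpy as np


def fuse_position(pf: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """Embed positions into PF, then concatenate with the original PF along channels.

    pf, pos: (H, W, C). Returns (H, W, 2C).
    """
    embedded = pf + pos                       # assumed additive position embedding
    return np.concatenate([embedded, pf], axis=-1)


H, W, C = 4, 4, 8
rng = np.random.default_rng(0)
pf = rng.normal(size=(H, W, C))
pos = rng.normal(size=(H, W, C)) * 0.02      # stand-in for a learned embedding table
fused = fuse_position(pf, pos)
print(fused.shape)
```

Under this assumption, the second half of the channels carries the untouched PF, which is how the original feature information is retained alongside the position-aware copy.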

[0031] After position information enhancement, the concatenated features are input into a multi-scale feature decoder for step-by-step upsampling to generate features at multiple spatial scales, where $P_k$ denotes the multi-scale prior feature output by the k-th upsampling stage. The multi-scale feature decoder consists of transposed convolutional layers, each of which halves the number of channels, so that multi-scale prior representations are obtained progressively, from low-resolution, semantically rich features to high-resolution, detail-enhancing features, which effectively improves multi-scale detection capability. The upsampling process is:

$$P_k = \mathrm{Act}\big(\mathrm{Norm}(\mathrm{ConvT}_k(P_{k-1}))\big)$$

where $\mathrm{Act}$ denotes the activation function, $\mathrm{Norm}$ denotes a normalization operation, and $\mathrm{ConvT}_k$ denotes the k-th transposed convolutional layer.
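The stepwise decoder can be sketched with a simple transposed convolution. The kernel size 2 with stride 2, the channel-wise normalization, and the ReLU activation are all assumptions chosen to keep the example short; the text only states that each transposed convolutional layer halves the channel count.

```python
import numpy as np


def conv_transpose_2x(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Transposed convolution with kernel 2 and stride 2 (non-overlapping).

    x: (H, W, C_in), w: (2, 2, C_in, C_out) -> (2H, 2W, C_out).
    """
    h, wd, _ = x.shape
    y = np.einsum("ijc,abco->iajbo", x, w)    # each input pixel paints a 2x2 block
    return y.reshape(2 * h, 2 * wd, w.shape[-1])


def channel_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each pixel's channel vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def decode(x: np.ndarray, weights: list) -> list:
    """Apply the upsampling stages in order and collect every stage output."""
    priors = []
    for w in weights:
        x = np.maximum(channel_norm(conv_transpose_2x(x, w)), 0.0)
        priors.append(x)
    return priors


rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4, 16))
weights = [rng.normal(size=(2, 2, c, c // 2)) * 0.1 for c in (16, 8, 4)]
priors = decode(x0, weights)
print([p.shape for p in priors])
```

Each stage doubles the spatial resolution and halves the channels, so the collected outputs form a pyramid of priors from coarse to fine.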

[0032] In this embodiment of the invention, after the intermediate feature mapping fuses the location information, it is subjected to step-by-step upsampling processing to obtain the first multi-scale prior representation. The first multi-scale prior representation is subjected to step-by-step upsampling processing to obtain the second multi-scale prior representation, and so on, to obtain the third multi-scale prior representation and the fourth scale prior representation.

[0033] This invention injects and fuses spatial location information into intermediate feature maps. This process not only preserves the core information of prior features but also provides rich multi-scale spatial priors for subsequent modules. The multi-scale feature decoder provides multi-scale spatial priors and improves scale adaptability by progressively upsampling through transposed convolutional layers.

[0034] S3. The mask image is fused with the Nth multi-scale prior representation to obtain the first output feature. The first output feature is then enhanced with the average attention weight in a hierarchical multi-scale feature manner to obtain the first backbone feature.

[0035] Specifically, the mask image and the Nth multi-scale prior representation are fed into an improved adaptive dual-feature fusion (ADFusion) process designed for feature fusion. This process is used both to fuse multi-scale prior features with input features and to fuse local contextual attention features with Mamba prior fusion features.

[0036] Furthermore, the improved adaptive dual-feature fusion process integrates features from two different sources through an adaptive gating fusion mechanism. The core fusion logic is as follows. Let the current input feature be $X$ and the previous feature be $P$, for example a multi-scale prior feature or the output of a previous ADFusion. $X$ and $P$ are concatenated along the channel dimension (denoted $[X, P]$) and passed through a lightweight convolutional network to generate a weight map $W$. This map controls the fusion strength and thus acts as a gate; its activation function is the Sigmoid function $\sigma$. Meanwhile, the concatenated features $[X, P]$ are processed by a 1×1 convolution to generate the fused feature map $F$, which is then combined with one of the original inputs through a weighted residual connection, that is, $X$ and $W \odot F$ are fused. While introducing controlled fusion information, this preserves the continuity of backbone features, adaptively balances contributions from different sources, and enhances overall representational ability. The process is:

$$W = \sigma\big(\mathrm{Conv}([X, P])\big), \qquad F = \mathrm{Conv}_{1 \times 1}([X, P]), \qquad Y = X + W \odot F$$

where $X$ denotes the current input feature; $P$ denotes the previous feature or the output of a previous ADFusion; $\sigma$ denotes the Sigmoid function; $\odot$ denotes element-wise multiplication; $[\cdot,\cdot]$ denotes the concatenation of feature maps; $F$ denotes the intermediate enhanced feature map; and $Y$ denotes the output residual fusion result (i.e., the ADFusion fusion feature).
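The gating logic of this fusion can be sketched as follows. Two simplifications are assumed: the lightweight gate network is reduced to a single 1×1 convolution, and the residual is taken on the current input feature. A 1×1 convolution is written as a per-pixel matrix product, and all names are illustrative.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def adfusion(x: np.ndarray, p: np.ndarray, w_gate: np.ndarray, w_fuse: np.ndarray) -> np.ndarray:
    """Gated fusion of current feature x and previous feature p.

    x, p: (H, W, C); w_gate, w_fuse: (2C, C), acting as 1x1 convolutions.
    """
    cat = np.concatenate([x, p], axis=-1)     # (H, W, 2C)
    gate = sigmoid(cat @ w_gate)              # weight map in (0, 1), controls fusion strength
    fused = cat @ w_fuse                      # fused feature map
    return x + gate * fused                   # weighted residual connection


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 8))
p = rng.normal(size=(4, 4, 8))
w_gate = rng.normal(size=(16, 8)) * 0.1
w_fuse = rng.normal(size=(16, 8)) * 0.1
out = adfusion(x, p, w_gate, w_fuse)
print(out.shape)
```

With this form, a gate close to zero leaves the input feature nearly unchanged, which is one way to read the statement that the continuity of backbone features is preserved.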

[0037] The first output feature and the average attention weight are used as input. The average attention weight is reshaped, and patches are extracted by unfolding. Then, grouped transpose convolution is performed and processed by a gated convolutional feedforward network to obtain the final local enhancement feature. The final local enhancement feature and the average attention weight are processed by prior fusion state space and then processed by a gated convolutional feedforward network to obtain the first backbone feature.

[0038] Specifically, the first output feature is combined with the average attention weight S for hierarchical multi-scale feature enhancement, completing global feature modeling and enhancement. The hierarchical multi-scale feature enhancement process utilizes a dynamic weight learning mechanism, which consists of two parts: local contextual attention and Mamba prior fusion. The hierarchical multi-scale feature enhancement process iterates according to a set number of levels, as shown below: With local contextual attention and Mamba prior fusion as the core, local image details and multi-scale global prior information are captured respectively to form two heterogeneous features. The two features are refined and gated through a gated convolutional feedforward network. Then, the two features processed by the gated convolutional feedforward network are further processed by an improved adaptive dual feature fusion to achieve adaptive weighted fusion, which efficiently integrates local and global information and achieves the optimal integration of the two heterogeneous feature representations.

[0039] The features output by the improved adaptive dual-feature fusion process are then combined with the average attention weight S for local contextual attention processing. A grouping and transposition mechanism guided by S dynamically weights and aggregates spatial contextual information, enhancing the expressive power of local structures. The process is as follows. First, spatial attention weights are computed within non-overlapping rectangular windows to enhance local features, highlighting important information while suppressing irrelevant details, which yields the local features of each window. Horizontal and vertical windows are processed in parallel within the attention heads, and the input $X$ is divided into non-overlapping windows of size $sh \times sw$, where $sh \neq sw$ (rectangular windows) and $p$ denotes the index of the local window. The process is expressed as:

$$Q_p = X_p W^Q, \qquad K_p = X_p W^K, \qquad V_p = X_p W^V$$

$$Y_p = \mathrm{Softmax}\!\left(\frac{Q_p K_p^{\top}}{\sqrt{d}} + B\right) V_p$$

where $W^Q$, $W^K$, and $W^V$ are learnable linear projection matrices; $Q_p$, $K_p$, and $V_p$ are the query, key, and value features of the p-th window; $Y_p$ is the output feature of the p-th window after the attention mechanism; $B$ is the dynamic relative position encoding; Softmax generates the attention weight matrix within the local window; $1/\sqrt{d}$ is a scaling factor; and the set of $Y_p$ represents the local features computed separately for each window after local contextual attention splits the input into multiple rectangular windows.
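Attention restricted to non-overlapping rectangular windows can be sketched as below. This is a single-head version with an optional additive bias standing in for the relative position encoding; the window sizes, projection shapes, and function names are assumptions for illustration.

```python
import numpy as np


def window_partition(x, sh, sw):
    """(H, W, C) -> (num_windows, sh*sw, C) with non-overlapping sh x sw windows."""
    H, W, C = x.shape
    x = x.reshape(H // sh, sh, W // sw, sw, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, sh * sw, C)


def window_merge(win, H, W, sh, sw):
    """Inverse of window_partition."""
    C = win.shape[-1]
    x = win.reshape(H // sh, W // sw, sh, sw, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def window_attention(x, wq, wk, wv, sh, sw, bias=None):
    """Scaled dot-product attention computed independently inside each window."""
    H, W, _ = x.shape
    win = window_partition(x, sh, sw)
    q, k, v = win @ wq, win @ wk, win @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    if bias is not None:
        scores = scores + bias                # stand-in for relative position encoding
    return window_merge(softmax(scores) @ v, H, W, sh, sw)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 6))
wq, wk, wv = (rng.normal(size=(6, 6)) * 0.3 for _ in range(3))
out = window_attention(x, wq, wk, wv, sh=2, sw=4)   # sh != sw: rectangular windows
print(out.shape)
```

Because each window is processed independently, changing a pixel affects only the outputs inside its own window; swapping `sh` and `sw` gives the other window orientation.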

[0040] Second, to incorporate spatial attention into feature aggregation, a new mechanism is defined that uses the average attention weight S (b is the number of groups, used as a dimension parameter; S is reshaped into a grouped transposed convolution kernel). Grouped transposed convolution guides the extraction and fusion of local context. First, the average attention weights are reshaped, where L denotes the sequence length. Note that this reshaping rearranges the dimensions of S to form a dynamic convolution kernel.

[0041] Then, an unfolding operation extracts a k×k patch from each spatial location in V, where k is the spatial size of the patch. Here V is the value obtained by direct projection and does not require window partitioning. Next, grouped transposed convolution is performed between the reshaped attention scores and the patch tensor. This operation serves as a weighted aggregation mechanism in which the attention weights adjust the contribution of each local region: the attention weights act as convolution kernels for a transposed convolution over the feature maps, and grouping ensures that each batch is processed independently. Simultaneously, intermediate outputs are obtained from the auxiliary branches. The final local enhanced feature Y is formed by concatenating all attention-enhanced features along the channel dimension and adding the result of the transposed convolution, where GroupedConvT denotes the grouped transposed convolution operation.
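The idea of using attention weights as a per-location kernel over unfolded patches can be illustrated with a simplified stand-in. The sketch below replaces the grouped transposed convolution with a plain weighted sum over each k×k neighborhood, so it shows the weighted-aggregation principle rather than the exact operator; zero padding and the weight layout are assumptions.

```python
import numpy as np


def unfold(v: np.ndarray, k: int) -> np.ndarray:
    """Extract the k x k neighborhood around every pixel (zero padded).

    v: (H, W, C) -> (H, W, k*k, C), neighbors ordered row by row.
    """
    H, W, _ = v.shape
    pad = k // 2
    vp = np.pad(v, ((pad, pad), (pad, pad), (0, 0)))
    shifts = [vp[i:i + H, j:j + W] for i in range(k) for j in range(k)]
    return np.stack(shifts, axis=2)


def aggregate(patches: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted sum of each neighborhood. weights: (H, W, k*k), one kernel per pixel."""
    return np.einsum("hwk,hwkc->hwc", weights, patches)


rng = np.random.default_rng(0)
v = rng.normal(size=(5, 6, 3))
patches = unfold(v, 3)

logits = rng.normal(size=(5, 6, 9))                      # stand-in for reshaped attention weights
weights = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
y = aggregate(patches, weights)
print(patches.shape, y.shape)
```

A kernel that puts all its weight on the center neighbor reproduces the input, while any other kernel blends in the surrounding context in proportion to the supplied weights.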

[0042] This invention enhances the expressive power of local structures by using a grouping and transposition mechanism guided by attention weight S to achieve dynamic weighted aggregation of spatial context information.

[0043] In Mamba prior fusion, a prior fusion state space model is introduced that integrates a spatial attention-aware feature refinement mechanism to enhance Mamba's sequence modeling capability and improve global structure modeling. The prior fusion state space model takes the average attention weights as input and applies the operation defined by the accompanying formula. It then performs weighted aggregation over the block-level features, ensuring that each position in feature X receives context-dependent information. Block-level features are small local windows cut from the feature map by the Unfold operation and are used for local context aggregation. The subsequent S6 module is the core of Mamba, providing a global receptive field, dynamic weights, and linear complexity. Simultaneously, a parallel branch consisting of a convolution and a SiLU activation function is introduced to mitigate the potential loss of local information inherent in sequence modeling. Finally, the outputs of the two branches are concatenated and passed through a linear projection layer. In the formulas describing this process, Linear(·,·) denotes a linear layer whose two arguments are the input and output dimensions. Linear(C, C/2) is the input dimensionality reducer, responsible for splitting the input features into two low-dimensional branches to reduce the burden of subsequent parallel processing. Linear(C/2, C) is the output dimensionality increaser, responsible for fusing the features of the two branches and restoring them to the original dimension, which closes the loop of the entire module. The two intermediate features are the parallel branches split from the same feature in the Mamba parallel branch fusion module, and their concatenation is the fusion result. SiLU(*) is the SiLU activation function; Conv(*) and Concat(*,*) denote one-dimensional convolution and concatenation, respectively; S6 is the core of Mamba, used to achieve efficient long-range spatial modeling, dynamic weight selection, and global receptive field capture with linear complexity.
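As a rough illustration of the two-branch structure described above, the following pure-Python sketch runs a one-dimensional convolution followed by SiLU as the local branch and concatenates it with the output of a global branch. Several things here are assumptions rather than details from the source: the S6 module is replaced by a caller-supplied stand-in function, the Linear(C, C/2) and Linear(C/2, C) projections are omitted, and the names `silu`, `conv1d_same`, and `two_branch_fusion` are hypothetical.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def conv1d_same(seq, kernel):
    """Zero-padded 1D cross-correlation that keeps the sequence length.

    Assumption: the kernel length is odd.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(len(seq))
    ]


def two_branch_fusion(seq, kernel, global_branch):
    """Concatenate a global branch with a Conv + SiLU local branch.

    global_branch is a hypothetical stand-in for the S6 module.
    """
    local = [silu(v) for v in conv1d_same(seq, kernel)]
    return list(global_branch(seq)) + local


identity_kernel = [0.0, 1.0, 0.0]
fused = two_branch_fusion([0.0, 0.0], identity_kernel, lambda s: list(s))
```

In a full model the concatenated result would then pass through the output projection; the sketch stops at the concatenation because the source does not reproduce the exact formulas.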

[0044] This invention transforms single-scale prior features into multi-scale features using the designed multi-scale feature decoder. Combined with the dynamic gating of the improved adaptive dual-feature fusion, it adaptively balances the contributions of multiple feature sources, achieves accurate fusion of local texture and global semantics, and improves restoration performance in complex occlusion scenarios. Furthermore, once the multi-scale feature decoder has converted the intermediate feature maps into multi-scale prior features, these provide dimension-adapted contextual guidance for the hierarchical multi-scale feature enhancement. At the same time, by guiding the fusion of the output features of local contextual attention and Mamba prior fusion, they further enhance the model's ability to perceive occluded regions in images.

[0045] Through the adaptive gated fusion mechanism of the improved adaptive dual-feature fusion and the heterogeneous feature capture and refinement of the hierarchical multi-scale feature enhancement, combined with the cyclic injection and stacked enhancement of multi-level prior features, local details and global priors are integrated efficiently, ultimately generating fused features with strong representational capability that improve the restoration result.

[0046] The output features of local contextual attention and Mamba prior fusion are further processed by a gated convolutional feedforward network and then undergo dynamic feature fusion to gradually enhance the multi-scale representation capability.

[0047] Specifically, the gated convolutional feedforward network first concatenates the two feature streams along the channel dimension and expands the channel dimension. Depthwise separable convolution is then used to mix local spatial information. Subsequently, a gating mechanism adaptively filters the fused features, suppressing redundant information and enhancing effective feature representation. A dimensionality-reduction convolution then restores the original channel dimension and refines the features. Building on the gated convolutional feedforward network, dynamic feature fusion is further applied so that effective features of different scales and dimensions are adaptively weighted and integrated, gradually enhancing the multi-scale representation capability and mining features in depth. In the formula for the gated adaptive filtering, ⊙ denotes the Hadamard product (element-wise multiplication), one operand is the feature gating function, and the other is the output feature of the depthwise separable convolution.
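The gated filtering step can be sketched as an element-wise product between a gate and a feature vector. The source does not reproduce the formula, so two choices below are assumptions made only for illustration: the depthwise convolution output is split into a gate half and a value half, and the gating function is a sigmoid. The name `gated_filter` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_filter(dw_out):
    """Hadamard product of gate(first half) and the second half of dw_out.

    Assumptions: the channel vector has even length, the first half feeds the
    gate, and the gate is a sigmoid. None of these is confirmed by the source.
    """
    if len(dw_out) % 2 != 0:
        raise ValueError("expected an even number of channels")
    half = len(dw_out) // 2
    gate_half, value_half = dw_out[:half], dw_out[half:]
    return [sigmoid(g) * v for g, v in zip(gate_half, value_half)]


filtered = gated_filter([0.0, 0.0, 2.0, 4.0])
```

A gate input of zero lets half of the value through, so the example halves each value channel; strongly negative gate inputs would suppress the channel almost entirely.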

[0048] S4. Fuse the first backbone feature with the (N-1)th multi-scale prior representation to obtain the first intermediate feature. Use the first intermediate feature as the new first output feature and execute S3 to obtain the second backbone feature, and so on until the Nth backbone feature is obtained.

[0049] Specifically, the hierarchical multi-scale feature enhancement process can be repeated N times to gradually strengthen the multi-scale feature representation capability, providing high-quality fusion features for subsequent image restoration.

[0050] In addition, the multi-scale prior representations of the other levels are also combined, through improved adaptive dual-feature fusion, with the output features of the corresponding hierarchical multi-scale feature enhancement. They take part in the subsequent multi-level improved adaptive dual-feature fusion and hierarchical multi-scale feature enhancement, continuously injecting multi-scale prior information.

[0051] In this embodiment of the invention, the hierarchical multi-scale feature enhancement adopts a layered stacking structure. In the encoding stage, the four levels are stacked 2, 4, 4, and 6 times, respectively, that is, the configuration is [2, 4, 4, 6]. In the decoding stage, the three levels are stacked 4, 4, and 2 times, respectively, that is, the configuration is [4, 4, 2]. Within each level, the hierarchical multi-scale feature enhancement is repeated N times, where N is the stacking count determined by the network's level configuration.
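The stacking configuration in this paragraph can be written down as a small configuration sketch. The depth lists come from the text; the helper `build_stage_plan` and the tuple layout are hypothetical and serve only to make the per-level repetition counts explicit.

```python
ENCODER_DEPTHS = [2, 4, 4, 6]  # stacking counts of the four encoder levels
DECODER_DEPTHS = [4, 4, 2]     # stacking counts of the three decoder levels


def build_stage_plan(encoder_depths, decoder_depths):
    """Return (stage, level, depth) tuples in execution order."""
    plan = []
    for level, depth in enumerate(encoder_depths, start=1):
        plan.append(("encoder", level, depth))
    for level, depth in enumerate(decoder_depths, start=1):
        plan.append(("decoder", level, depth))
    return plan


plan = build_stage_plan(ENCODER_DEPTHS, DECODER_DEPTHS)
total_blocks = sum(depth for _, _, depth in plan)
```

Under this configuration the network contains 16 enhancement blocks in the encoder and 10 in the decoder.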

[0052] In this embodiment of the invention, when N=4, the first backbone feature is fused with the third multi-scale prior representation to obtain the first intermediate feature. The first intermediate feature is used as the new first output feature, and S3 is executed to obtain the second backbone feature. The second backbone feature is fused with the second multi-scale prior representation to obtain the second intermediate feature. The second intermediate feature is used as the new first output feature and undergoes hierarchical multi-scale feature enhancement with the average attention weights to obtain the third backbone feature. The third backbone feature is fused with the first multi-scale prior representation to obtain the third intermediate feature. The third intermediate feature is used as the new first output feature and undergoes hierarchical multi-scale feature enhancement with the average attention weights to obtain the fourth backbone feature.
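The pairing between backbone features and multi-scale prior representations in step S4 follows a simple index rule: the i-th backbone feature is fused with the (N-i)th prior. A minimal sketch of that schedule is given here; the function name `fusion_schedule` is hypothetical.

```python
def fusion_schedule(n):
    """Return (backbone_index, prior_index) pairs used in step S4.

    The i-th backbone feature is fused with the (n - i)-th multi-scale prior
    to produce the input for the (i + 1)-th backbone feature.
    """
    return [(i, n - i) for i in range(1, n)]


schedule = fusion_schedule(4)
```

For N=4 this yields the pairs (1, 3), (2, 2), and (3, 1), matching the order described in the paragraph above.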

[0053] S5. Concatenate the Nth backbone feature with the (N-1)th backbone feature and refine the feature to obtain the first result feature. Concatenate the first result feature with the (N-2)th backbone feature and refine the feature to obtain the second result feature. Continue in this manner until the (N-1)th result feature is obtained as the final repaired image.

[0054] Feature refinement includes alignment and dimensionality reduction, followed by hierarchical multi-scale feature enhancement. The hierarchical multi-scale feature enhancement is the process described in S3, with the number of stacked levels corresponding to that in S4. The specific feature refinement process is as follows. The system receives the features output by the previous level of hierarchical multi-scale feature enhancement and the features output by the level before that, and concatenates them along the channel dimension to achieve an initial fusion. The concatenated feature is fed into a 1×1 convolution (Conv1×1) to complete channel alignment and dimensionality reduction. After the feature dimensions are normalized, the feature undergoes hierarchical multi-scale feature enhancement with the average attention weights S to achieve global modeling and feature refinement, resulting in the first result feature. The first result feature is then used as part of the input for the next concatenation.
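The channel concatenation and 1×1 convolution in the refinement step can be illustrated as a per-pixel linear map across channels. This is a pure-Python sketch under simplifying assumptions: features are stored as a flat list of per-pixel channel vectors, the convolution has no bias, and the weights are chosen by hand. The names `concat_channels` and `conv1x1` are hypothetical.

```python
def concat_channels(a, b):
    """Concatenate two feature maps pixel by pixel along the channel axis."""
    if len(a) != len(b):
        raise ValueError("feature maps must have the same number of pixels")
    return [pa + pb for pa, pb in zip(a, b)]


def conv1x1(pixels, weight):
    """Apply a bias-free 1x1 convolution.

    weight has shape out_channels x in_channels and is applied to every pixel.
    """
    return [
        [sum(w * v for w, v in zip(row, px)) for row in weight]
        for px in pixels
    ]


feat_a = [[1.0, 2.0]]  # one pixel, two channels
feat_b = [[3.0, 4.0]]  # one pixel, two channels
merged = concat_channels(feat_a, feat_b)  # one pixel, four channels
# Hand-picked weights that average matching channels of the two inputs.
weight = [[0.5, 0.0, 0.5, 0.0],
          [0.0, 0.5, 0.0, 0.5]]
reduced = conv1x1(merged, weight)  # back to two channels
```

The example reduces four concatenated channels back to two, which is the alignment and dimensionality reduction role the text assigns to Conv1×1; in a trained network the weights would be learned.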

[0055] In this embodiment of the invention, when N=4, the fourth backbone feature is concatenated with the third backbone feature and refined to obtain the first result feature. The first result feature is concatenated with the second backbone feature and refined to obtain the second result feature. The second result feature is concatenated with the first backbone feature and refined to obtain the third result feature, which is the final repaired image.
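The decoding order of step S5 can be traced with a short sketch that records which two features are concatenated at each step. Feature names are plain strings and the function name `decode_schedule` is hypothetical; the refinement itself is not modeled.

```python
def decode_schedule(n):
    """Return (left_input, skip_input, result) triples for step S5."""
    steps = []
    current = f"backbone{n}"
    # Skip connections run from backbone n-1 down to backbone 1.
    for step, skip in enumerate(range(n - 1, 0, -1), start=1):
        result = f"result{step}"
        steps.append((current, f"backbone{skip}", result))
        current = result
    return steps


steps = decode_schedule(4)
final_feature = steps[-1][2]
```

For N=4 the trace ends at `result3`, the (N-1)th result feature that the text identifies as the final repaired image.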

[0056] In this embodiment of the invention, four loss functions are used for joint training, and an overall loss function is defined. This improves the quality of image restoration.

[0057] To obtain high-quality images and semantic consistency, the following four loss functions are adopted: a first loss term (its name and symbol are missing from the source text), an adversarial loss, a style loss, and a perceptual loss.

[0058] These four loss functions are used for joint training, and an overall loss function is defined. To achieve better visual quality, the final loss function is expressed as a weighted sum of the four losses, whose corresponding loss weights are set to 1, 0.1, 250, and 0.1, respectively.
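The joint objective can be sketched as a weighted sum of the four loss terms. The weight values come from the text; the assignment of weights to terms assumes that they are listed in the same order as the losses in the preceding paragraph. The key `first` is a hypothetical placeholder for the loss whose name is not legible in the source, and `total_loss` is a hypothetical helper.

```python
# Assumption: weights follow the order in which the losses are listed.
LOSS_WEIGHTS = {
    "first": 1.0,         # placeholder name for the first loss term
    "adversarial": 0.1,
    "style": 250.0,
    "perceptual": 0.1,
}


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss values."""
    return sum(weights[name] * terms[name] for name in weights)


example_terms = {"first": 1.0, "adversarial": 1.0, "style": 0.01, "perceptual": 1.0}
example_total = total_loss(example_terms)
```

The large style weight compensates for the small magnitude that style losses based on feature statistics usually have, so that all four terms contribute at a comparable scale.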

[0059] This invention demonstrates excellent stability and generalization ability across different mask ratios and cross-dataset scenarios, and can accurately preserve key identity and structural information of images, further improving the quality of image restoration.

[0060] An image inpainting system based on context-consistent multi-scale feature fusion includes the following modules. The mask autoencoder module obtains a masked image from the original image and the corresponding mask image, extracts features from the masked image to obtain an intermediate feature map, and simultaneously obtains the average attention weights. The multi-scale feature decoder module fuses position information into the intermediate feature map and then performs stepwise upsampling to obtain the first multi-scale prior representation; the first multi-scale prior representation is then upsampled stepwise to obtain the second multi-scale prior representation, and so on until the Nth multi-scale prior representation is obtained. The improved fusion and hierarchical multi-scale enhancement module fuses the masked image with the Nth multi-scale prior representation to obtain the first output feature, and then applies hierarchical multi-scale feature enhancement to the first output feature and the average attention weights to obtain the first backbone feature. The hierarchical multi-scale feature enhancement iterates for a set number of levels as follows: the first output feature and the average attention weights are taken as input; the average attention weights are reshaped while patches are extracted through an unfolding operation; a grouped transposed convolution is then performed and the result is processed by a gated convolutional feedforward network to obtain the final local enhancement feature; the final local enhancement feature and the average attention weights then undergo prior fusion state space processing, followed by a gated convolutional feedforward network. The stacked feature enhancement module fuses the first backbone feature with the (N-1)th multi-scale prior representation to obtain the first intermediate feature, uses the first intermediate feature as the new first output feature, and executes the improved fusion and hierarchical multi-scale enhancement module again to obtain the second backbone feature, and so on until the Nth backbone feature is obtained. The image restoration and reconstruction module concatenates the Nth backbone feature with the (N-1)th backbone feature and refines the feature to obtain the first result feature; the first result feature is then concatenated with the (N-2)th backbone feature and refined to obtain the second result feature. This process is repeated until the (N-1)th result feature is obtained, which is used as the final restored image.

[0061] The improved fusion and hierarchical multi-scale enhancement module includes a final local enhancement feature module. It takes the first output feature and the average attention weights as input, calculates the local features of each window within non-overlapping rectangular windows, reshapes the average attention weights, and extracts patches from each spatial location using an unfolding operation. It performs a grouped transposed convolution between the reshaped average attention weights and the patch tensor, and finally outputs the result of concatenating all attention-enhanced features along the channel dimension and adding the transposed-convolution result.

[0062] The improved fusion and hierarchical multi-scale enhancement module also includes a first backbone feature module. It establishes a prior fusion state space model and a parallel branch composed of a convolution and an activation function, takes the final local enhancement feature and the average attention weights as input, and combines the concatenation-fusion result with the final local enhancement feature through weighted residual connections. The outputs of the two branches are concatenated, passed through a linear projection layer, and then processed by a gated convolutional feedforward network.

[0063] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any other way. Any person skilled in the art may make changes or modifications to the above-disclosed technical content to create equivalent embodiments that can be applied to other fields. However, any simple modifications, equivalent changes, and modifications made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the protection scope of the present invention.

Claims

1. An image inpainting method based on context-consistent multi-scale feature fusion, characterized in that it includes the following steps: Step S1. Obtain a masked image from the original image and the corresponding mask image, extract features from the masked image to obtain an intermediate feature map, and obtain the average attention weights at the same time. Step S2. After fusing position information into the intermediate feature map, perform stepwise upsampling to obtain the first multi-scale prior representation; then perform stepwise upsampling on the first multi-scale prior representation to obtain the second multi-scale prior representation, and so on, to obtain the Nth multi-scale prior representation. Step S3. Fuse the masked image with the Nth multi-scale prior representation to obtain the first output feature, and perform hierarchical multi-scale feature enhancement on the first output feature and the average attention weights to obtain the first backbone feature. The hierarchical multi-scale feature enhancement iterates for a set number of levels as follows: the first output feature and the average attention weights are taken as input; the average attention weights are reshaped while patches are extracted through an unfolding operation; a grouped transposed convolution is then performed and the result is processed by a gated convolutional feedforward network to obtain the final local enhancement feature; the final local enhancement feature and the average attention weights undergo prior fusion state space processing and are then processed by a gated convolutional feedforward network. Step S4. Fuse the first backbone feature with the (N-1)th multi-scale prior representation to obtain the first intermediate feature; use the first intermediate feature as the first output feature and execute S3 to obtain the second backbone feature; repeat this process to obtain the Nth backbone feature. Step S5. Concatenate the Nth backbone feature with the (N-1)th backbone feature and refine the feature to obtain the first result feature; concatenate the first result feature with the (N-2)th backbone feature and refine the feature to obtain the second result feature; continue in this manner until the (N-1)th result feature is obtained as the final repaired image.

2. The image inpainting method based on context-consistent multi-scale feature fusion according to claim 1, characterized in that, in step S3, the process of taking the first output feature and the average attention weights as input, reshaping the average attention weights, extracting patches through an unfolding operation, performing a grouped transposed convolution, and then processing the result through a gated convolutional feedforward network to obtain the final local enhancement feature is as follows: the first output feature and the average attention weights are taken as input; the local features of each window are calculated within non-overlapping rectangular windows; the average attention weights are reshaped and patches are extracted from each spatial location using an unfolding operation; a grouped transposed convolution is performed between the reshaped average attention weights and the patch tensor; the final output is the result of concatenating all attention-enhanced features along the channel dimension and adding the transposed-convolution result, which yields the final local enhancement feature.

3. The image inpainting method based on context-consistent multi-scale feature fusion according to claim 1, characterized in that, in step S3, the process of subjecting the final local enhancement feature and the average attention weights to prior fusion state space processing and then processing them through a gated convolutional feedforward network is as follows: a prior fusion state space model is established and a parallel branch consisting of a convolution and an activation function is constructed; the final local enhancement feature and the average attention weights are taken as input; the concatenation-fusion result is combined with the final local enhancement feature through weighted residual connections; the outputs of the two branches are concatenated, passed through a linear projection layer, and then processed by a gated convolutional feedforward network to finally obtain the first backbone feature.

4. The image inpainting method based on context-consistent multi-scale feature fusion according to claim 1, characterized in that, in step S3, the process of fusing the masked image with the fourth multi-scale prior representation is as follows: the masked image and the fourth multi-scale prior representation are concatenated along the channel dimension and passed through a lightweight convolutional network to generate a weight map; at the same time, the concatenated features are processed by convolution to generate a fused feature map; the fused feature map is combined with the previous features or backbone features through weighted residual connections to obtain the first output feature.

5. The image inpainting method based on context-consistent multi-scale feature fusion according to claim 1, characterized in that the feature refinement described in step S5 includes alignment and dimensionality reduction processing and hierarchical multi-scale feature enhancement processing; the alignment and dimensionality reduction processing receives the concatenated features and performs a convolution operation to complete channel alignment and dimensionality reduction; after the feature dimensions are normalized, the features undergo hierarchical multi-scale feature enhancement with the average attention weights.

6. The image inpainting method based on context-consistent multi-scale feature fusion according to claim 1, characterized in that the gated convolutional feedforward network processing in step S3 is as follows: first, the two features are concatenated along the channel dimension and the channel dimension is transformed; then, local spatial information is mixed through depthwise separable convolution; subsequently, the fused features are adaptively filtered using a gating mechanism and restored to the original channel dimension through a dimensionality-reduction convolution to complete the feature processing.

7. The image inpainting method based on context-consistent multi-scale feature fusion according to claim 1, characterized in that the fusion of position information in step S2 is as follows: position embedding is applied to the intermediate feature map, and the output of the position embedding is concatenated with the intermediate feature map.

8. An image inpainting system based on context-consistent multi-scale feature fusion, applied to the image inpainting method based on context-consistent multi-scale feature fusion as described in claim 1, characterized in that it includes: a mask autoencoder module, which obtains a masked image from the original image and the corresponding mask image, extracts features from the masked image to obtain an intermediate feature map, and simultaneously obtains the average attention weights; a multi-scale feature decoder module, which fuses position information into the intermediate feature map and then performs stepwise upsampling to obtain the first multi-scale prior representation, upsamples the first multi-scale prior representation stepwise to obtain the second multi-scale prior representation, and so on until the Nth multi-scale prior representation is obtained; an improved fusion and hierarchical multi-scale enhancement module, which fuses the masked image with the Nth multi-scale prior representation to obtain the first output feature, and applies hierarchical multi-scale feature enhancement to the first output feature and the average attention weights to obtain the first backbone feature, where the hierarchical multi-scale feature enhancement iterates for a set number of levels as follows: the first output feature and the average attention weights are taken as input; the average attention weights are reshaped while patches are extracted through an unfolding operation; a grouped transposed convolution is then performed and the result is processed by a gated convolutional feedforward network to obtain the final local enhancement feature; the final local enhancement feature and the average attention weights undergo prior fusion state space processing and are then processed by a gated convolutional feedforward network; a stacked feature enhancement module, which fuses the first backbone feature with the (N-1)th multi-scale prior representation to obtain the first intermediate feature, uses the first intermediate feature as the new first output feature, and executes the improved fusion and hierarchical multi-scale enhancement module again to obtain the second backbone feature, and so on until the Nth backbone feature is obtained; and an image restoration and reconstruction module, which concatenates the Nth backbone feature with the (N-1)th backbone feature and refines the feature to obtain the first result feature, concatenates the first result feature with the (N-2)th backbone feature and refines it to obtain the second result feature, and repeats this process until the (N-1)th result feature is obtained, which is used as the final restored image.

9. The image inpainting system based on context-consistent multi-scale feature fusion according to claim 8, characterized in that the improved fusion and hierarchical multi-scale enhancement module includes a final local enhancement feature module, which takes the first output feature and the average attention weights as input, calculates the local features of each window within non-overlapping rectangular windows, reshapes the average attention weights, extracts patches from each spatial location using an unfolding operation, performs a grouped transposed convolution between the reshaped average attention weights and the patch tensor, and finally outputs the result of concatenating all attention-enhanced features along the channel dimension and adding the transposed-convolution result.

10. The image inpainting system based on context-consistent multi-scale feature fusion according to claim 8, characterized in that the improved fusion and hierarchical multi-scale enhancement module includes a first backbone feature module, which establishes a prior fusion state space model and a parallel branch composed of a convolution and an activation function, takes the final local enhancement feature and the average attention weights as input, and combines the concatenation-fusion result with the final local enhancement feature through weighted residual connections; the outputs of the two branches are concatenated, passed through a linear projection layer, and then processed by a gated convolutional feedforward network.