A text image shadow removing method based on background guidance

By using a background-guided two-stage network model, the problem of poor generalization ability of text image shadow removal methods is solved, generating shadow-free images with good visual effects and clear textures, thereby improving image readability and the accuracy of automated processing.

CN116703757B · Active · Publication Date: 2026-06-26 · WUHAN UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: WUHAN UNIV OF SCI & TECH
Filing Date: 2023-05-16
Publication Date: 2026-06-26


Abstract

The application provides a text image shadow elimination method based on background guidance, comprising the following steps: establishing a text image shadow data set of a real scene and a corresponding shadow-free image data set; constructing a background extraction network for obtaining a shadow-free background image corresponding to an input shadow image and a background encoder feature; constructing a background-guided generator network model, which comprises two stages, and using a background-based attention module and a texture enhancement module in the second stage to refine the result of the first stage and obtain a final shadow elimination result image; constructing a discriminator network model; constructing a loss function to train and optimize the generator network model and the discriminator network model, and obtaining the optimized generator network model and the discriminator network model to eliminate the shadow of a to-be-processed shadow image, so as to obtain a shadow-free image with good visual effect, strong contrast and naturalness.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, and in particular to a method for removing shadows from text images based on background guidance.

Background Technology

[0002] Documents are ubiquitous in people's daily lives, such as textbooks, newspapers, flyers, and receipts. These documents typically need to be saved as electronic files for digital archiving or online messaging. With the widespread use of mobile phones, people prefer to use them to digitize document copies. However, when light is obstructed, captured text images are easily affected by shadows. Low brightness in shadowed areas reduces the quality and readability of text images, making content difficult to read and resulting in a poor user experience. Simultaneously, shadows can also cause a loss of image sharpness, contrast, and detail, which can affect subsequent image processing applications, such as optical character recognition. Therefore, text image shadow removal is an essential image processing task in computer vision applications.

[0003] Document image shadow removal refers to removing shadows from document images using image processing techniques, making the images clearer and easier to read. In document digitization and image processing applications, it is an important preprocessing step that improves the readability of document images and the accuracy of automated processing. Removing shadows enhances the contrast between text and other image elements, making the text clearer and reducing the recognition error rate. Document image shadow removal can be achieved through various algorithms and techniques: traditional methods based on color space and texture, which use prior knowledge to model and remove shadows, and deep learning-based methods, which learn the mapping between shadowed and shadowless images in the training dataset to restore the texture and color of shadowed areas under normal lighting. However, existing traditional methods often work well only for a certain type of shadow and generalize poorly. Most existing deep learning-based shadow removal methods are designed for natural images. Because natural images and text images have different image attributes, natural image shadow removal methods cannot effectively remove shadows from text images.

Summary of the Invention

[0004] This invention is made to solve the above-mentioned problems and aims to provide a background-guided text image shadow removal method. It proposes a two-stage network model to complete the shadow removal task of the image and obtain a more natural and realistic text image after shadow removal.

[0005] To achieve the above objectives, the present invention adopts the following technical solution:

[0006] A background-guided text image shadow removal method includes:

[0007] S1: Construct a text image shadow dataset X consisting of shadow images $I$ of real scenes, and a dataset Y consisting of the corresponding real shadowless images $I_{gt}$;

[0008] S2: Construct a background extraction network to obtain the shadowless background image corresponding to the input shadow image $I$ and the background encoder features $x_f$;

[0009] S3: Construct a background-guided generator network model, which consists of two stages; the first stage obtains an initial shadow removal result $I_{coarse}$ from the shadow image $I$ and the background encoder features $x_f$; the second stage uses a background-based attention module and a texture enhancement module to refine the initial shadow removal result $I_{coarse}$ of the first stage, thereby obtaining the final shadow removal result image $I_{final}$;

[0010] S4: Construct a discriminator network model to distinguish the generated shadow removal result image $I_{final}$ from the real shadowless image $I_{gt}$;

[0011] S5: Construct a loss function to train and optimize the generator network model and the discriminator network model, and obtain the optimized generator network model and discriminator network model to perform shadow removal on the shadow image to be processed.

[0012] Furthermore, in step S2, the background extraction network includes an encoder and a decoder. The encoder applies several convolutional layers, batch normalization layers, and LeakyReLU to extract feature maps from the image. Correspondingly, the decoder applies several transposed convolutional layers, batch normalization layers, and ReLU to restore the low-resolution feature maps to their original resolution, thus obtaining the predicted shadowless background image. Furthermore, skip connections are used between the encoder and decoder to connect the feature maps of the encoder stage with the corresponding feature maps of the decoder stage.

[0013] Furthermore, the method also includes constructing a background reconstruction loss function to train the background extraction network. A background image $B$ extracted from the real shadowless image $I_{gt}$ serves as the label, and the background reconstruction loss $L_{background}$ computes the mean absolute error (MAE) between the shadowless background image generated by the background extraction network and the real background image $B$, constraining the background extraction network to produce the ideal background image.

[0014]

[0015] Furthermore, extracting the background image $B$ from the real shadowless image $I_{gt}$ includes:

[0016] Dividing the real shadowless image $I_{gt}$ into several patches to obtain the local background;

[0017] For each patch, clustering the pixels into a text content category and a background category based on pixel intensity, and using the average value of the background category pixels as the background color of the patch;

[0018] Optimizing the local background with a color-preserving smoothing operator to obtain the real background image $B$, whose pixel values can be calculated using the following formula:

[0019]

[0020] Where $N(i)$ is the local neighborhood of pixel $i$, and $W_{ij}$ is a filter kernel that measures the color similarity between pixels $i$ and $j$. The filter kernel $W_{ij}$ is specifically:

[0021]

[0022] $\mu_k$ and $\sigma_k^2$ are the mean and variance of $I_{gt}$ in $N(i)$, $|\omega|$ is the number of pixels in $N(i)$, and $\varepsilon$ is a regularization parameter that prevents $W_{ij}$ from becoming excessively large.

[0023] Furthermore, in step S3, the first stage of the generator network model uses DenseUnet as the basic network architecture to construct a background-constrained encoder-decoder; the encoder includes convolution and nonlinear transformation operations to downsample the input image and extract image features, and the decoder takes the downsampled image features as input and reconstructs the image through upsampling with deconvolution operations, thereby obtaining the initial shadow removal result image $I_{coarse}$.

[0024] Furthermore, in step S3, the second stage of the generator network model uses DenseUnet as the basic network architecture. A background attention module is set in the hierarchical connection between the encoder and decoder. The attention module includes convolutional layers, LeakyReLU activation functions, batch normalization layers, and residual blocks. It takes as input the integrated features obtained by channel concatenation of the encoder features $x_f$ and the background features, and generates a color-aware attention map. The color-aware attention map and the integrated features are then fused through element-wise multiplication to reconstruct the features, which are embedded into the corresponding decoder. The background features are extracted from the predicted shadowless background image through convolutional layers, batch normalization layers, and the LeakyReLU activation function.

[0025] Furthermore, in step S3, the second stage of the generator network model uses DenseUnet as the basic network architecture, sets up a detail enhancement module, and restores the texture details of the first stage result through the low-level features of the network. The detail enhancement module includes two parts: one is feature statistics to obtain statistical information of low-level features, and the other is feature equalization to enhance texture details.

[0026] Furthermore, in step S4, the discriminator network model uses a PatchGAN network, which consists of several convolutional layers. The last convolutional layer maps the input to a matrix, and the mean of the matrix is used as the output of the discriminator network model. The discriminator evaluates the shadow removal result image $I_{final}$ generated by the generator network model and feeds its errors back, allowing the generator network model to optimize and adjust its parameters.

[0027] Furthermore, in step S5, the loss function is expressed as follows:

[0028] $$Loss_{total} = L_{appearance} + L_{structure} + L_{adv}$$

[0029] $L_{appearance}$ is the appearance consistency loss, $L_{structure}$ is the structural consistency loss, and $L_{adv}$ is the adversarial loss;

[0030] The appearance consistency loss $L_{appearance}$ sums the absolute values of the differences between each pixel in the generated image and the target image:

[0031] $$L_{appearance} = \lambda_1 L_{coarse} + \lambda_2 L_{final}$$

[0032] $$= \lambda_1 \lVert I_{gt} - I_{coarse} \rVert_1 + \lambda_2 \lVert I_{gt} - I_{final} \rVert_1$$

[0033] $\lambda_1$ and $\lambda_2$ are weight parameters, $I_{coarse}$ is the coarse result produced in the first stage, and $I_{final}$ is the final shadow removal result produced in the second stage;

[0034] The structural consistency loss $L_{structure}$ aims to preserve the image structure and is expressed as:

[0035]

[0036] λ3 is the weight parameter, and VGG(·) is the feature extractor of the VGG19 model pre-trained on the ImageNet dataset;

[0037] The adversarial loss $L_{adv}$ is designed for the discriminator to determine the authenticity of the generated result and is expressed as:

[0038] $$L_{adv} = \lambda_4 \, E[\log(1 - D(G(I))) + \log D(I_{gt})]$$

[0039] $\lambda_4$ is the weight parameter, $D$ is the discriminator, and $I$ is the shadow image.

[0040] Furthermore, in step S5, during network training, the backpropagation algorithm is used to update the parameters of the generator network model and the discriminator network model. The update alternates: the parameters of the discriminator network are updated first, then the parameters of the generator network, and this is repeated until the preset number of training iterations is reached.

[0041] The beneficial technical effects of the present invention are as follows:

[0042] This invention provides a background-guided text image shadow removal method that constructs a novel background-guided two-stage generator network model. First, a background extraction network is built to extract a spatially varying colored background for the text image. This background preserves the different background colors in the image. The spatially varying background provides more effective color information for the subsequent shadow removal network. The background extracted by the background extraction network is shadow-free, which helps the shadow removal network learn more shadow-free features. The spatially varying background contains effective color information and shadow-free features, which is beneficial for the shadow removal task while better avoiding lighting or color artifacts in the image.

[0043] The constructed generator network model consists of two stages. In the first stage, a background-constrained decoder is introduced, combining background encoder features with image features. This helps the first-stage shadow removal network generate realistic shadow-free images. In the second stage, a background-based attention module and a texture enhancement module are used to refine the results of the first stage. The background-based attention module aims to eliminate lighting and color inconsistencies in the image by using an attention mechanism; the texture enhancement module aims to enhance low-level detail features, resulting in a highly realistic, high-contrast, texture-clear, and visually appealing shadow-free image.

[0044] Furthermore, through training and optimization with the loss function, the final generator network model and discriminator network model can process shadow images more naturally and realistically.

Description of the Drawings

[0045] Figure 1 is a flowchart illustrating the main process of model training for the background-guided text image shadow removal method in this embodiment of the invention.

[0046] Figure 2 is a flowchart illustrating the model training process for the background-guided text image shadow removal method in this embodiment of the invention.

[0047] Figure 3 shows a shadow image before and after processing with the background-guided text image shadow removal method in this embodiment of the invention, where (a) is the image before processing and (b) is the image after processing.

[0048] Figure 4 is a flowchart of the detail enhancement module steps in an embodiment of the present invention.

Detailed Implementation

[0049] To further understand the present invention, preferred embodiments of the present invention are described below in conjunction with examples. However, it should be understood that these descriptions are only for further illustrating the features and advantages of the present invention, and not for limiting the scope of the claims of the present invention.

[0050] Referring to Figures 1-3, this embodiment provides a text image shadow removal method based on background guidance, including the following steps:

[0051] Step 1: Create a real-world text image shadow dataset X, and a corresponding real-world shadow-free image dataset Y. Specifically, this includes:

[0052] Using a camera on a tripod with a wireless remote, images of various paper documents, such as papers, books, and brochures, were captured under different lighting conditions and in different scenarios with fixed built-in camera parameters. During shooting, occluders were placed and then removed to capture a shadow image $I$ and a corresponding real shadowless image $I_{gt}$. By using occluders of different shapes and angles, multiple sets of images were captured. Each set contains one shadow image $I$ and one corresponding real shadowless image $I_{gt}$.

[0053] In this embodiment, 4916 real-scene shadow images $I$ and the 4916 corresponding real shadowless images $I_{gt}$ were collected, giving 4916 sets of image data. 4371 sets were randomly assigned to the training set, and the remaining 545 sets were used as the test set for model evaluation.
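The random split described above can be sketched as follows. This is a minimal illustration, not the procedure used for the patent's dataset: the integer pair ids, the `split_pairs` helper, and the fixed seed are assumptions introduced here.

```python
import random

def split_pairs(pair_ids, n_train, seed=0):
    """Randomly split paired sample ids into train and test lists."""
    ids = list(pair_ids)
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

# 4916 (shadow, shadowless) pairs, indexed 0..4915 for illustration.
train_ids, test_ids = split_pairs(range(4916), n_train=4371)
print(len(train_ids), len(test_ids))  # 4371 545
```

Splitting by pair id rather than by individual image keeps each shadow image together with its shadowless counterpart, so no pair straddles the training and test sets.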

[0054] Step 2: Construct a background extraction network to obtain the shadowless background image corresponding to the input shadow image $I$ and the background encoder features $x_f$. Specifically, this includes:

[0055] 2.1) Constructing a background extraction network

[0056] The background extraction network consists of an encoder and a decoder, which take the shadow image I as input and output the corresponding spatially varied shadowless background image. A spatially varying background contains effective color information and shadow-free features, which helps in shadow removal tasks while better avoiding lighting or color artifacts in the image.

[0057] Specifically, in this embodiment, the background extraction network first uses five convolutional layers with batch normalization layers and LeakyReLU as an encoder to extract feature maps from the image. Correspondingly, it uses five transposed convolutional layers with batch normalization layers and ReLU as a decoder to restore the low-resolution feature maps to the original resolution, thus obtaining the predicted shadowless background image. Skip connections between the encoder and decoder link feature maps from the encoder stage to their corresponding feature maps in the decoder stage, which helps address the low-level feature loss common in dense image prediction tasks. The background encoder features $x_f$ are generated while the encoder downsamples the input image, and they serve as auxiliary information for the generator network in subsequent steps.
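To see how five downsampling layers and five transposed convolutions can return to the input resolution, the spatial sizes can be traced with the standard convolution size formulas. The text does not state kernel size, stride, padding, or input size, so the values below (4×4 kernels, stride 2, padding 1, a 256-pixel input) are assumptions chosen only for illustration.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel

def trace_resolution(size, depth=5):
    """Return the spatial size after each encoder and decoder layer."""
    encoder = [size]
    for _ in range(depth):
        encoder.append(conv_out(encoder[-1]))
    decoder = [encoder[-1]]
    for _ in range(depth):
        decoder.append(deconv_out(decoder[-1]))
    return encoder, decoder

enc, dec = trace_resolution(256)
print(enc)  # [256, 128, 64, 32, 16, 8]
print(dec)  # [8, 16, 32, 64, 128, 256]
```

Under these assumed settings every decoder size has an encoder feature map of the same size, which is what allows a skip connection to join the two.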

[0058] 2.2) Extracting the background image $B$ from the real shadowless image $I_{gt}$ as the label for training the background extraction network.

[0059] To train the background extraction network, a local-to-global strategy is employed to construct real backgrounds as labels for supervised training. Specifically, the real shadowless image $I_{gt}$ is first divided into 16×16 patches to capture the local background. For each patch, a Gaussian mixture model clusters the pixels into two categories based on pixel intensity: text content and background. Because document backgrounds are typically brighter than text, the category with higher pixel intensity is taken as the background category, and the average pixel value of that category is used as the background color of the patch. Since different patches have different background colors, the local background typically shows obvious patch boundaries. Bilateral filtering achieves edge-preserving denoising by combining spatial proximity with pixel value similarity, so a color-preserving smoothing operator of this kind is used here to optimize the local background and obtain the desired real background image $B$. Specifically, the pixel values of the real background image $B$ can be calculated using the following formula:

[0060]

[0061] Where $N(i)$ is the local neighborhood of pixel $i$, and $W_{ij}$ is a filter kernel that measures the color similarity between pixels $i$ and $j$. Because the shadowless image $I_{gt}$ carries the edge information, the filter kernel $W_{ij}$ is computed from $I_{gt}$. Specifically:

[0062]

[0063] Where $\mu_k$ and $\sigma_k^2$ are the mean and variance of $I_{gt}$ in $N(i)$, $|\omega|$ is the number of pixels in $N(i)$, and $\varepsilon$ is a regularization parameter that prevents $W_{ij}$ from becoming excessively large.
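The patch-wise part of the label construction in 2.2 can be sketched as below. Several simplifications are assumptions made here and not taken from the text: the image is grayscale, "16×16" is read as a patch size in pixels, a one-dimensional two-means clustering replaces the Gaussian mixture model, and the edge-aware smoothing step is omitted. The function names are hypothetical.

```python
import numpy as np

def two_class_background(patch, iters=10):
    """Mean intensity of the brighter of two intensity clusters in a patch."""
    values = patch.astype(float).ravel()
    lo, hi = values.min(), values.max()
    if hi == lo:                      # uniform patch: everything is background
        return float(hi)
    for _ in range(iters):            # 1-D two-means on pixel intensity
        bright = np.abs(values - hi) <= np.abs(values - lo)
        new_lo, new_hi = values[~bright].mean(), values[bright].mean()
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return float(hi)

def local_background(gray, patch=16):
    """Piecewise-constant background: one value per patch."""
    h, w = gray.shape
    out = np.empty((h, w), dtype=float)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            block = gray[y:y + patch, x:x + patch]
            out[y:y + patch, x:x + patch] = two_class_background(block)
    return out

# Synthetic page: bright paper (220) with a dark text stroke (30).
page = np.full((32, 32), 220.0)
page[10:12, 4:28] = 30.0
bg = local_background(page)
print(bg.min(), bg.max())  # 220.0 220.0
```

On a real page the per-patch values differ, which produces the visible patch boundaries that the smoothing operator is meant to remove.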

[0064] 2.3) Training the Background Extraction Network

[0065] To obtain an ideal background image, the background reconstruction loss $L_{background}$ constrains the output of the background extraction network through the reconstruction error of the background image. $L_{background}$ is the mean absolute error (MAE) between the background image generated by the background extraction network and the ground truth background image $B$:

[0066]

[0067] To minimize the loss function, the gradient of the loss function with respect to each model parameter is calculated using the backpropagation algorithm. The gradient descent algorithm then updates the network parameters based on these gradients.
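A small numerical sketch of an MAE loss and a gradient descent step follows. It is a toy under stated assumptions: the "parameters" being updated are the predicted pixel values themselves instead of network weights, and the learning rate and array sizes are hypothetical.

```python
import numpy as np

def mae_loss(pred, target):
    """Mean absolute error between two equally shaped arrays."""
    return float(np.mean(np.abs(pred - target)))

def mae_grad(pred, target):
    """Subgradient of the MAE with respect to pred."""
    return np.sign(pred - target) / pred.size

target = np.full((4, 4), 0.8)     # stand-in for the label background B
pred = np.full((4, 4), 0.5)       # stand-in for the network's prediction
lr = 0.8                          # hypothetical learning rate
for step in range(3):
    pred = pred - lr * mae_grad(pred, target)   # gradient descent update
    print(step, round(mae_loss(pred, target), 4))
```

Each step moves the prediction toward the label and the printed loss decreases. In the actual network the same gradient is propagated back through the layers to the weights.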

[0068] Step 3: Construct the generator network model, which consists of two stages;

[0069] In the first stage, a background-constrained decoder is constructed, which combines the background encoder features $x_f$ with image features extracted from the input image to generate an initial shadow removal result $I_{coarse}$.

[0070] In the second stage, a background-based attention module and a texture enhancement module are used to further refine the results of the first stage. The background-based attention module utilizes an attention mechanism to eliminate inconsistencies in lighting and color in the image; the texture enhancement module aims to enhance low-level detail features, thereby improving the shadow removal effect. Specifically:

[0071] Phase 1 network:

[0072] DenseUnet is adopted as the basic network architecture. DenseUnet offers strong feature extraction with a small parameter count. The encoder consists of convolutional and nonlinear transformation operations that downsample the input image and extract image features. The decoder takes the downsampled image features as input and reconstructs the image through upsampling with deconvolution operations, producing the initial shadow removal result image $I_{coarse}$.

[0073] The encoder contains 5 dense blocks and downsampling blocks, while the decoder contains 5 dense blocks and upsampling blocks. Each dense block consists of multiple composite layers, each comprising a batch normalization layer, an activation function, a 3×3 convolutional layer, and a Dropout layer with a dropout rate of 0.2. Each layer receives the outputs of all preceding layers as input and concatenates its own output with the outputs of the preceding layers. The downsampling block consists of a batch normalization layer, an activation function, a 1×1 convolutional layer, a Dropout layer with a dropout rate of 0.2, and a 2×2 max-pooling layer, and is used to reduce the size and number of feature maps. The upsampling block consists of a 3×3 transposed convolution with a stride of 2.
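The dense connectivity pattern, in which each layer receives the concatenated outputs of all preceding layers, can be illustrated with a short NumPy sketch. The composite layer here is reduced to a random 1×1 projection followed by ReLU as a stand-in for batch normalization, activation, 3×3 convolution, and dropout. The layer count, growth rate, and the `dense_block` name are assumptions for illustration.

```python
import numpy as np

def dense_block(x, num_layers=4, growth=12, seed=0):
    """Dense connectivity: each layer sees the concatenation of all earlier outputs."""
    rng = np.random.default_rng(seed)
    features = [x]                                  # list of (C_i, H, W) arrays
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)      # all preceding outputs
        w = rng.standard_normal((growth, inp.shape[0])) * 0.1
        # 1x1 projection + ReLU as a stand-in for the composite layer
        out = np.maximum(np.einsum("oc,chw->ohw", w, inp), 0.0)
        features.append(out)
    return np.concatenate(features, axis=0)

x = np.ones((16, 8, 8))
y = dense_block(x)
print(y.shape)  # (64, 8, 8)
```

The channel count grows linearly (16 input channels plus 4 layers of 12 channels each), which is why a downsampling block that also reduces the number of feature maps follows each dense block.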

[0074] The shadow image $I$ is input into the background extraction network to obtain the predicted shadowless background image and the background encoder features $x_f$, which serve as auxiliary information for the generator network. The shadow image $I$ and the background encoder features $x_f$ are then input to the first-stage network of the generator to obtain the initial shadow removal result $I_{coarse}$.

[0075] In this embodiment, a background-constrained decoder replaces the standard decoder. The image features obtained in the encoder stage are concatenated with the background encoder features $x_f$ at the corresponding levels, and the combined features are input into the background-constrained decoder. The combined features supplement the image features and help produce satisfactory shadow removal results.

[0076] Second-stage network:

[0077] In this embodiment, DenseUnet is used as the basic network architecture, and better shadow removal results are generated through background-based attention modules and detail enhancement modules.

[0078] A background attention module is added to the hierarchical connection between the encoder and decoder. The attention module first fuses the encoder features $x_f$ with the background features through channel concatenation, where the background features are extracted from the predicted shadowless background image using five convolutional layers with batch normalization layers and LeakyReLU. The combined features are then input into an attention computation unit to generate a color-aware attention map. The attention computation unit consists of convolutional layers, a LeakyReLU activation function, a batch normalization layer, and residual blocks. Finally, the color-aware attention map and the combined features are fused through element-wise multiplication to reconstruct the features, which are then embedded into the corresponding decoder. The color-aware attention map enables the network to adaptively focus on regions with similar backgrounds, promoting visual consistency among these regions.
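The data flow of the attention module (channel concatenation, attention map, element-wise multiplication) can be sketched in NumPy. The attention computation unit is reduced here to a single 1×1 projection with a sigmoid, which is an assumption: the text lists convolutional layers, LeakyReLU, batch normalization, and residual blocks, and does not name the final squashing function. All names and shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def background_attention(enc_feat, bg_feat, weight):
    """Fuse encoder and background features, then reweight them with an attention map."""
    fused = np.concatenate([enc_feat, bg_feat], axis=0)   # channel concatenation
    logits = np.einsum("oc,chw->ohw", weight, fused)      # 1x1 conv stand-in
    attention = sigmoid(logits)                           # values in (0, 1)
    return attention * fused                              # element-wise multiplication

rng = np.random.default_rng(0)
enc = rng.standard_normal((8, 16, 16))    # stand-in for encoder features
bg = rng.standard_normal((8, 16, 16))     # stand-in for background features
w = rng.standard_normal((16, 16)) * 0.1   # hypothetical projection weights
out = background_attention(enc, bg, w)
print(out.shape)  # (16, 16, 16)
```

Because the attention values lie between 0 and 1, the module can only attenuate features, so positions the map scores low contribute less to the decoder input.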

[0079] See Figure 4 for details. Because the network contains multiple convolution and downsampling operators, some detailed information is lost at higher levels, resulting in blurred details. Compared to high-level features, low-level features in convolutional layers typically contain more texture details. These low-level features usually refer to the original pixel information in the input image and the more basic image features extracted through convolution operations, such as edges, corners, and textures. High-level features, on the other hand, are abstract, global image features obtained by applying multiple convolution and pooling operations to the low-level features. To address this issue, a detail enhancement module is proposed that uses the network's low-level features to recover the texture details of the first-stage result. The statistical texture information of an image reflects the intensity of the texture to a certain extent. Therefore, the proposed detail enhancement module consists of two parts: feature statistics, which obtains statistical information of the low-level features, and feature equalization, which enhances texture details. Specifically, the features from the first two low-level layers of the encoder are fused to obtain the concatenated low-level features $F$, which are fed into the detail enhancement module for feature statistics. The purpose of feature statistics is to obtain the quantization encoding map and the statistical features. First, two 2×2 convolutional layers are used to generate a feature map $M$, and global average pooling is then applied to obtain the global average feature of $M$. The cosine similarity between $M$ and its global average feature is computed and denoted $S$. To quantize and count the data efficiently, a set of quantization levels $L$ is constructed by dividing the range between the minimum and maximum values of $S$ into $N$ equal parts. The correlation matrix $S$ can then be quantized into a quantization encoding matrix $E$ using $L$:

[0080]

[0081] Where $i \in [1, HW]$ and $n \in [1, N]$. $H$ and $W$ are the height and width of the image, $L_n$ is the $n$-th level of $L$, and $S_i$ is the $i$-th row of $S$. In this embodiment, $N = 128$.
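The feature statistics step can be sketched as follows. The encoding formula itself is not reproduced in this text, so two choices below are assumptions and may differ from the patented formula: the quantization levels are taken as the centers of $N$ equal-width bins, and each position is encoded as a hard one-hot vector at its nearest level. The function name and input shape are hypothetical.

```python
import numpy as np

def quantize_similarity(M, N=128):
    """Cosine similarity to the global average feature, quantized into N levels."""
    C, H, W = M.shape
    flat = M.reshape(C, H * W)                   # one column per spatial position
    g = flat.mean(axis=1, keepdims=True)         # global average pooling
    eps = 1e-8
    S = (flat * g).sum(axis=0) / (
        np.linalg.norm(flat, axis=0) * np.linalg.norm(g) + eps)
    edges = np.linspace(S.min(), S.max(), N + 1) # N equal parts of [min, max]
    levels = (edges[:-1] + edges[1:]) / 2.0      # assumption: levels are bin centers
    idx = np.abs(S[:, None] - levels[None, :]).argmin(axis=1)
    E = np.zeros((H * W, N))
    E[np.arange(H * W), idx] = 1.0               # assumption: hard one-hot encoding
    return S, levels, E

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 8, 8))               # stand-in for the feature map M
S, levels, E = quantize_similarity(M)
print(S.shape, levels.shape, E.shape)  # (64,) (128,) (64, 128)
```

Summing `E` over its first axis gives a histogram of how many positions fall at each level, which is the kind of statistical texture information the module works with.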

[0082] To prevent vanishing gradients (gradients that shrink during backpropagation, making parameter updates in deep networks very slow and eventually preventing convergence), the matrix $E$ is normalized. The normalized result and the quantization levels $L$ are concatenated into a quantization count map $C$, which reflects the relative statistics of the low-level input features. Due to the concatenation operation, $C$ has 2 channels. Two 1×1 convolution operations are performed on $C$ to increase the number of channels, followed by a concatenation operation to obtain the absolute statistical information $H$. Low-level texture details are enhanced through feature equalization, which reconstructs a new set of quantization levels. First, a 1×1 convolution operation is performed on $H$ to obtain $G$. Then, $G$ is multiplied by its transpose and a softmax operation is applied to construct a learnable neighbor matrix $A$, which can be viewed as a similarity coefficient matrix. A new set of quantization levels $L'$ is reconstructed by matrix multiplication of $A$ and $G$. Based on the reconstructed quantization levels $L'$, feature equalization is performed on the original quantization encoding matrix $E$ to enhance detail features: the enhanced features $R$ are obtained by matrix multiplication of $L'$ and $E$. The entire process is shown in Figure 4. With the enhanced texture details, the decoder can capture detailed information more easily.

[0083] The shadow image $I$, the initial shadow removal result $I_{coarse}$, the predicted shadowless background image, and the background features are taken as input to the second-stage network model, whose output is the final shadow removal result image $I_{final}$;

[0084] Step 4: Construct a discriminator network model.

[0085] The discriminator is used to distinguish between generated data and real data, determining whether the input is real data or fake data produced by the generator. The final shadow removal result image $I_{final}$ and the corresponding shadowless image $I_{gt}$ from the training set are taken as input, and the discriminator network model judges the authenticity of $I_{final}$.

[0086] In this embodiment, the discriminator network model uses the PatchGAN network, which consists of 5 convolutional layers with a kernel size of 4×4 and a stride of 1. The output dimensions of the first 4 convolutional layers are 64, 128, 256, and 512, respectively. The last convolutional layer maps the input to a 30×30 matrix and uses the mean of the matrix as the output of the discriminator network model.

[0087] The role of the patch discriminator is to divide the input image into multiple overlapping small regions (also called "patches") and perform binary classification on each region to determine whether it is a real, shadowless portion of the image. Unlike the global discriminator, the patch discriminator focuses on local details of the image, effectively improving the realism of the image generation model. The patch discriminator evaluates the images generated by the generator and provides feedback on the generator's errors, allowing the generator to gradually adjust its parameters and generate more realistic images. Simultaneously, the patch discriminator also provides better feedback by comparing the image with real images, helping the generator to gradually approximate the distribution of the real image.

[0088] Step 5: Construct a loss function to train and optimize the generator network model and the discriminator network model, and obtain the optimized generator network model and discriminator network model.

[0089] The formula for the loss function is as follows:

[0090] $$Loss_{total} = L_{appearance} + L_{structure} + L_{adv}$$

[0091] Where $L_{appearance}$ is the appearance consistency loss, $L_{structure}$ is the structural consistency loss, and $L_{adv}$ is the adversarial loss.

[0092] The appearance consistency loss sums the absolute values of the differences between each pixel in the generated image and the target image:

[0093] $$L_{appearance} = \lambda_1 L_{coarse} + \lambda_2 L_{final}$$

[0094] $$= \lambda_1 \lVert I_{gt} - I_{coarse} \rVert_1 + \lambda_2 \lVert I_{gt} - I_{final} \rVert_1$$

[0095] Where $\lambda_1$ and $\lambda_2$ are weight parameters, $I_{coarse}$ is the preliminary result from the first stage, and $I_{final}$ is the final shadow removal result produced in the second stage.

[0096] Structural consistency loss aims to preserve image structure, and its calculation formula is as follows:

[0097]

[0098] Where λ3 is the weight parameter, and VGG(·) is the feature extractor of the VGG19 model pre-trained on the ImageNet dataset.

[0099] Adversarial loss is designed for the discriminator to determine the authenticity of the generated results, and is represented as:

[0100] $L_{adv} = \lambda_4 \, \mathbb{E}\left[\log\left(1 - D(G(I))\right) + \log D(I_{gt})\right]$

[0101] where $\lambda_4$ is the weight parameter, $\mathbb{E}$ denotes the expected value over the distribution, $D$ is the discriminator, $G$ is the generator, $I$ is the shadow image, and $I_{gt}$ is the real shadow-free image.
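
As a minimal sketch of how this expression could be evaluated on a batch, the following assumes that the discriminator outputs are probabilities in (0, 1) and that the expectation is approximated by a batch mean. The function name, the `eps` guard, and the sample values are assumptions for illustration.

```python
import math


def adversarial_loss(d_fake, d_real, lambda4=0.01, eps=1e-12):
    """lambda4 * mean(log(1 - D(G(I))) + log(D(I_gt))) over a batch.

    d_fake: discriminator outputs D(G(I)) for generated images, each in (0, 1).
    d_real: discriminator outputs D(I_gt) for real shadow-free images.
    eps guards against log(0); it is an implementation detail, not part of
    the formula.
    """
    if len(d_fake) != len(d_real):
        raise ValueError("batches must have the same size")
    terms = [
        math.log(max(1.0 - f, eps)) + math.log(max(r, eps))
        for f, r in zip(d_fake, d_real)
    ]
    return lambda4 * sum(terms) / len(terms)


# A discriminator that separates real from generated images scores closer to
# zero than one that is fooled by the generator.
confident = adversarial_loss(d_fake=[0.1, 0.2], d_real=[0.9, 0.8])
fooled = adversarial_loss(d_fake=[0.9, 0.8], d_real=[0.1, 0.2])
```

In the standard GAN formulation the discriminator is updated to increase this quantity, while the generator is updated to decrease the term that depends on $G$.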

[0102] During network training, the weight parameters λ1, λ2, λ3, and λ4 are set to 1, 1, 0.05, and 0.01, respectively.

[0103] When training the discriminator, a batch of real shadow-free images $I_{gt}$ and a batch of final shadow removal result images $I_{final}$ generated by the generator are first fed into the discriminator, and the discriminator loss function is calculated. The gradient of the discriminator loss function with respect to the discriminator parameters is obtained through backpropagation, and the gradient descent algorithm is then used to update the parameters of the discriminator network.

[0104] When training the generator, the generator first produces a batch of shadow removal result images $I_{final}$, which are fed into the discriminator, and the generator loss function is calculated. The gradient of the generator loss function with respect to the generator parameters is obtained through backpropagation, and the gradient descent algorithm is then used to update the parameters of the generator network.

[0105] By training the discriminator and generator alternately, the shadow removal result image $I_{final}$ generated by the generator becomes more realistic, while the discriminator becomes able to distinguish more accurately between the real shadow-free image $I_{gt}$ and the generated shadow removal result image $I_{final}$. This process is iterated, continuously training the generator network and the discriminator network until a preset number of training iterations is reached, so that they gradually approach their optimal state and yield shadow-free images with good visual effects, high contrast, and a natural appearance.

[0106] The optimized generator network model of this embodiment is used to process a shadow image. Figure 2 illustrates a specific case, where (a) is the image before processing and (b) is the image after processing. As shown in Figure 2, the background-guided text image shadow removal method provided by this invention effectively eliminates shadows in images and generates realistic, natural shadow-free images. Compared with existing technologies, the shadow removal effect of this invention is superior.
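
The alternating schedule can be sketched independently of any deep learning framework. In the sketch below, `discriminator_step` and `generator_step` are hypothetical callables that each perform one parameter update and return the loss of that update; the stand-in steps only record the call order so that the schedule can be inspected.

```python
def train_alternating(discriminator_step, generator_step, num_iterations):
    """Run one discriminator update, then one generator update, per iteration."""
    history = []
    for iteration in range(num_iterations):
        d_loss = discriminator_step()   # uses real I_gt and generated I_final
        g_loss = generator_step()       # uses the freshly updated discriminator
        history.append((iteration, d_loss, g_loss))
    return history


# Stand-in steps that log their call order and return a dummy loss of 0.0.
calls = []
history = train_alternating(
    discriminator_step=lambda: calls.append("D") or 0.0,
    generator_step=lambda: calls.append("G") or 0.0,
    num_iterations=3,
)
```

Stopping after a fixed `num_iterations` corresponds to the preset number of training iterations; no convergence test is assumed.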

[0107] The above description of the embodiments is only for the purpose of helping to understand the method and core ideas of the present invention. It should be noted that those skilled in the art can make several improvements and modifications to the present invention without departing from the principles of the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims

1. A method for removing shadows from text images based on background guidance, characterized in that it includes: S1: establishing a text image shadow dataset X consisting of shadow images of real scenes, and a dataset Y consisting of the corresponding real shadow-free images; S2: constructing a background extraction network to obtain the shadow-free background image and the background encoder features corresponding to the input shadow image, wherein the background extraction network contains an encoder and a decoder, takes the shadow image as input, and outputs a spatially varying shadow-free background image, the spatially varying background containing effective color information and shadow-free features; S3: constructing a background-guided generator network model, which consists of two stages; the first stage obtains an initial shadow removal result from the shadow image and the background encoder features; the second stage uses a background-based attention module and a texture enhancement module to refine the initial shadow removal result of the first stage and obtain the final shadow removal result image; the second stage uses DenseUnet as the basic network architecture and sets a background attention module in the skip connection between the encoder and the decoder; the attention module includes convolutional layers, the LeakyReLU activation function, a batch normalization layer and a residual block, takes as input the integrated features obtained by channel-wise concatenation of the encoder features and the background features, and generates a color-aware attention map; the color-aware attention map and the integrated features are then fused through element-wise multiplication to reconstruct the features, which are embedded into the corresponding decoder; wherein the background features are extracted from the predicted shadow-free background image through convolutional layers, batch normalization layers, and the LeakyReLU activation function; a detail enhancement module is set up, which uses the low-level features of the network to recover the texture details of the first-stage result; the detail enhancement module consists of two parts: feature statistics, which obtains statistical information of the low-level features, and feature equalization, which enhances texture details; the features from the first two low-level layers of the encoder are fused to obtain concatenated low-level features F, which are fed into the detail enhancement module for statistical analysis; first, two 2×2 convolutional layers are used to generate a feature map M, and global average pooling is performed to obtain the global average feature of M; cosine similarity is used to calculate the correlation between M and its global average feature, denoted as S; a set of quantization levels L is constructed by dividing the range between the minimum and maximum values of S into N equal parts; S is then quantized into a quantization encoding matrix E using L, where H and W are the length and width of the image, and the quantization is defined in terms of the n-th level of L and the i-th row of S; the matrix E is normalized, and the normalized result is integrated with the quantization levels L into a quantization count map C, reflecting the relative statistics of the low-level input features; two 1×1 convolution operations are performed on C to increase the number of channels, and the result is then concatenated with ... to obtain the absolute statistical information H; a 1×1 convolution operation is performed on H to obtain G, and matrix multiplication is performed on G and its transpose, followed by a SoftMax operation, to establish a learnable adjacency matrix A; the matrix multiplication of A and G is used to reconstruct new quantization levels; based on the reconstructed quantization levels, feature equalization is performed on the original quantization encoding matrix E to enhance detail features; S4: constructing a discriminator network model to distinguish the generated shadow removal result images from real shadow-free images; S5: constructing a loss function to train and optimize the generator network model and the discriminator network model, and obtaining the optimized generator network model and discriminator network model to perform shadow removal on the shadow image to be processed.
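
The first part of the feature statistics step (global average pooling, cosine similarity, and construction of the quantization levels) can be sketched as follows. This is an illustration under stated assumptions: the feature map M is flattened to a list of per-position channel vectors, each level is taken as the upper edge of its equal part (the claim only states that the range is divided into N equal parts), and the encoding of S into E is not shown. All function names and toy values are hypothetical.

```python
import math


def global_average(features):
    """Global average pooling: mean feature vector over all spatial positions."""
    n = len(features)
    return [sum(f[c] for f in features) / n for c in range(len(features[0]))]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def quantization_levels(similarities, num_levels):
    """Split [min(S), max(S)] into num_levels equal parts.

    Assumption: each level is the upper edge of its part.
    """
    lo, hi = min(similarities), max(similarities)
    step = (hi - lo) / num_levels
    return [lo + step * (n + 1) for n in range(num_levels)]


# Toy feature map M: 4 spatial positions with 2 channels each.
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
g = global_average(M)                       # global average feature of M
S = [cosine_similarity(m, g) for m in M]    # one similarity per position
L = quantization_levels(S, num_levels=4)    # N = 4 levels
```

Positions whose features point in the same direction as the global average receive a similarity close to 1, so the levels cover the spread between typical and atypical low-level features.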

2. The text image shadow removal method based on background guidance according to claim 1, characterized in that, in step S2, the background extraction network includes an encoder and a decoder. The encoder extracts feature maps from the image using several convolutional layers, batch normalization layers, and LeakyReLU. Correspondingly, the decoder uses several transposed convolutional layers, batch normalization layers, and ReLU to restore the low-resolution feature maps to the original resolution, thus obtaining the predicted shadow-free background image. Furthermore, skip connections are used between the encoder and decoder to connect the feature maps of each encoder stage with the corresponding feature maps of the decoder stage.

3. The text image shadow removal method based on background guidance according to claim 2, characterized in that it also includes constructing a background reconstruction loss function to train the background extraction network: a background image extracted from the real shadow-free image is used as the label, and the background reconstruction loss function computes the mean absolute error between the shadow-free background image generated by the background extraction network and the real background image, so as to constrain the background extraction network to obtain an ideal background image.

4. The text image shadow removal method based on background guidance according to claim 3, characterized in that extracting the background image from the real shadow-free image includes: dividing the real shadow-free image into several patches to obtain local backgrounds; for each patch, clustering the pixels into a text content category and a background category based on pixel intensity, and using the average value of the background category pixels as the background color of the patch; and optimizing with a color-smoothness-preserving operator to obtain the real background image, whose pixel values are calculated by a formula defined over the local neighborhood of pixel i, using a filter kernel that measures the color similarity between pixels i and j.
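
The per-patch clustering step of claim 4 can be illustrated with a small sketch. It assumes grayscale patches, uses a one-dimensional two-cluster k-means as the clustering method, and treats the brighter cluster as background (dark text on a lighter page). These choices, the function names, and the toy patch are assumptions; the smoothing step is not shown.

```python
def two_means(values, iterations=20):
    """1-D k-means with k=2; returns (low_center, high_center, labels)."""
    low, high = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        low_members = [v for v, lab in zip(values, labels) if lab == 0]
        high_members = [v for v, lab in zip(values, labels) if lab == 1]
        if not high_members:      # uniform patch: a single cluster
            break
        low = sum(low_members) / len(low_members)
        high = sum(high_members) / len(high_members)
    return low, high, labels


def patch_background(patch):
    """Mean intensity of the background cluster of one grayscale patch.

    Assumption: the brighter cluster is the background.
    """
    values = [v for row in patch for v in row]
    low, _high, labels = two_means(values)
    background = [v for v, lab in zip(values, labels) if lab == 1]
    if not background:            # uniform patch: every pixel is background
        return low
    return sum(background) / len(background)


# Toy 3x3 patch: three dark text pixels on a light page.
patch = [[0.9, 0.1, 0.9],
         [0.9, 0.1, 0.9],
         [0.9, 0.2, 0.9]]
color = patch_background(patch)
```

Averaging only the background cluster keeps dark text strokes from pulling the estimated paper color toward gray.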

5. The text image shadow removal method based on background guidance according to claim 1, characterized in that, in step S3, the first stage of the generator network model uses DenseUnet as the basic network architecture to construct a background-constrained encoder-decoder. The encoder includes convolution and nonlinear transformation operations to downsample the input image and extract image features. The decoder takes the downsampled image features as input and reconstructs the image through upsampling with deconvolution operations, thereby obtaining the initial shadow removal result image.

6. The text image shadow removal method based on background guidance according to claim 1, characterized in that, in step S4, the discriminator network model uses a PatchGAN network, which consists of several convolutional layers. The last convolutional layer maps the input to a matrix, and the mean of the matrix is used as the output of the discriminator network model to evaluate the shadow removal result image generated by the generator network model; the errors found in this evaluation are fed back so that the generator network model can optimize and adjust its parameters.

7. The text image shadow removal method based on background guidance according to claim 1, characterized in that, in step S5, the loss function is expressed as follows: $L_{total} = L_{appearance} + L_{structure} + L_{adv}$; where $L_{appearance}$ is the appearance consistency loss, $L_{structure}$ is the structural consistency loss, and $L_{adv}$ is the adversarial loss; the appearance consistency loss $L_{appearance}$ sums the absolute values of the differences between each pixel in the generated image and the target image: $L_{appearance} = \lambda_1 L_{coarse} + \lambda_2 L_{final} = \lambda_1 \lVert I_{gt} - I_{coarse} \rVert_1 + \lambda_2 \lVert I_{gt} - I_{final} \rVert_1$; where $\lambda_1$ and $\lambda_2$ are weight parameters, $I_{coarse}$ is the coarse result of the first stage, and $I_{final}$ is the final shadow removal result produced in the second stage; the structural consistency loss $L_{structure}$ aims to preserve the image structure and is expressed in terms of a weight parameter $\lambda_3$ and $VGG(\cdot)$, the feature extractor of the VGG19 model pre-trained on the ImageNet dataset; the adversarial loss $L_{adv}$ is designed for the discriminator to determine the authenticity of the generated result, and is expressed as: $L_{adv} = \lambda_4 \, \mathbb{E}\left[\log\left(1 - D(G(I))\right) + \log D(I_{gt})\right]$; where $\lambda_4$ is a weight parameter, $D$ is the discriminator, $G$ is the generator, $I$ is the shadow image, and $I_{gt}$ is the real shadow-free image.

8. The text image shadow removal method based on background guidance according to claim 1, characterized in that, in step S5, during network training, the backpropagation algorithm is used to update the parameters of the generator network model and the discriminator network model. The update is performed in an alternating iterative manner: the parameters of the discriminator network are updated first, then the parameters of the generator network, and the process is iterated until the preset number of training iterations is reached.