Image processing method and device, storage medium and electronic device
By combining a two-level cascaded neural network and multiple loss functions, the problems of shadow removal causing side effects on the background layer and high hardware requirements are solved, achieving accurate removal of shadow areas and speed improvement, which is suitable for mobile terminals such as mobile phones.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ARCSOFT CORP LTD
- Filing Date
- 2021-10-18
- Publication Date
- 2026-06-19
Figure CN116012232B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to image processing technology, and more specifically, to an image processing method, apparatus, storage medium, and electronic device.
Background Technology
[0002] When people take photos of documents with their mobile phones, shadows are often left on the documents due to the light being blocked by their hands and the phone, as well as by other objects in the environment. This affects the visual experience of the captured images. By processing the captured images with computer vision technology to eliminate shadows and restore the text and pictures behind the shadows, the image quality can be effectively improved. Therefore, document shadow elimination is a significant technology that can greatly improve the quality of captured images and has broad market prospects.
[0003] Effectively eliminating shadow layers without causing significant side effects on the background layer, while also having fast running speed and acceptable hardware configuration requirements, are the basic requirements and main challenges for shadow removal methods to be applied to mobile phones. Current shadow removal methods either fail to remove shadows completely, lose information from the background layer, or have slow running speeds, all of which are not conducive to use by ordinary users.
[0004] One existing shadow removal method uses a neural network with three modules: a global localization module, an appearance modeling module, and a semantic modeling module. The global localization module detects shadow regions and obtains their location features; the appearance modeling module learns features of non-shadow regions, ensuring the network output matches the labeled data (Ground Truth, GT) in non-shadow areas; and the semantic modeling module recovers the original content behind the shadow. However, this method does not directly output the background image after shadow removal. Instead, it outputs the ratio of the shadow image to the background image, requiring further pixel-by-pixel division of the shadow image with the network output to obtain the background image. This introduces a larger computational load, and the division may affect computational stability due to division by zero.
[0005] Therefore, it is necessary to propose an image processing technique that can effectively eliminate shadows without producing significant side effects on the background layer, while also having a fast running speed and acceptable hardware configuration requirements.
Summary of the Invention
[0006] The present invention provides an image processing method, apparatus, storage medium, and electronic device to at least solve the technical problems in the prior art that easily produce side effects on the image background layer while eliminating shadow areas and have high requirements for hardware platforms.
[0007] According to one aspect of the present invention, an image processing method is provided, comprising: acquiring an image to be processed containing a shadow region; inputting the image to be processed into a trained neural network to obtain a shadow-removed image; wherein the neural network comprises a first-level network and a second-level network cascaded together, the first-level network receiving the image to be processed and outputting a shadow region mask image, and the second-level network simultaneously receiving the image to be processed and the shadow region mask image and outputting the shadow-removed image.
[0008] Optionally, the first-level network includes: a first feature extraction module, comprising a first encoder, for extracting features of the image to be processed layer by layer to obtain a first set of feature data; and a shadow region estimation module, connected to the output of the first feature extraction module, comprising a first decoder, for estimating the shadow region based on the first set of feature data and outputting a shadow region mask map.
[0009] Optionally, the second-level network includes: a second feature extraction module, which includes a second encoder connected to the output of the first-level network, and receives the shadow region mask map output by the first-level network while receiving the image to be processed, for obtaining a second set of feature data; and a result image output module, which is connected to the output of the second feature extraction module and includes a second decoder for outputting a de-shadowed image based on the second set of feature data.
[0010] Optionally, the outputs of each layer of the first decoder or the second decoder are spliced along the channel axis with the outputs of the corresponding layers of the first encoder or the second encoder through cross-layer connections. A multi-scale pyramid pooling module is added to the cross-layer connections of the first decoder or the second decoder and the first encoder or the second encoder. The multi-scale pyramid pooling module fuses features of different scales.
[0011] Optionally, after obtaining the image to be processed containing the shadow region, the image processing method further includes: downsampling the image to be processed using an image pyramid algorithm, and saving the gradient information of each layer to form a Laplacian pyramid while downsampling; feeding the smallest layer into a trained neural network to obtain an output image; and using the Laplacian pyramid to reconstruct the output image from low resolution to high resolution to obtain a shadow-free image.
[0012] Optionally, the above image processing method further includes: constructing an initial neural network; training the initial neural network using sample data to obtain a trained neural network, wherein the sample data includes real-shot images and synthetic shadow images, and the synthetic shadow images are synthesized using an image synthesis method using pure shadow images and shadowless images.
[0013] Optionally, synthesizing the above-mentioned composite shadow image using an image synthesis method from a pure shadow image and a shadowless image includes: obtaining a pure shadow image; obtaining a shadowless image; and obtaining a composite shadow image based on the pure shadow image and the shadowless image.
[0014] Optionally, the method of synthesizing the above-mentioned composite shadow image using a pure shadow image and a shadowless image further includes: transforming the pure shadow image, and obtaining the composite shadow image based on the transformed pure shadow image and the shadowless image, wherein the pixel values of the non-shadow areas in the transformed pure shadow image are uniformly set to a fixed value 'a', and the pixel values of the shadow areas are values between 0 and 'a', where 'a' is a positive integer.
[0015] Optionally, the initial neural network also includes a module for classifying the sample data. When it is determined that the sample data input to the initial neural network is a real-world image, the labeled data is a shadow-free image captured from the real scene. The parameters inside the second-level network are adjusted based on the difference between the shadow-free image output by the initial neural network and the shadow-free image used as labeled data. When it is determined that the sample data input to the initial neural network is a synthetic shadow image, the labeled data includes shadowless images and pure shadow images captured from the real scene. The parameters inside the first-level network are adjusted based on the difference between the shadow area mask image and the pure shadow image. The parameters inside the second-level network are adjusted based on the difference between the shadow-free image output by the initial neural network and the shadowless image.
[0016] Optionally, when training the initial neural network using sample data, the loss function includes at least one of the following: pixel loss, feature loss, structural similarity loss, adversarial loss, shadow edge loss, and shadow brightness loss.
[0017] Optionally, the pixel loss includes pixel truncation loss. When the absolute difference between two corresponding pixels in the output image of the initial neural network and the label image is greater than a given threshold, the loss of the two pixels is calculated; when the absolute difference between two corresponding pixels in the output image of the initial neural network and the label image is not greater than a given threshold, the difference between the two pixels is ignored.
[0018] Optionally, the shadow brightness loss requires that the brightness of the region corresponding to the shadow area in the shadow-removed image output by the neural network, minus the brightness of the shadow area in the input image to be processed, is greater than 0, thereby raising the brightness of that region in the shadow-removed image.
[0019] Optionally, when the loss function includes shadow edge loss, the above image processing method includes: dilating the shadow region mask image to obtain a dilated image; eroding the shadow region mask image to obtain an eroded image; obtaining the difference between the dilated image and the eroded image as the boundary region between shadows and non-shadows, and smoothing it using TVLoss.
[0020] According to another aspect of the present invention, an image processing apparatus is also provided, comprising: an image acquisition unit for acquiring an image to be processed containing a shadow region; and a processing unit for receiving the image to be processed and processing the image to be processed using a trained neural network to obtain a shadow-removed image; wherein the neural network comprises a first-level network and a second-level network cascaded together, the first-level network receiving the image to be processed and outputting a shadow region mask image, and the second-level network simultaneously receiving the image to be processed and the shadow region mask image and outputting the shadow-removed image.
[0021] Optionally, the first-level network includes: a first feature extraction module, comprising a first encoder, for extracting features of the image to be processed layer by layer to obtain a first set of feature data; and a shadow region estimation module, connected to the output of the first feature extraction module, comprising a first decoder, for estimating the shadow region based on the first set of feature data and outputting a shadow region mask map.
[0022] Optionally, the second-level network includes: a second feature extraction module, which includes a second encoder connected to the output of the first-level network, and receives the shadow region mask map output by the first-level network while receiving the image to be processed, for obtaining a second set of feature data; and a result image output module, which is connected to the output of the second feature extraction module and includes a second decoder for outputting a de-shadowed image based on the second set of feature data.
[0023] According to another aspect of the present invention, a storage medium is also provided, including a stored program, wherein the program controls the device where the storage medium is located to execute the image processing method described in any one of the above embodiments when the program is running.
[0024] According to another aspect of the present invention, an electronic device is also provided, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the image processing method described in any one of the preceding embodiments by executing the executable instructions.
[0025] This invention proposes a fast and effective shadow removal method applicable to mobile terminals such as smartphones. It captures the characteristics of the physical phenomenon of shadows, synthesizes training materials with a strong sense of realism, and combines various loss functions and effective network structures and modules for training to achieve good shadow removal results. Taking into account the high resolution of images captured by mobile terminals such as smartphones, this invention adopts downsampling technology and network pruning technology, which can still achieve a fast processing speed on high-resolution images.
Attached Figure Description
[0026] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings:
[0027] Figure 1 is a flowchart of an optional image processing method according to an embodiment of the present invention;
[0028] Figure 2 is a structural diagram of an optional neural network according to an embodiment of the present invention;
[0029] Figure 3 is a flowchart of optionally training the neural network according to an embodiment of the present invention;
[0030] Figure 4 is a flowchart of an optional image synthesis method according to an embodiment of the present invention;
[0031] Figures 5(a) and 5(b) are comparison images of the effect of removing shadows using the image processing method of the present invention.
[0032] Figure 6 is a structural block diagram of an optional image processing apparatus according to an embodiment of the present invention.
Detailed Implementation
[0033] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0034] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such orders can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0035] The following is a flowchart illustrating an optional image processing method according to an embodiment of the present invention. It should be noted that the steps shown in the flowchart can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.
[0036] Referring to Figure 1, which is a flowchart of an optional image processing method according to an embodiment of the present invention, the image processing method includes the following steps:
[0037] S100: Obtain the image to be processed containing the shadow region;
[0038] S102: Input the image to be processed into the trained neural network to obtain the shadow-removed image; wherein the neural network includes a first-level network and a second-level network cascaded together. The first-level network receives the image to be processed and outputs a shadow region mask map, and the second-level network simultaneously receives the image to be processed and the shadow region mask map and outputs the shadow-removed image.
[0039] The image processing method described above can be used to obtain accurate shadow area boundaries, and the resulting de-shadowed image can smoothly transition between shadows and non-shadows.
[0040] In one alternative embodiment, as shown in Figure 2, the neural network comprises a first-level network 20 and a second-level network 22, cascaded together. The first-level network includes a first feature extraction module 200 and a shadow region estimation module 202, while the second-level network includes a second feature extraction module 204 and a result image output module 206. Specifically, the first feature extraction module 200 includes a first encoder for extracting features from the image to be processed layer by layer to obtain a first set of feature data; the shadow region estimation module 202, connected to the output of the first feature extraction module 200, includes a first decoder for estimating shadow regions based on the first set of feature data and outputting a shadow region mask map; the second feature extraction module 204 includes a second encoder, connected to the output of the first-level network, and receives the shadow region mask map output by the first-level network while receiving the image to be processed, for obtaining a second set of feature data; the result image output module 206, connected to the output of the second feature extraction module 204, includes a second decoder for outputting a shadow-removed image based on the second set of feature data. This two-level cascaded neural network enhances the shadow removal effect. In an alternative embodiment, the first-level network and the second-level network have the same structure except for the number of input channels; for example, they can be built based on the classic segmentation network UNet.
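The data flow of the two-level cascade can be illustrated with a minimal NumPy sketch. The two `*_stub` functions are hypothetical placeholders that stand in for the trained networks, and the 3-channel image plus 1-channel mask layout is an assumption; the text only states that the two networks differ in their number of input channels.

```python
import numpy as np

def first_level_stub(image):
    """Hypothetical stand-in for the first-level network: (H, W, 3) -> (H, W, 1) mask."""
    luminance = image.mean(axis=2, keepdims=True)
    return (luminance < luminance.mean()).astype(np.float32)

def second_level_stub(stacked):
    """Hypothetical stand-in for the second-level network: (H, W, 4) -> (H, W, 3)."""
    image, mask = stacked[..., :3], stacked[..., 3:]
    return np.clip(image + 0.2 * mask, 0.0, 1.0)

def remove_shadow(image):
    mask = first_level_stub(image)                    # level 1: image -> shadow region mask
    stacked = np.concatenate([image, mask], axis=2)   # level 2 receives image and mask together
    return second_level_stub(stacked), mask

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3)).astype(np.float32)
result, mask = remove_shadow(image)
```

The only structural point shown here is that the second level sees one more input channel than the first; the arithmetic inside the stubs has no relation to the actual trained weights.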
[0041] The outputs of each layer of the two encoders are concatenated along the channel axis with the outputs of the corresponding layers of the two decoders via cross-layer connections. A multi-scale pyramid pooling module is added to the cross-layer connections between the encoder and decoder. This module includes multiple pooling layers, convolutional layers, and interpolation upsampling layers with different kernel sizes. First, pooling layers extract features at different scales; then, convolutional layers extract low-level and / or high-level features; next, interpolation upsampling layers adjust the outputs of the corresponding layers of the encoder and decoder to the same size; finally, these are concatenated along the channel axis to form a single feature. Since the influence and area of shadows vary greatly across different images, the determination of shadow regions must consider both local texture features and global semantic information. The multi-scale pyramid pooling module fuses features at different scales, enhancing the network's generalization ability and enabling it to achieve good results on shadow maps of varying areas and intensities.
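A rough NumPy sketch of the pool, upsample, and concatenate pattern described above follows. The scales `(1, 2, 4)`, the use of average pooling, and nearest-neighbour upsampling are all assumptions chosen for brevity, and the convolution applied in each branch is omitted.

```python
import numpy as np

def avg_pool(x, k):
    """Average-pool an (H, W, C) array with a k x k window and stride k."""
    h, w, c = x.shape
    return x.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))

def upsample_nearest(x, k):
    """Nearest-neighbour upsampling by an integer factor k."""
    return x.repeat(k, axis=0).repeat(k, axis=1)

def pyramid_pool(features, scales=(1, 2, 4)):
    """Pool at several scales, bring every branch back to the input size, concatenate."""
    branches = []
    for k in scales:
        pooled = avg_pool(features, k)            # coarser view of the same features
        # A per-branch convolution would normally be applied here (omitted).
        branches.append(upsample_nearest(pooled, k))
    return np.concatenate(branches, axis=2)       # fuse along the channel axis

features = np.arange(8 * 8 * 2, dtype=np.float32).reshape(8, 8, 2)
fused = pyramid_pool(features)
```

With three scales and a 2-channel input, the fused feature has 6 channels: the first two are the untouched local features, and the later ones carry progressively more global context.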
[0042] To improve the model's running speed on the device, the model can be pruned by replacing the convolutional layers in the encoder with grouped convolutions, where each convolutional kernel convolves only one channel, thereby reducing the model's computational load and improving processing speed.
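The saving from this replacement can be estimated by counting weights. The helper below is illustrative only; the channel count of 64 and kernel size of 3 are assumed values, not figures from the text.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights in a 2-D convolution layer (bias ignored)."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (in_channels // groups) * out_channels * kernel_size * kernel_size

# Standard convolution: every kernel spans all 64 input channels.
standard = conv_weight_count(64, 64, 3)

# Grouped convolution with groups == channels: every kernel spans one channel.
per_channel = conv_weight_count(64, 64, 3, groups=64)

reduction = standard // per_channel
```

For this assumed layer the weight count drops from 36864 to 576, a 64-fold reduction, which is the mechanism behind the reduced computational load.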
[0043] To better suppress covariate shift and enhance the network's ability to fit data, an instance normalization layer is added after the convolutional layers of the encoder and decoder to normalize the features, thereby improving the shadow removal effect.
[0044] When the image to be processed has a high resolution or a large amount of data, directly feeding the image to be processed into the trained neural network can cause memory overflow or excessive processing time, affecting the user experience. To solve this problem, conventional interpolation scaling algorithms can be used, but they easily lose image information, so the generated image cannot be enlarged back to the original resolution without loss of detail.
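Assuming the layer described above corresponds to what is commonly called instance normalization, its core computation can be sketched as follows. The `(N, C, H, W)` layout and the `eps` value are assumptions, and the learnable scale and shift found in many implementations are left out.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize every channel of every sample over its own spatial dimensions.

    x has shape (N, C, H, W); statistics are never shared across samples.
    """
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

Because the statistics are computed per image, a globally darker or brighter photograph yields the same normalized features, which is one plausible reason such a layer helps with inputs of widely varying illumination.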
[0045] Considering that shadow areas typically lack significant gradient information, in an optional embodiment, an image pyramid algorithm can be used to first downsample the image to be processed, while simultaneously preserving the gradient information of each layer to form a Laplacian pyramid. The layer with the smallest pyramid size is then fed into a trained neural network to obtain the output image. Finally, the Laplacian pyramid is used to reconstruct the output image. Since the gradient information in shadow areas is very weak, the reconstruction process, even if it restores some gradient information from the original image, will not affect the shadow removal effect. By utilizing the gradient information of each layer preserved during downsampling for image reconstruction, shadow removal can be achieved without affecting image resolution. By introducing downsampling and image reconstruction, image processing speed is ensured while maintaining image quality before and after processing, which is beneficial for processing high-resolution images on devices with limited computing power, such as mobile phones.
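The downsample, process, and reconstruct loop can be sketched in NumPy as below. This is a simplified illustration: 2x2 averaging and nearest-neighbour upsampling replace the Gaussian filtering a production image pyramid would typically use, `network` is any hypothetical callable acting on the smallest layer, and the image sides are assumed to be divisible by 2 raised to the number of levels.

```python
import numpy as np

def down2(x):
    """Halve an (H, W, C) image by averaging 2 x 2 blocks."""
    h, w = x.shape[:2]
    return x.reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def up2(x):
    """Double an (H, W, C) image by nearest-neighbour repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def build_pyramid(image, levels):
    """Return the smallest layer plus the detail (Laplacian) residual of each level."""
    residuals, current = [], image
    for _ in range(levels):
        smaller = down2(current)
        residuals.append(current - up2(smaller))   # detail lost by this downsampling step
        current = smaller
    return current, residuals

def reconstruct(small, residuals):
    """Rebuild from low to high resolution, adding the saved detail back at each level."""
    current = small
    for residual in reversed(residuals):
        current = up2(current) + residual
    return current

def process_high_res(image, network, levels=2):
    small, residuals = build_pyramid(image, levels)
    return reconstruct(network(small), residuals)
```

If `network` returns its input unchanged, the reconstruction equals the original image, which shows that only the low-frequency layer is modified while the stored detail is restored as-is.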
[0046] As shown in Figure 3, in order to obtain the trained neural network, the image processing method further includes:
[0047] S300: Construct the initial neural network;
[0048] S302: Train the initial neural network using sample data to obtain a trained neural network, wherein the sample data includes real-world images and synthetic shadow images, and the synthetic shadow images are synthesized from pure shadow images and shadowless images.
[0049] Users capture images with a wide variety of shadow types. Shadows can be distinguished by their edges: sharp, clear edges when the light source is close to the background, and blurred, gradual edges when the light source is far from the background. Shadows also take on different colors depending on the light source (e.g., warm light with a reddish-yellow tint, cool light with a bluish tint, and sunlight). Because the trained network must cope with all of these characteristics, the sample data used to train the initial neural network plays a crucial role in the entire image processing method. There are two main methods for acquiring sample data: real-world scene capture and image synthesis.
[0050] In the real-scene data collection method, the data collectors select the corresponding lighting environment and shooting object according to the scene category (e.g., different lighting scenes, warm light, cool light, sunlight, etc.), fix the shooting device such as mobile phone or camera with a tripod, adjust the appropriate lighting direction and focus, use the palm, mobile phone or other common objects as a blocking object to block the light, form a shadow on the shooting object and take a picture to obtain a shadow image, and then remove the blocking object and take a picture again to obtain a background image without shadow. In this way, paired sample data are obtained.
[0051] However, real-world data collection often fails to guarantee high-quality sample data. On the one hand, due to changes in light caused by occlusion, the background and shadow images will have differences in brightness and color in non-shadow areas, and the shadow image is difficult to align perfectly with the background image. On the other hand, due to changes in light or focus, noise will be generated in the shadow and background images, all of which will have a significant impact on the training of the network.
[0052] To address this, image synthesis methods can be used to generate realistic synthetic shadow maps for training neural networks.
[0053] In an optional embodiment, the image synthesis method includes:
[0054] S400: Obtain pure shadow map;
[0055] In one optional embodiment, the data collector lays a piece of white paper on a table under a preset lighting environment and uses his palm, mobile phone or other common objects to block the light, leaving a pure shadow image S on the white paper, wherein all or part of the pure shadow image S is a shadow area.
[0056] Because the non-shaded areas on a white paper may not appear pure white when acquiring a pure shadow image, the boundary between the non-shaded and shaded areas may not be clear. Therefore, in another optional embodiment, the pure shadow image can be transformed, for example, S' = min(a, S / mean(S)*a), where a is a positive integer. Through the above transformation, the pixel values of the non-shaded areas in the transformed pure shadow image can be uniformly set to a fixed value a (e.g., 255), while the pixel values of the shaded areas are values between 0 and a, resulting in a clearer boundary between the non-shaded and shaded areas in the pure shadow image.
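The transformation S' = min(a, S / mean(S)*a) can be tried on a small array. The 4 x 4 values below are made up for illustration: an off-white sheet of paper with a darker shadow patch in the middle.

```python
import numpy as np

def normalize_pure_shadow(shadow, a=255):
    """Apply S' = min(a, S / mean(S) * a) element-wise."""
    shadow = shadow.astype(np.float64)
    return np.minimum(a, shadow / shadow.mean() * a)

# Illustrative capture: paper reads 240-250 instead of pure white, shadow reads 80-120.
shadow = np.array([[240, 250, 245, 250],
                   [240, 120, 100, 250],
                   [245,  90,  80, 245],
                   [250, 245, 240, 250]])
transformed = normalize_pure_shadow(shadow)
```

Every paper pixel is brighter than the image mean, so after scaling it exceeds `a` and is clipped to exactly 255, while the shadow pixels stay strictly below 255. The sketch assumes the image is not entirely black, since a zero mean would cause a division by zero.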
[0057] S402: Obtain a shadowless image;
[0058] In one alternative embodiment, the data acquisition personnel take shadowless images B of various subjects under the same lighting conditions described above;
[0059] S404: Obtain a composite shadow map based on a pure shadow map and a shadowless map;
[0060] In one alternative embodiment, the pure shadow map S (or the transformed pure shadow map S') is multiplied pixel by pixel with the shadowless map B to obtain a composite shadow map.
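A minimal sketch of the pixel-by-pixel multiplication is given below. Dividing the shadow map by `a` before multiplying is an assumption made here so that non-shadow pixels (value `a`) leave the background unchanged; the text itself only states that the two images are multiplied.

```python
import numpy as np

def composite_shadow(shadow_map, shadow_free, a=255):
    """Darken an (H, W, 3) uint8 shadowless image by an (H, W) shadow map with values in [0, a]."""
    attenuation = shadow_map.astype(np.float64) / a     # 1.0 outside the shadow, below 1.0 inside
    if attenuation.ndim == 2:
        attenuation = attenuation[..., None]            # broadcast over the colour channels
    out = attenuation * shadow_free.astype(np.float64)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)
```

Because the attenuation varies continuously between 0 and 1, soft shadow edges in the captured shadow map carry over directly into the composite.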
[0061] This image synthesis method takes into account the weakening effect of shadows on light, and can better handle shadows with smooth edge transitions, resulting in a strong sense of realism.
[0062] Since the sample data is a mixture of real-world images and synthetic shadow images, the initial neural network also includes a module for classifying the sample data. When the sample data input to the initial neural network is determined to be a real-world image, the labeled data (Ground Truth, GT) is a shadow-free image captured from the real scene. Because the shadow region mask of the real-world image cannot be adjusted, the parameters of the second-level network 22 can be adjusted based on the difference between the shadow-free image output by the initial neural network and the shadow-free image used as labeled data GT. When the sample data input to the initial neural network is determined to be a synthetic shadow image, the labeled data (Ground Truth, GT) includes shadow-free images and pure shadow images captured from the real scene. The parameters of the first-level network 20 are adjusted based on the difference between the shadow region mask and the pure shadow image, and the parameters of the second-level network 22 are adjusted based on the difference between the shadow-free image output by the initial neural network and the shadow-free image used as labeled data. By using mixed data as sample data for training, accurate masks can be obtained for shadows with smooth transitions, ensuring the quality of mask segmentation and improving the effect of shadow removal.
[0063] In an optional embodiment, the method for acquiring sample data may further include performing one or more of the following processes on the acquired sample data: random flipping, rotation, color temperature adjustment, channel swapping, adding random noise, etc., to enrich the sample data and increase the robustness of the network.
[0064] In an optional embodiment, when supervising the training of the initial neural network, the loss function includes at least one of the following: pixel loss, feature loss, structural similarity loss, and adversarial loss.
[0065] The pixel loss function measures the similarity between two images at the pixel level and consists mainly of a pixel value loss and a gradient loss. In this embodiment, it refers to the weighted sum of two terms computed between the output image of the initial neural network and the label image: the mean square error of their pixel values and the L1 norm error of their gradients. Pixel loss supervises the training process at the pixel level, ensuring that the pixel values of each pixel in the output image of the initial neural network and the label image are as close as possible. To guide the initial neural network to focus on the differences between the shadow layer and the background layer in the shadow region rather than the noise of the entire image, in an optional embodiment, a pixel truncation loss can be introduced. This truncates the pixel loss, calculating the loss of two pixels only when the absolute difference between two pixels is greater than a given threshold; otherwise, the difference between the two pixels is ignored. Adding pixel truncation loss guides the network to focus on the shadow region, suppressing image noise, thus enhancing the shadow removal effect and significantly accelerating the network's convergence speed.
[0066] Feature loss primarily refers to the weighted sum of the L1 norm errors of the corresponding features of the output image of the initial neural network and the label image. In one optional embodiment, a VGG19 network pre-trained on the ImageNet dataset is used as a feature extractor. The output image and label image of the initial neural network are fed into this feature extractor to obtain the features of each layer of VGG19. Then, the L1 norm errors of the corresponding features of the output and label images are calculated and summed with weights. The features of each layer of VGG19 are not sensitive to image details and noise, and have good semantic properties. Therefore, even if there are defects such as noise or misalignment in the input and output images, the feature loss can still accurately capture the meaningful differences in shadow regions, compensating for the sensitivity of pixel loss to noise and exhibiting good stability.
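The weighted sum of pixel value error and gradient error might look like the following NumPy sketch. The weights `w_value` and `w_grad` and the use of forward differences as the image gradient are assumptions, since the text does not specify them.

```python
import numpy as np

def image_gradients(img):
    """Forward differences along height and width of an (H, W) or (H, W, C) array."""
    return np.diff(img, axis=0), np.diff(img, axis=1)

def pixel_loss(output, label, w_value=1.0, w_grad=1.0):
    """Weighted sum of the MSE of pixel values and the L1 error of image gradients."""
    output = output.astype(np.float64)
    label = label.astype(np.float64)
    mse = np.mean((output - label) ** 2)
    gy_o, gx_o = image_gradients(output)
    gy_l, gx_l = image_gradients(label)
    grad_l1 = np.mean(np.abs(gy_o - gy_l)) + np.mean(np.abs(gx_o - gx_l))
    return w_value * mse + w_grad * grad_l1
```

A uniform brightness offset between output and label is penalized only by the value term, because a constant offset leaves the gradients unchanged.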
[0067] The structural similarity loss function measures the similarity between two images based on their global features. In this embodiment, it mainly refers to the global difference in brightness and contrast between the output image and the label image of the initial neural network. Adding this loss function can effectively suppress color casts in the network output and improve the overall image quality.
[0068] Adversarial loss primarily refers to the loss value between the discriminator's output and the true category of the output image. In the later stages of training, when the difference between the initial neural network's output image and the label image becomes smaller, the effects of pixel loss, feature loss, and structural similarity loss gradually diminish, and network convergence slows down. At this point, a discriminator network is trained simultaneously to assist the network's training. First, the output image and label image of the initial neural network are fed into the discriminator. The discriminator determines whether the output image is the label image, calculates the loss based on the discriminator's output and the true category of the output image, and updates the discriminator parameters. Then, the discriminator's judgment of the output image is used as the loss of the output image's realism, and this loss is used to update the discriminator's parameters. When the discriminator can no longer distinguish between the initial neural network's output image and the label image, training ends. Adversarial loss can effectively eliminate image side effects caused by network processing (e.g., inconsistencies in color between shadow and non-shadow areas, shadow persistence, etc.), improving the realism of the network's output image.
[0069] Threshold truncation loss. Due to the influence of lighting, paired data collected from real-world scenes may exhibit slight differences in brightness and color in non-shadow areas. These differences are acceptable to users and do not require processing. Therefore, during training, to prevent the network from focusing on these small global differences, this method introduces a threshold truncation loss: the difference between the network's output and the ground truth (GT) contributes to the overall loss, and thus to the gradients used to update the parameters, only if it is greater than a given threshold; otherwise, the loss is treated as zero. This loss function tolerates small differences between the network's output and the GT, shifting the network's learning focus to areas with larger differences, thereby effectively improving the network's ability to eliminate more obvious shadows.
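A minimal Python sketch of such a truncated pixel loss is given below. The function name `truncated_l1_loss`, the use of an absolute (L1) difference, and the mean reduction are assumptions made for illustration; the threshold value is arbitrary.

```python
import numpy as np

def truncated_l1_loss(output, label, threshold):
    """Mean absolute error in which differences at or below `threshold` count as zero."""
    diff = np.abs(output.astype(np.float64) - label.astype(np.float64))
    kept = np.where(diff > threshold, diff, 0.0)  # small differences are ignored
    return float(kept.mean())

# Toy usage: only the third pixel differs by more than the threshold.
out = np.array([0.50, 0.52, 0.90])
gt = np.array([0.50, 0.50, 0.50])
loss = truncated_l1_loss(out, gt, threshold=0.05)
```

Here the 0.02 difference is tolerated and only the 0.4 difference contributes, so the network's attention would be directed to the third pixel.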
[0070] Shadow edge loss. First, the shadow region mask is dilated to obtain a dilated image; second, the shadow region mask is eroded to obtain an eroded image; then, the difference between the dilated and eroded images is taken as the boundary region between shadow and non-shadow areas, and a total variation loss (TVLoss) is applied to smooth it, which yields a smooth transition between shadow and non-shadow areas.
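The dilate, erode, and subtract steps, followed by a total variation term restricted to the resulting band, can be sketched as follows. This is an illustration under stated assumptions: a binary mask, a 3x3 structuring element, a single-channel output image, and normalization by the band area are all choices made here and are not specified in this disclosure. The function names are hypothetical.

```python
import numpy as np

def _neighbourhood(mask, pad_value):
    """Stack the 3x3 neighbourhood of every pixel along a new leading axis."""
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=pad_value)
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def dilate(mask):
    return _neighbourhood(mask, 0).max(axis=0)

def erode(mask):
    # Padding with 1 means the image border itself does not erode the mask.
    return _neighbourhood(mask, 1).min(axis=0)

def boundary_band(mask):
    """Dilated mask minus eroded mask: 1 on the shadow boundary, 0 elsewhere."""
    return dilate(mask).astype(np.float64) - erode(mask).astype(np.float64)

def shadow_edge_loss(output, shadow_mask):
    """Total variation of `output`, counted only inside the boundary band."""
    band = boundary_band(shadow_mask)
    dx = np.abs(np.diff(output, axis=1))  # horizontal neighbour differences
    dy = np.abs(np.diff(output, axis=0))  # vertical neighbour differences
    tv = (dx * band[:, :-1]).sum() + (dy * band[:-1, :]).sum()
    return float(tv / max(band.sum(), 1.0))

# Toy usage: a 3x3 shadow in a 7x7 image.
mask = np.zeros((7, 7), dtype=np.uint8)
mask[2:5, 2:5] = 1
band = boundary_band(mask)
flat = shadow_edge_loss(np.full((7, 7), 0.5), mask)
stepped = shadow_edge_loss(mask.astype(np.float64), mask)
```

A perfectly flat output has no variation in the band, while an output with a hard step along the shadow outline is penalized.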
[0071] The shadow brightness loss ensures that the brightness difference between the region corresponding to the shadow area in the deshaded image output by the neural network and the shadow area in the input image is greater than 0, thereby improving the brightness of the region corresponding to the shadow area in the deshaded image.
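One possible way to express this constraint is a hinge-style penalty, sketched below in Python. The hinge form, the Rec. 601 luma weights used as the brightness measure, the `margin` parameter, and the function names are all assumptions for illustration; this disclosure only states that the brightness difference should be greater than 0.

```python
import numpy as np

def luminance(rgb):
    """Rec. 601 luma of an H x W x 3 RGB array (assumed brightness measure)."""
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114

def shadow_brightness_loss(output_rgb, input_rgb, shadow_mask, margin=0.0):
    """Penalize shadow pixels whose brightness gain does not exceed `margin`."""
    gain = luminance(output_rgb) - luminance(input_rgb)
    penalty = np.maximum(margin - gain, 0.0) * shadow_mask
    return float(penalty.sum() / max(shadow_mask.sum(), 1))

# Toy usage: a 2x2 image whose left column is in shadow.
inp = np.full((2, 2, 3), 0.2)
mask = np.array([[1.0, 0.0], [1.0, 0.0]])
brightened = shadow_brightness_loss(np.full((2, 2, 3), 0.6), inp, mask)
darkened = shadow_brightness_loss(np.full((2, 2, 3), 0.1), inp, mask)
```

An output that brightens the shadow area incurs no penalty, whereas an output that darkens it is penalized in proportion to the brightness drop.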
[0072] In an optional embodiment, the background layer output module of the initial neural network uses a weighted sum of all the above losses as the total loss, and the adversarial loss follows the Wasserstein generative adversarial network (WGAN) formulation.
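A weighted total loss with Wasserstein-style adversarial terms can be sketched as follows. The weights, the dictionary keys, and the function names are hypothetical; the sketch also omits the Lipschitz constraint (weight clipping or a gradient penalty) that a Wasserstein critic requires in practice.

```python
import numpy as np

def wasserstein_critic_loss(critic_on_label, critic_on_output):
    """The critic is trained to score label images higher than network outputs."""
    return float(np.mean(critic_on_output) - np.mean(critic_on_label))

def wasserstein_generator_loss(critic_on_output):
    """The shadow-removal network is trained to raise the critic's score."""
    return float(-np.mean(critic_on_output))

def total_loss(terms, weights):
    """Weighted sum of named loss terms; every term must have a weight."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items())

# Toy usage with made-up loss values and weights.
terms = {
    "pixel": 0.2,
    "feature": 0.5,
    "adversarial": wasserstein_generator_loss(np.array([1.0, 3.0])),
}
weights = {"pixel": 1.0, "feature": 0.5, "adversarial": 0.01}
total = total_loss(terms, weights)
critic = wasserstein_critic_loss(np.array([2.0, 4.0]), np.array([1.0, 3.0]))
```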
[0073] This network structure extracts global and local features from the input image, improving the degree of shadow removal while protecting non-shadow areas from side effects.
[0074] Figures 5(a) and 5(b) are comparison images of the processing effect achieved by the image processing method of the present invention. Figure 5(a) is the image to be processed containing shadows, and Figure 5(b) is the image after the image processing method has removed the shadows. The comparison of the two images shows that the image processing method provided by the present invention can effectively eliminate shadows without producing significant side effects on the background layer.
[0075] The neural network structure and loss function used in the embodiments of the present invention can also be applied to application scenarios such as shadow removal, rain removal, and fog removal. It is mainly used to process high-resolution images captured by mobile terminals such as mobile phones, but it is also applicable to processing images of various resolutions on PCs or other embedded devices.
[0076] According to another aspect of the present invention, an electronic device is also provided, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the image processing method of any of the above-described embodiments by executing the executable instructions.
[0077] According to another aspect of the present invention, a storage medium is also provided, the storage medium including a stored program, wherein, when the program is running, it controls the device where the storage medium is located to execute the image processing method described above.
[0078] According to another aspect of the present invention, an image processing apparatus is also provided. Figure 6 is a structural block diagram of an optional image processing apparatus according to an embodiment of the present invention. As shown in Figure 6, the image processing apparatus 60 includes an image acquisition unit 600 and a processing unit 602.
[0079] The various units included in the image processing apparatus 60 will now be described in detail.
[0080] The image acquisition unit 600 is used to acquire the image to be processed, which includes the shadow area.
[0081] The processing unit 602 is used to receive the image to be processed and process the image to be processed using a trained neural network to obtain a deshaded image. The neural network includes a first-level network and a second-level network cascaded together. The image to be processed and the output image of the first-level network are simultaneously input into the second-level network.
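The data flow of the cascade can be sketched in Python as follows. The two networks are represented by placeholder callables, and joining the image and the first-level output along the channel axis is an assumption about how the two inputs are supplied "simultaneously"; `remove_shadow`, `toy_first_level`, and `toy_second_level` are hypothetical names and the toy functions are stand-ins, not trained models.

```python
import numpy as np

def remove_shadow(image, first_level, second_level):
    """Run the cascade: the second level sees both the image and the first-level output."""
    mask = first_level(image)                                    # H x W
    stacked = np.concatenate([image, mask[..., None]], axis=-1)  # H x W x 4
    return second_level(stacked)

def toy_first_level(image):
    """Stand-in for the first-level network: marks dark pixels as shadow."""
    return (image.mean(axis=-1) < 0.5).astype(image.dtype)

def toy_second_level(stacked):
    """Stand-in for the second-level network: brightens the masked pixels."""
    rgb, mask = stacked[..., :3], stacked[..., 3:]
    return np.clip(rgb + 0.4 * mask, 0.0, 1.0)

# Toy usage: one dark pixel in an otherwise bright 2x2 image.
image = np.full((2, 2, 3), 0.9)
image[0, 0] = 0.2
result = remove_shadow(image, toy_first_level, toy_second_level)
```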
[0082] In an optional embodiment, the structure of the neural network is as shown in Figure 2 and as described in the relevant passages above, which are not repeated here.
[0083] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0084] In the above embodiments of the present invention, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0085] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units may be a division by logical function, and other divisions are possible in actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
[0086] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0087] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0088] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.
[0089] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. An image processing method, comprising: obtaining an image to be processed that contains a shadow area; and inputting the image to be processed into a trained neural network to obtain a shadow-removed image; wherein the neural network comprises a first-level network and a second-level network cascaded together, the first-level network receives the image to be processed and outputs a shadow region mask map, and the second-level network simultaneously receives the image to be processed and the shadow region mask map and outputs the shadow-removed image; the method further comprising: constructing an initial neural network; and training the initial neural network using sample data to obtain the trained neural network, wherein the sample data includes real-world images and synthetic shadow images, and the synthetic shadow images are synthesized from pure shadow images and shadowless images using an image synthesis method; wherein synthesizing the synthetic shadow image from a pure shadow image and a shadowless image using the image synthesis method comprises: obtaining the pure shadow image, including, under a preset lighting environment, using an object to block light falling on a flat sheet of white paper so as to obtain the pure shadow image left on the white paper, wherein all or part of the pure shadow image is a shadow area; obtaining the shadowless image; and obtaining the synthetic shadow image based on the pure shadow image and the shadowless image; wherein the initial neural network further includes a module for classifying the sample data; when it is determined that the sample data input to the initial neural network is a real-world image, the labeled data is a shadow-free image captured from the real scene, and the parameters inside the second-level network are adjusted based on the difference between the shadow-removed image output by the initial neural network and the shadow-free image used as labeled data; and
when it is determined that the sample data input to the initial neural network is a synthetic shadow image, the labeled data includes the shadowless image and the pure shadow image captured from the real scene, the parameters inside the first-level network are adjusted based on the difference between the shadow region mask map and the pure shadow image, and the parameters inside the second-level network are adjusted based on the difference between the shadow-removed image output by the initial neural network and the shadowless image.
2. The image processing method according to claim 1, characterized in that the first-level network includes: a first feature extraction module including a first encoder for extracting features of the image to be processed layer by layer to obtain a first set of feature data; and a shadow region estimation module connected to the output of the first feature extraction module and including a first decoder for estimating the shadow region based on the first set of feature data and outputting the shadow region mask map.
3. The image processing method according to claim 1, characterized in that the second-level network includes: a second feature extraction module including a second encoder connected to the output of the first-level network, the second encoder receiving the image to be processed together with the shadow region mask map output by the first-level network to obtain a second set of feature data; and a result image output module connected to the output of the second feature extraction module and including a second decoder for outputting the shadow-removed image based on the second set of feature data.
4. The image processing method according to claim 2 or 3, characterized in that the output of each layer of the first decoder or the second decoder is concatenated along the channel axis with the output of the corresponding layer of the first encoder or the second encoder through cross-layer connections; a multi-scale pyramid pooling module is added to the cross-layer connections between the first decoder or the second decoder and the first encoder or the second encoder; and the multi-scale pyramid pooling module fuses features of different scales.
5. The image processing method according to claim 1, characterized in that, after acquiring the image to be processed that contains the shadow area, the image processing method further includes: downsampling the image to be processed using an image pyramid algorithm, and saving the gradient information of each layer during downsampling to form a Laplacian pyramid; feeding the smallest layer into the trained neural network to obtain an output image; and reconstructing the output image from low resolution to high resolution using the Laplacian pyramid to obtain the shadow-removed image.
6. The image processing method according to claim 1, characterized in that synthesizing the synthetic shadow image from a pure shadow image and a shadowless image using the image synthesis method further includes: transforming the pure shadow image, and obtaining the synthetic shadow image based on the transformed pure shadow image and the shadowless image, wherein the pixel values of the non-shadow areas in the transformed pure shadow image are uniformly set to a fixed value 'a', and the pixel values of the shadow areas are values between 0 and 'a', where 'a' is a positive integer.
7. The image processing method according to claim 1, characterized in that, When training the initial neural network using sample data, the loss function includes at least one of the following: pixel loss, feature loss, structural similarity loss, adversarial loss, shadow edge loss, and shadow brightness loss.
8. The image processing method according to claim 7, characterized in that, The pixel loss includes pixel truncation loss. When the absolute difference between two corresponding pixels in the output image of the initial neural network and the label image is greater than a given threshold, the loss of the two pixels is calculated; when the absolute difference between two corresponding pixels in the output image of the initial neural network and the label image is not greater than the given threshold, the difference of the two pixels is ignored.
9. The image processing method according to claim 7, characterized in that the shadow brightness loss makes the difference between the brightness of the region corresponding to the shadow area in the shadow-removed image output by the neural network and the brightness of the shadow area in the input image to be processed greater than 0, thereby increasing the brightness of the region corresponding to the shadow area in the shadow-removed image.
10. The image processing method according to claim 7, characterized in that, when the loss function includes the shadow edge loss, the image processing method includes: performing dilation processing on the shadow region mask to obtain a dilated map; performing erosion processing on the shadow region mask to obtain an eroded map; and obtaining the difference between the dilated map and the eroded map as the boundary region between shadow and non-shadow areas, and smoothing it using TVLoss.
11. An image processing apparatus, comprising: an image acquisition unit configured to acquire an image to be processed that contains a shadow area; a processing unit configured to receive the image to be processed and process it using a trained neural network to obtain a shadow-removed image, wherein the neural network comprises a first-level network and a second-level network cascaded together, the first-level network receives the image to be processed and outputs a shadow region mask map, and the second-level network simultaneously receives the image to be processed and the shadow region mask map and outputs the shadow-removed image; and a training module configured to construct an initial neural network and train the initial neural network using sample data to obtain the trained neural network, wherein the sample data includes real-world images and synthetic shadow images, and the synthetic shadow images are synthesized from pure shadow images and shadowless images using an image synthesis method; wherein synthesizing the synthetic shadow image from a pure shadow image and a shadowless image using the image synthesis method comprises: obtaining the pure shadow image, including, under a preset lighting environment, using an object to block light falling on a flat sheet of white paper so as to obtain the pure shadow image left on the white paper, wherein all or part of the pure shadow image is a shadow area; obtaining the shadowless image; and obtaining the synthetic shadow image based on the pure shadow image and the shadowless image; wherein the initial neural network further includes a module for classifying the sample data; when it is determined that the sample data input to the initial neural network is a real-world image, the labeled data is a shadow-free image captured from the real scene, and the parameters inside the second-level network are adjusted based on the difference between the shadow-removed image output by the initial neural network and the shadow-free image used as labeled data; and
when it is determined that the sample data input to the initial neural network is a synthetic shadow image, the labeled data includes the shadowless image and the pure shadow image captured from the real scene, the parameters inside the first-level network are adjusted based on the difference between the shadow region mask map and the pure shadow image, and the parameters inside the second-level network are adjusted based on the difference between the shadow-removed image output by the initial neural network and the shadowless image.
12. The image processing apparatus according to claim 11, characterized in that the first-level network includes: a first feature extraction module including a first encoder for extracting features of the image to be processed layer by layer to obtain a first set of feature data; and a shadow region estimation module connected to the output of the first feature extraction module and including a first decoder for estimating the shadow region based on the first set of feature data and outputting the shadow region mask map.
13. The image processing apparatus according to claim 11, characterized in that the second-level network includes: a second feature extraction module including a second encoder connected to the output of the first-level network, the second encoder receiving the image to be processed together with the shadow region mask map output by the first-level network to obtain a second set of feature data; and a result image output module connected to the output of the second feature extraction module and including a second decoder for outputting the shadow-removed image based on the second set of feature data.
14. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program is executed, it controls the device containing the storage medium to perform the image processing method according to any one of claims 1 to 10.
15. An electronic device, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the image processing method of any one of claims 1 to 10 by executing the executable instructions.