A semantic-aware image shadow detection method

The shadow detection network, built on the Swin Transformer, combines shadow shape semantics with multi-task learning to address the incomplete detections that CNNs produce in ambiguous cases, achieving efficient and accurate shadow detection.

CN115661505B · Active · Publication Date: 2026-06-12 · HANGZHOU DIANZI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HANGZHOU DIANZI UNIV
Filing Date
2022-09-07
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing CNN-based shadow detection methods perform poorly in ambiguous cases, struggle to adapt to variations in the shape, size, and texture of shadow regions, and lack semantic interaction information, resulting in incomplete detection and a high false positive rate.

Method used

A shadow detection network is constructed using the Swin Transformer. Shadow shape semantics are combined with a multi-task learning framework for deep supervision. Low-level features are used to detect details, and high-level features are used to distinguish shadow categories. Multi-scale prediction maps are fused for detection.

Benefits of technology

It improves the accuracy and efficiency of shadow detection, effectively overcomes ambiguous cases, achieves complete and fine-grained detection of shadow areas, and enhances detection performance.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a semantic-aware image shadow detection method, which takes a shadow image as input and performs end-to-end shadow mask prediction. It includes three parts: constructing a shadow detection network, making a semantic annotation set, and realizing multi-task learning. Specifically, a shadow detection network based on Swin Transformer is constructed to learn global and long-range information interaction, and a shadow multi-scale prediction map is fused to ensure the completeness and fine granularity of the detection result. Then, the shadow image GT is semantically annotated using a public dataset to obtain semantic labels. Finally, a multi-task learning framework combining shadow supervision and semantic supervision is designed, which skillfully utilizes multi-scale feature information of the image to robustly learn shadow knowledge. After training, an efficient shadow detection network with a parameter size of 24.37M is obtained, which can effectively avoid the interference of ambiguous areas and overcome the limitations of existing shadow detection methods.

Description

Technical Field

[0001] This invention belongs to the field of object detection technology and specifically relates to a semantic-aware image shadow detection method.

Background Technology

[0002] Shadows are common in real-world scenes, created by objects (such as people, animals, and buildings) obscuring light sources. In some visual scenarios, shadows can provide valuable clues for scene understanding, such as light source direction, object geometry, and camera parameters. However, in some visual tasks, the presence of shadows degrades model performance, necessitating early detection and removal. For example, shadow detection and removal in text and remote sensing images can enhance readability and recognizability. Furthermore, in other tasks such as image segmentation, object detection, and visual tracking, the presence of shadows can easily cause ambiguity, potentially leading to misidentification as objects. Therefore, accurate shadow detection is crucial for ensuring the accuracy of downstream visual tasks.

[0003] Traditional shadow detection methods are mainly based on handcrafted features, such as lighting, color, and texture, to build physical or machine learning models for shadow detection. These methods often suffer from performance degradation in real-world scenarios because handcrafted features lack sufficient discriminative power. In recent years, Convolutional Neural Networks (CNNs) have been successfully applied to various visual tasks due to their powerful feature representation capabilities. Currently, CNN-based shadow detection methods have become the mainstream in this field, achieving significant performance improvements. They typically employ two strategies: combining contextual information or expanding the training data. Analysis of the detection results of these methods on the public datasets ISTD and SBU reveals that most of the misdetected samples are ambiguous cases: (1) Shadow-like regions are similar in color to shadows and are often misclassified as shadows; (2) Shadow regions contain some heterogeneous backgrounds, forming relatively bright areas that weaken the shadow color, resulting in incomplete shadow detection results.

[0004] Some recent methods, such as MTMT-Net and FSDNet, have attempted to improve model performance by using additional training data. However, these methods are still affected by the aforementioned ambiguous cases because their models treat all detection cases equally. There are two possible reasons for the ambiguity: (1) The essence of shadow detection is to perform binary classification of pixels, while the shadow label (Ground Truth, GT) is only presented in the form of a shadow mask, lacking more prior knowledge about shadows, such as the shape category of the occluder, and therefore cannot adapt to ambiguous scenarios; (2) Since the spatial information extracted by convolution operations lacks semantic interaction, CNN-based shadow detection methods have significant limitations in long-range dependency modeling. Therefore, when the shape, size, or texture of the shadow region changes significantly, these methods usually exhibit weak performance.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, this invention proposes a semantically aware image shadow detection method that combines shadow shape semantics to overcome the influence of ambiguous regions and improve the accuracy and efficiency of image shadow detection.

[0006] A semantically aware image shadow detection method specifically includes the following steps:

[0007] Step 1: Construct a shadow detection network based on the Swin Transformer.

[0008] The shadow detection network has an end-to-end architecture, including an encoder and a decoder.

[0009] Step 1.1: Build the encoder

[0010] Using the Swin Transformer as the backbone, a four-layer network is constructed. Each layer uses two consecutive Swin Transformer Blocks to build a hierarchical feature map of the input image, and the feature resolution of each layer is adjusted in sequence to obtain the encoder.

[0011] Step 1.2: Build the decoder

[0012] Two consecutive Res-conv and a 1×1 convolution are connected after each side of the encoder, and the multi-scale prediction maps obtained from the sides are combined by shared concatenation to obtain the decoder.

[0013] Step 2: Perform semantic annotation on the ground truth (GT) of the shadow image.

[0014] First, the shadows in the images are classified into different categories based on the shapes of the occluders in the dataset. Then, different colors are used to represent these shadow categories, and the corresponding color masks are added to the ground truth to obtain a semantic label set.

[0015] Step 3: Deep Supervision Learning

[0016] A multi-task learning framework is constructed in the decoder to perform multi-task supervision on shadow feature maps of different scales obtained by the encoder, so as to obtain multi-scale shadow prediction maps, including shadow maps and semantic shadow maps.

[0017] Step 3.1, Shadow Supervision.

[0018] Low-level features contain image details and help detect fine shadows and shadow boundaries. Therefore, we use ground truth (GT) to supervise the shadow region of the feature maps generated by the first three layers of the encoder network, and generate detailed multi-scale shadow maps through single-channel 1×1 convolution.

[0019] Step 3.2, Semantic Supervision

[0020] High-level features contain image semantic information, which helps to distinguish between shadows and background, and further distinguish shadow categories. Therefore, semantic labels are used to perform semantic supervision on the semantic shadow map generated by the fourth layer network of the encoder, and the semantic shadow map is generated by a 1×1 convolution with K channels.

[0021] Step 3.3, Fusion Detection

[0022] The multi-scale shadow maps obtained in step 3.1 and the semantic shadow map obtained in step 3.2 are compressed and upsampled to the same resolution and then combined by shared concatenation. Semantic labels are used for supervision to obtain a fused semantic shadow map. Binarization is then performed to output the final shadow detection result.

[0023] The present invention has the following beneficial effects:

[0024] 1. The shadow detection network based on the Swin Transformer overcomes the limitations of CNNs, effectively learning global and long-range semantic information interactions. During detection, multi-scale shadow prediction maps are fused, resulting in more complete and granular detection results. Therefore, this method still exhibits good performance even when the shape, size, and texture of the shadow region change significantly. Furthermore, thanks to the relatively low computational complexity of the Swin Transformer, this method achieves efficient shadow detection.

[0025] 2. The multi-task learning strategy, designed around shadow shape semantics, overcomes the limitations of traditional GT-only training and gives shadow detection semantic awareness. The method has significant advantages in ambiguous cases that existing techniques struggle to detect accurately: it effectively overcomes the ambiguity caused by "non-shadow areas resembling shadows" and "shadow areas with non-shadow patterns", thereby significantly improving detection performance.

[0026] 3. In the multi-task learning framework designed with deep supervision, the top layer learns category-related semantic information to overcome ambiguity interference, while the bottom layers learn category-independent shadow information to supplement details for the top layer. Shared concatenation of the bottom-layer and top-layer prediction maps yields a more complete and fine-grained detection result. To coordinate the different learning tasks, the framework also embeds four information buffer units, which resolve the conflicts between network gradient signals caused by the different supervision tasks.

Attached Figure Description

[0027] Figure 1 is a flowchart of the semantic-aware image shadow detection method;

[0028] Figure 2 is a schematic diagram of the shadow detection network based on the Swin Transformer in the embodiment;

[0029] Figure 3 is a schematic diagram of the semantic annotation of the shadow GT in the embodiment;

[0030] Figure 4 shows the analysis results of the semantic label sets in the embodiment, where a and b are the shadow categories and their ratio distribution statistics for the two label sets, and c and d are the dependencies between categories in the two label sets;

[0031] Figure 5 is a schematic diagram of the shared concatenation of the multi-task learning framework in the embodiment;

[0032] Figure 6 is a schematic diagram of the shadow detection results in the embodiment.

Detailed Implementation

[0033] The present invention is further explained below with reference to the accompanying drawings.

[0034] As shown in Figure 1, a semantic-aware image shadow detection method takes a shadow image as input and performs end-to-end prediction of the shadow detection result. Specifically, it includes the following steps:

[0035] Step 1: Construct a shadow detection network based on the Swin Transformer.

[0036] As shown in Figure 2, the shadow detection network has an end-to-end architecture, including an encoder and a decoder.

[0037] Step 1.1: Build the encoder

[0038] Using the Swin Transformer as the backbone, a four-layer network is constructed, with each layer using two consecutive Swin Transformer Blocks to build a hierarchical feature map. The resolution of the features in each layer is then adjusted sequentially to obtain the encoder.

[0039] In the encoder, the input shadow image $I \in \mathbb{R}^{256 \times 256 \times 3}$ is first split into multiple non-overlapping patches by a patch partition layer. In this example, the patch size is set to 2×2, giving a feature dimension of 2×2×3 = 12 per patch. After the patch partition layer, image I is converted into an embedded sequence. A hierarchical feature map is then constructed in four stages by the four-layer encoder network. In the first stage, a linear embedding layer transforms the feature dimension, followed by representation learning with two consecutive Swin Transformer blocks (STB×2). In the second to fourth stages, a patch merging layer first performs downsampling, followed by feature transformation with STB×2. In STB×2, the first Swin Transformer block uses a window-based multi-head self-attention (W-MSA) module, which divides the patches into non-overlapping windows and computes self-attention within each window; the second Swin Transformer block uses a shifted-window multi-head self-attention (SW-MSA) module, which enables information interaction between windows.
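The four-stage hierarchy described above can be traced with a small shape calculation. This is a minimal sketch, not the patented implementation: the embedding width `embed_dim=96` and the doubling of channels at each patch merging step are assumptions borrowed from the original Swin Transformer design, since the text does not state them, and `encoder_stage_shapes` is a hypothetical helper name.

```python
def encoder_stage_shapes(height=256, width=256, in_channels=3,
                         patch_size=2, embed_dim=96, num_stages=4):
    """Return the per-patch feature dimension and the (H, W, C) shape after each stage."""
    # Patch partition: every patch_size x patch_size patch becomes one token.
    h, w = height // patch_size, width // patch_size
    patch_dim = patch_size * patch_size * in_channels  # 2 * 2 * 3 = 12 in the text
    shapes = []
    channels = embed_dim  # stage 1: linear embedding maps patch_dim -> embed_dim (assumed width)
    for stage in range(1, num_stages + 1):
        if stage > 1:
            # Patch merging: 2x2 neighbouring tokens are merged, halving H and W.
            # Doubling the channels follows the original Swin design (assumption).
            h, w, channels = h // 2, w // 2, channels * 2
        shapes.append((h, w, channels))
    return patch_dim, shapes


patch_dim, shapes = encoder_stage_shapes()
print("per-patch feature dimension:", patch_dim)
for stage, shape in enumerate(shapes, start=1):
    print("stage", stage, "feature map (H, W, C):", shape)
```

Under these assumptions the side outputs used later by the decoder have four different resolutions, which is why they must be upsampled before fusion.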

[0040] Step 1.2: Build the decoder

[0041] To improve detection efficiency, this application abandons the Swin-Unet decoder structure and instead utilizes the prediction results output from each stage of the encoder. Specifically, an information buffer unit (IB) consisting of two Res-conv is connected after each side of the encoder, and then a 1×1 convolution is used to obtain the shadow multi-scale prediction map.

[0042] Step 2: Perform semantic annotation on the ground truth (GT) of the shadow image.

[0043] This embodiment uses the publicly available datasets ISTD and SBU to create the semantic label sets. As shown in Figure 3, shadows are first categorized by the occluder types in the ISTD and SBU datasets, such as Person, Animal, Umbrella, Board, and Building. Masks of different colors are then used as semantic masks to distinguish shadow categories; that is, semantic masks are added to all ground truths to obtain the semantic label sets Sem-ISTD and Sem-SBU.

[0044] In this embodiment, the rule for labeling GT is as follows:

[0045] ① If an image contains multiple shadow categories and there are shadow masks of different types connected together, the boundaries of the mask are defined based on the prior knowledge of the occluder.

[0046] ② For shadow categories with the same shape but different sizes, such as rectangular occluders of different sizes in the ISTD dataset, they are classified into the same category because their shadow shapes are similar.

[0047] ③ Group the shadows of occluders with similar shapes into the same category. For example, motorcycles and bicycles in the SBU dataset are classified as "cycle".
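Rules ② and ③ amount to a many-to-one mapping from occluders to shadow categories, followed by coloring the GT mask. The sketch below illustrates that idea only. The occluder names, the assignment of rectangular occluders to "Board", and all colors are hypothetical; only the category names and the "cycle" grouping come from the text. Rule ① (splitting connected masks of different categories) needs manual boundary decisions and is not modeled here.

```python
# Hypothetical occluder-to-category table illustrating rules 2 and 3.
OCCLUDER_TO_CATEGORY = {
    "person": "Person",
    "animal": "Animal",
    "umbrella": "Umbrella",
    "small_rectangle": "Board",   # rule 2: same shape, different size
    "large_rectangle": "Board",
    "building": "Building",
    "motorcycle": "cycle",        # rule 3: similar shapes share a category
    "bicycle": "cycle",
}

# Hypothetical RGB colors; the text only says each category gets its own color.
CATEGORY_COLOR = {
    "Person": (255, 0, 0), "Animal": (0, 255, 0), "Umbrella": (0, 0, 255),
    "Board": (255, 255, 0), "Building": (255, 0, 255), "cycle": (0, 255, 255),
}
BACKGROUND = (0, 0, 0)


def semantic_label(gt_mask, occluder):
    """Color every shadow pixel of a binary GT mask by its shadow category."""
    color = CATEGORY_COLOR[OCCLUDER_TO_CATEGORY[occluder]]
    return [[color if pixel else BACKGROUND for pixel in row] for row in gt_mask]


label = semantic_label([[0, 1], [1, 0]], "bicycle")
print(label)
```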

[0048] The final Sem-ISTD and Sem-SBU contain 5 and 9 shadow categories, respectively. Sem-ISTD and Sem-SBU are analyzed further in Figure 4: panels a and b list the ratio distribution of each shadow category in Sem-ISTD and Sem-SBU, respectively, where the ratio is the proportion of images containing a given category relative to the total number of images in the dataset. Panels c and d show the dependencies between shadow categories in Sem-ISTD and Sem-SBU, respectively. Figure 4 shows that Sem-SBU has more shadow categories and more complex category dependencies than Sem-ISTD.
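The ratio statistic defined above is straightforward to compute from per-image category sets. The following sketch uses made-up toy data, not the real Sem-ISTD or Sem-SBU annotations. It also assumes that the "dependencies" between categories can be approximated by counting how often two categories appear in the same image; the text does not define that measure, so `co_occurrence` is an illustrative stand-in.

```python
from collections import Counter


def category_ratios(image_categories):
    """Fraction of images that contain each shadow category."""
    total = len(image_categories)
    counts = Counter()
    for categories in image_categories:
        counts.update(set(categories))  # count a category at most once per image
    return {category: n / total for category, n in counts.items()}


def co_occurrence(image_categories):
    """Number of images in which each pair of categories appears together."""
    pairs = Counter()
    for categories in image_categories:
        ordered = sorted(set(categories))
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs[(first, second)] += 1
    return pairs


# Toy annotations (hypothetical): one set of category names per image.
toy = [{"Person"}, {"Person", "cycle"}, {"Building"}, {"Person", "cycle"}]
ratios = category_ratios(toy)
pairs = co_occurrence(toy)
print(ratios)
print(dict(pairs))
```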

[0049] Step 3: Deep Supervision Learning

[0050] In the decoder, a multi-task learning framework is constructed based on deep supervision to supervise the shadow feature maps of different scales output by the encoder in multiple tasks. This combines shadow supervision and semantic supervision to make full use of the low-level and high-level image features extracted by the network.

[0051] Step 3.1, Shadow Supervision.

[0052] The outputs of the first to third layers of the encoder are passed through the information buffer units, and a single-channel 1×1 convolution then generates shadow maps of different scales, $S = \{S_1, S_2, S_3\}$. The shadow label GT, $Y = \{y_i : i = 1, 2, \ldots, |I|\}$, is used to supervise the shadow regions of the feature maps generated by the first three layers of the encoder network. The shadow supervision loss, designed on the basis of cross-entropy, is:

[0053]

[0054] Where W represents all network parameters, m = 1, 2, 3 is the index of the encoder side output, the activation value at pixel i is taken from the shadow map $S_m$, and P(·) is the Sigmoid activation function.
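The loss equation itself is not reproduced in this text, so the following is only a sketch of one plausible form: a per-pixel binary cross-entropy with a Sigmoid activation, summed over the three side outputs. The unweighted sum, the `eps` guard, and the assumption that each side output has already been upsampled to the GT resolution are all illustrative choices, and `shadow_supervision_loss` is a hypothetical name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shadow_supervision_loss(side_scores, gt, eps=1e-12):
    """Sum per-pixel binary cross-entropy over the side outputs S_1..S_3.

    side_scores: list of flat lists of raw scores, one list per side output m.
    gt: flat list of labels y_i in {0, 1}, same length as each side output.
    """
    loss = 0.0
    for scores in side_scores:            # m = 1, 2, 3
        for score, y in zip(scores, gt):  # i = 1, ..., |I|
            p = sigmoid(score)            # P(.) is the Sigmoid activation
            loss -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return loss


# Toy example: three side outputs that are all undecided (score 0) on two pixels.
toy_loss = shadow_supervision_loss([[0.0, 0.0]] * 3, [0, 1])
print(toy_loss)
```

With every score at zero, each pixel contributes log 2 to the loss, so the toy value is six times log 2.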

[0055] Step 3.2, Semantic Supervision

[0056] The output of the fourth layer of the encoder is passed through the information buffer unit, and a K-channel 1×1 convolution then generates the semantic shadow map $A_4$, where K is the number of shadow categories. The semantic labels $\{C_1, C_2, \ldots, C_K\}$ are used to supervise the semantic shadow map generated by the fourth layer of the encoder, where $C_k$ is the semantic label of the k-th shadow category. The corresponding semantic supervision loss is:

[0057]

[0058] Here, the activation term denotes the activation function value at pixel i for the k-th class.

[0059] Step 3.3: As shown in Figure 5, each channel of the semantic shadow map $A_4$ obtained in step 3.2 is combined with the multi-scale shadow maps S obtained in step 3.1 by shared concatenation (SC) to obtain a stacked shadow activation map $S_f$:

[0060]

[0061] K 1×1 convolutions are then used to fuse $S_f$ into a K-channel semantic shadow map. For the fused semantic shadow map, the semantic supervision loss is set as follows:

[0062]

[0063] where $S_f$ is the stacked shadow activation map in equation (3).

[0064] Binarizing the semantic shadow map yields the shadow mask, which is the final detection result. The shadow supervision loss and semantic supervision loss are combined, and the final supervision loss is set as follows:

[0065]
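The fusion in step 3.3 can be sketched on tiny nested-list feature maps. Several points are assumptions because the equations are not reproduced in this text: shared concatenation is read as pairing every channel of $A_4$ with all three shadow maps, each of the K 1×1 convolutions is modeled as a weighted sum over one such group of four maps, nearest-neighbor upsampling stands in for whatever upsampling the network uses, and a pixel is binarized as shadow when any class score passes a Sigmoid threshold of 0.5. The weights are hypothetical placeholders for learned parameters.

```python
import math


def upsample_nearest(fmap, factor):
    """Nearest-neighbor upsampling of a 2-D list by an integer factor."""
    return [[v for v in row for _ in range(factor)]
            for row in fmap for _ in range(factor)]


def shared_concatenation(shadow_maps, semantic_map):
    """Pair every semantic channel A_4^k with all shadow maps S_1..S_3.

    Returns K groups, each a list of 4 same-sized 2-D maps.
    """
    return [list(shadow_maps) + [channel] for channel in semantic_map]


def fuse(groups, weights):
    """One 1x1 convolution per group: a weighted sum over its 4 maps."""
    fused = []
    for maps, w in zip(groups, weights):
        rows, cols = len(maps[0]), len(maps[0][0])
        fused.append([[sum(wi * m[r][c] for wi, m in zip(w, maps))
                       for c in range(cols)] for r in range(rows)])
    return fused


def binarize(fused, threshold=0.5):
    """Shadow if any class passes the Sigmoid threshold (assumed rule)."""
    rows, cols = len(fused[0]), len(fused[0][0])
    return [[int(max(1 / (1 + math.exp(-ch[r][c])) for ch in fused) > threshold)
             for c in range(cols)] for r in range(rows)]


# Toy inputs: three 2x2 shadow maps and a K = 2 semantic map at half resolution.
s1 = s2 = s3 = [[2.0, -2.0], [2.0, -2.0]]
a4 = [upsample_nearest([[1.0]], 2), upsample_nearest([[-1.0]], 2)]
groups = shared_concatenation([s1, s2, s3], a4)
weights = [[0.25, 0.25, 0.25, 0.25]] * 2  # hypothetical learned weights
mask = binarize(fuse(groups, weights))
print(mask)
```

In this toy case the left column scores positive in both fused channels and the right column negative, so only the left column is marked as shadow.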

[0066] The network was trained for 40 and 60 iterations on the ISTD and SBU datasets, respectively. Data augmentation was performed using random horizontal flipping, color jittering, and blurring to increase data diversity. Stochastic Gradient Descent (SGD) was used to optimize all network parameters. The batch size was set to 16, the learning rate was set to 0.001, and the momentum and weight decay were set to 0.9 and 1e-4, respectively. The final network parameter size was 24.37M.
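To make the role of these hyperparameters concrete, here is a minimal sketch of one SGD update with momentum and weight decay using the values quoted above. It follows the common formulation in which weight decay is added to the gradient as an L2 term before the momentum buffer is updated; whether the embodiment's optimizer uses exactly this variant is an assumption, and `sgd_momentum_step` is a hypothetical helper.

```python
def sgd_momentum_step(params, grads, velocity,
                      lr=0.001, momentum=0.9, weight_decay=1e-4):
    """One in-place SGD update with momentum and L2 weight decay."""
    for i, (w, g) in enumerate(zip(params, grads)):
        g = g + weight_decay * w                  # weight decay as an L2 term
        velocity[i] = momentum * velocity[i] + g  # momentum buffer
        params[i] = w - lr * velocity[i]          # parameter update
    return params, velocity


# Toy example: a single parameter, starting with an empty momentum buffer.
params, velocity = sgd_momentum_step([1.0], [0.5], [0.0])
print(params, velocity)
```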

[0067] As shown in Figure 6, this method can effectively detect the two ambiguous cases mentioned in the background technology.

[0068] To verify the effectiveness of this method and compare its performance, this embodiment implements the network model using PyTorch 1.7.0 and Python 3.6, and trains it on a GeForce RTX 3090 GPU with 24GB of memory. On three publicly available datasets (ISTD, SBU, and UCF), it is compared with seven shadow detection methods: ScGAN, DSC, A+D Net, BDRAR, DSDNet, MTMT-Net, and FSDNet, with the Balanced Error Rate (BER) used as the evaluation metric.

[0069]

[0070] Here, TP and TN denote the numbers of true positives and true negatives, and P and N denote the numbers of shadow and non-shadow pixels, respectively. In the experiment, the lower the BER value, the better the shadow detection performance.
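The BER equation is not reproduced in this text. The sketch below uses the definition commonly applied in shadow detection work, BER = (1 − ½(TP/P + TN/N)) × 100, which is consistent with the symbols defined above but should be read as an assumption about the missing equation. The function expects flat binary masks and assumes both shadow and non-shadow pixels are present.

```python
def balanced_error_rate(pred, gt):
    """BER in percent for flat binary masks (1 = shadow, 0 = non-shadow)."""
    tp = sum(1 for p, y in zip(pred, gt) if p == 1 and y == 1)  # true positives
    tn = sum(1 for p, y in zip(pred, gt) if p == 0 and y == 0)  # true negatives
    n_pos = sum(gt)            # P: number of shadow pixels
    n_neg = len(gt) - n_pos    # N: number of non-shadow pixels
    return (1 - 0.5 * (tp / n_pos + tn / n_neg)) * 100


# Toy example: one shadow pixel found, one non-shadow pixel wrongly flagged.
ber = balanced_error_rate([1, 1, 0, 0], [1, 0, 0, 0])
print(ber)
```

Because the two class-wise accuracies are averaged, a large non-shadow region cannot mask errors on a small shadow region.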

[0071] The ISTD dataset contains 1870 shadow images, of which 1330 are used as the training set and 540 as the test set. It includes both shadow ground truth (GT) and non-shadow images with corresponding labels; this embodiment uses only the shadow GT. The SBU dataset contains 4727 pairs of shadow image / shadow GT, of which 4089 pairs are used as the training set and 638 pairs as the test set. The UCF dataset contains 110 images with a style similar to SBU; this embodiment uses these as the test set. During the experiment, the model was first trained on the SBU training set, and then tested on both the SBU and UCF test sets. For the semantic supervision task, the semantic label sets Sem-ISTD and Sem-SBU constructed in step 2 were used.

[0072] The results of the shadow detection experiment are shown in Table 1, where "FPS" represents the number of frames detected per second, "Para" represents the model parameter size, and "S" and "NS" represent the pixel error rates in shadow and non-shadow regions, respectively. "This method-" indicates that semantic supervision is not used, but deep supervision is employed.

[0073]

[0074] Table 1

[0075] It can be observed that our method achieves the best detection performance on all three datasets. DSDNet is a CNN-based network model specifically designed for ambiguous cases. In practical detection, however, DSDNet performs poorly when the shadow color is similar to the background, especially when the two similar regions are connected, because CNNs struggle to capture global and long-range semantic information interactions. Our method, with its Swin Transformer-based detection network, effectively addresses this issue. Both MTMT-Net and our method improve detection performance through multi-task learning. Compared to MTMT-Net, our method, by incorporating semantic supervision, reduces BER values by 11.05%, 4.13%, and 3.88% on the ISTD, SBU, and UCF datasets, respectively. Even without semantic supervision, the "This method-" variant achieves performance comparable to MTMT-Net through the deeply supervised Swin Transformer and multi-scale prediction fusion. Among all methods, FSDNet has the fewest model parameters but sacrifices inference accuracy. Although our method has more parameters than FSDNet, it still achieves efficient shadow detection at 76.23 FPS. Furthermore, the performance of this method on the UCF dataset demonstrates that the robust shadow detection network and multi-task learning strategy generalize well to new shadow scenes.

Claims

1. A semantically aware image shadow detection method, characterized in that it specifically includes the following steps:
Step 1: Construct a shadow detection network based on the Swin Transformer;
Step 1.1: Build the encoder. Using the Swin Transformer as the backbone, a 4-layer network is constructed, with each layer using two consecutive Swin Transformer Blocks; the resolution of the features in each layer is adjusted sequentially to obtain the encoder;
Step 1.2: Build the decoder. Two consecutive Res-conv and a 1×1 convolution are connected after each side of the encoder; the multi-scale prediction maps obtained from the sides are combined by shared concatenation to obtain the decoder;
Step 2: Perform semantic annotation on the ground truth (GT) of the shadow image. First, the shadows in the images are classified into different categories based on the shapes of the occluders in the dataset; then, different colors are used to represent these shadow categories, and the corresponding color masks are added to the ground truth to obtain a semantic label set;
Step 3: Deep supervision learning;
Step 3.1: Shadow supervision. The ground truth (GT) is used to supervise the shadow region of the feature map generated by the first three layers of the decoder network, and multi-scale shadow maps are generated by single-channel 1×1 convolution;
Step 3.2: Semantic supervision. Semantic supervision is performed on the semantic shadow map generated by the fourth layer of the decoder using semantic labels, and the semantic shadow map is generated by K-channel 1×1 convolution;
Step 3.3: Fusion detection. The multi-scale shadow maps obtained in step 3.1 and the semantic shadow map obtained in step 3.2 are compressed and upsampled to the same resolution and then combined by shared concatenation; semantic labels are used for supervision to obtain a fused semantic shadow map; binarization is then performed to output the final shadow detection result.

2. The semantically perceptive image shadow detection method as described in claim 1, characterized in that: In the encoder, the input shadow image is first segmented into multiple non-overlapping patches by a patch segmentation layer, and then a hierarchical feature map is constructed in four stages through a four-layer network of the encoder. In the first stage, the feature dimension is transformed through a linear embedding layer, and then representation learning is performed through two consecutive Swin Transformer modules. In the second to fourth stages, downsampling is performed through a patch merging layer, and then feature transformation is performed through two consecutive Swin Transformer modules. In the two consecutive Swin Transformer modules of each layer, the first Swin Transformer module adopts a window-based multi-head self-attention module, which performs self-attention calculation within the region after dividing the patch into non-overlapping regions; the second Swin Transformer module adopts a moving window-based multi-head self-attention module to realize information interaction between windows.

3. The semantically perceptive image shadow detection method as described in claim 1, characterized in that: Step 2 uses the public datasets ISTD and SBU to create a semantic label set, and sets the following annotation rules: ① If an image contains multiple shadow categories and there are shadow masks of different types connected together, the boundaries of the mask are defined based on the prior knowledge of the occluder; ② Group shadows of the same shape but different sizes into the same category; ③ Group shadows cast by objects with similar shapes into the same category.

4. The semantically perceptive image shadow detection method as described in claim 1, characterized in that: The network's parameters were optimized using stochastic gradient descent with a batch size of 16, a learning rate of 0.001, and momentum decay and weight decay of 0.9 and 1e-4, respectively.

5. The semantically perceptive image shadow detection method as described in claim 1, characterized in that: The shadow supervision loss in step 3 is:

where W represents all network parameters; m = 1, 2, 3 is the index of the encoder side output; $S = \{S_1, S_2, S_3\}$ represents the shadow maps generated from the first to third layers of the encoder, from which the activation function value at pixel i is taken; P(·) is the Sigmoid activation function; and $Y = \{y_i : i = 1, 2, \ldots, |I|\}$ represents the shadow label GT; The semantic supervision loss is:

where the activation term denotes the activation function value at pixel i for the k-th class, $A_4$ represents the semantic shadow map generated by the fourth layer of the encoder, and $C_k$ is the semantic label of the k-th shadow category; For the fused semantic shadow map, the semantic supervision loss is set as follows:

where $S_f$ represents the stacked shadow activation map; The loss for joint shadow supervision and semantic supervision is: