Weakly supervised semantic segmentation method based on shape block semantic correlation degree
By combining shape block pooling with graph convolutional networks, the method addresses the problem that class activation maps cannot fully cover target objects, achieving efficient semantic segmentation under weak supervision and improving segmentation accuracy and speed.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANKAI UNIV
- Filing Date
- 2024-01-23
- Publication Date
- 2026-06-23
AI Technical Summary
In semantic segmentation under weak supervision, existing technologies cannot fully cover the target object with class activation maps, making it difficult to train semantic association networks. Furthermore, existing methods are computationally slow and difficult to form an end-to-end framework.
Semantic relevance is calculated using a shape block-based approach. Shape block pooling is used to generate more complete activation regions, and loss constraints based on low-level semantic similarity and shape block class consistency are introduced. Semantic classification is then performed using a graph convolutional network to generate pseudo-labels.
It improves the accuracy and speed of semantic segmentation, enabling precise semantic segmentation even in the absence of pixel-level annotations, and is suitable for image-level weakly supervised labeled data.
[Figure CN118135209B_ABST: abstract drawing]
Description
Technical Field
[0001] This invention belongs to the field of computer data processing and, more specifically, relates to a weakly supervised semantic segmentation method based on the semantic correlation degree of shape blocks.
Background Technology
[0002] Semantic segmentation is one of the most fundamental visual tasks and plays a crucial role in the underlying perception module of autonomous driving. Essentially, it's a dense prediction problem, aiming to classify all pixels in an image, accurately locate object regions, and eliminate background interference with pixel-level precision. However, the pixel-by-pixel labeled training data required to train semantic segmentation networks is far more expensive to obtain than data from classification and detection. Using more readily available weakly supervised information has become a research hotspot. Weak supervision can be multi-level supervision information, such as category labels, bounding box labels, and subset labels of labeled pixels. The main contribution of this invention is to complete the task of weakly supervised semantic segmentation with only image-level category labels.
[0003] Current methods for semantic segmentation using image-level supervision train a classification model and use pixel-level supervision for the output mask. The gap in supervision information makes the two-stage paradigm the mainstream approach. The first stage uses image-level labels to train a classification model to obtain a Class Activation Map (CAM). The CAM is then refined to generate dense pseudo-labels for each image. The second stage trains a semantic segmentation model based on the pseudo-labels.
[0004] Image-level supervised semantic segmentation typically uses a class activation map (CAM) as a seed region from which the object's mask is iteratively recovered. Existing methods either optimize the CAM so that it forms a more complete seed region, or refine the seed region to generate more reliable pseudo-labels. However, CAMs have two limitations: incompleteness and redundancy. They cannot cover the entire region of the target category, and they may overlap with regions of other object categories.
[0005] Therefore, some works use prior knowledge of cross-pixel similarity to guide seed region expansion, ensuring expansion into regions with similar semantics. At CVPR 2018, the deep seeded region growing method was proposed to refine seed region expansion. This method trains the semantic segmentation network starting from the discriminative regions, gradually increases the pixel-level supervision as the seed regions expand, and dynamically generates new labels during training using in-image context information. Also at CVPR 2018, AffinityNet was proposed to optimize the CAM and generate better seed regions. This network predicts the semantic similarity between adjacent pixel pairs, initializes the seed region from the CAM results, and iteratively updates the seed region through random walks on the semantic similarity matrix, yielding more refined object contours.
[0006] These methods use the CAM to obtain highly activated regions for seed region refinement, restoring accurate image boundaries. To address the slow semantic computation of current methods, their difficulty in practical deployment, and the challenge of forming an end-to-end framework, this invention proposes a shape-block-based weakly supervised semantic segmentation method.
Summary of the Invention
[0007] This invention addresses the problem that activation regions in the class activation maps generated by classification networks cannot completely cover the target object, leading to difficulties in training semantic correlation networks. The CAM generation method based on shape block pooling generates more complete and accurate activation regions within the CAM. Furthermore, semantic correlation calculation based on shape blocks increases long-distance dependencies and captures broader contextual information. The semantic correlation network is trained by introducing loss constraints based on low-level semantic similarity and shape block class consistency. The correlation matrix output by the trained network is used as an adjacency matrix in a graph convolutional network. Finally, a graph convolutional network is used to classify graph structure data with shape blocks as nodes. The category of the shape block containing each node becomes the category label for all pixels within that shape block, ultimately achieving semantic segmentation. The shape block-based image representation method is interpretable and therefore highly scalable. The same architecture can be used in tasks such as image understanding and semantic analysis, and the overall accuracy can be improved by using more advanced network modules.
[0008] To achieve the above objectives, the present invention provides the following technical solution:
[0009] A weakly supervised semantic segmentation method based on shape block semantic association includes the following steps:
[0010] S1. Input the original image into the classification network to obtain the class activation map;
[0011] S2. Obtain graph structure data with shape blocks as nodes from the original image through the shape segmentation module;
[0012] S3. Use the shape block partitioning results in S2 to perform shape block pooling on the class activation map to obtain a pooled class activation map;
[0013] S4. Use the confidence regions in the pooled class activation map to train the semantic association network;
[0014] S5. Graph convolutional networks use the adjacency matrix output by the semantic association degree network to perform semantic classification of graph nodes and aggregate them into pseudo-labels.
[0015] S6. A supervised semantic segmentation network is trained using pseudo-labels. This network receives the original image and outputs the predicted semantic segmentation results.
[0016] In a further optimization of this technical solution, step S1 uses a classification network to obtain a class activation map for generating confidence regions. The classification network uses ResNet50 as the backbone network and is trained using the PASCAL VOC2012 dataset. The training specifically includes inputting the original image into the classification network, the network output being a probability vector of the category, calculating the loss by combining it with the one-hot encoding of the real image category label using the cross-entropy loss function, and completing the training using gradient backpropagation.
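The classification loss described in step S1 can be sketched as follows. This is a minimal NumPy illustration of softmax followed by cross-entropy against one-hot labels, not the patented training code; the function names and the toy logits are assumptions introduced here.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, one_hot):
    # Mean negative log-likelihood of the labeled class over the batch.
    probs = softmax(logits)
    return float(-(one_hot * np.log(probs + 1e-12)).sum(axis=1).mean())

# Toy batch: two images, three categories (hypothetical values).
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
one_hot = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
loss = cross_entropy(logits, one_hot)
```

In an actual training loop the gradient of this loss would be backpropagated through the ResNet50 backbone; that part is omitted here.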
[0017] In a further optimization of this technical solution, step S2 uses a fully convolutional network to divide the input image into shape blocks and construct graph structure data with shapes as nodes. Specifically, this includes dividing the input image into shape blocks and using the shape blocks as nodes.
[0018] First, the input image is segmented into shape blocks. The color image is converted into a 5-dimensional feature vector in the CIELAB color space and XY position coordinates. A fully convolutional network is used to select the 5-dimensional feature vector to construct a distance metric, considering both LAB color and XY position information when calculating the distance. By selecting weight hyperparameters, the proportion of color and distance can be adjusted to optimize the generation of shape blocks and make them more regular. Specifically, the fully convolutional network uses an encoder-decoder framework. The encoder is used for image feature extraction, and the decoder is used for shape block segmentation. The loss function during network training consists of two parts: one part groups pixels with similar color and position attributes together, and the other part restricts the compactness of shape block generation, that is, it limits the search range when searching for pixels to be added to shape blocks during training.
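The role of the weight hyperparameters in the 5-dimensional distance metric can be illustrated with a short sketch. The weighted Euclidean form below, and the names `color_weight` and `position_weight`, are assumptions made for illustration; the text states only that LAB color and XY position are both considered and that their proportion is adjustable.

```python
import numpy as np

def labxy_distance(feat_a, feat_b, color_weight=1.0, position_weight=0.5):
    """Weighted distance between two 5-D (L, A, B, X, Y) feature vectors."""
    feat_a = np.asarray(feat_a, dtype=float)
    feat_b = np.asarray(feat_b, dtype=float)
    d_color = np.linalg.norm(feat_a[:3] - feat_b[:3])  # LAB color difference
    d_pos = np.linalg.norm(feat_a[3:] - feat_b[3:])    # XY position difference
    # A larger position_weight favors spatially compact, more regular blocks.
    return float(np.sqrt(color_weight * d_color**2 + position_weight * d_pos**2))

d = labxy_distance([50, 10, 10, 0, 0], [53, 10, 10, 0, 4],
                   color_weight=1.0, position_weight=1.0)
```

Raising the position weight relative to the color weight makes nearby pixels cluster together even when their colors differ, which is one way to read the statement that the blocks become "more regular".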
[0019] Secondly, shape blocks are used as nodes. After dividing the original image into shape blocks in the previous step, each image yields graph structure data. The node information includes category labels, location information, and RGB information. The edges between nodes are represented by an adjacency matrix to show the degree of association between nodes. The graph structure data is used by the graph convolutional network to perform semantic classification of shape block categories. The steps include: graph structure data generation, graph convolutional network construction, and graph node semantic classification.
[0020] In a further optimization of this technical solution, step S3, the fully convolutional network divides the input image into shape blocks, and on the class activation map obtained by fusing information from each feature layer of the classification network, thresholding and pooling operations are performed on the activation values within the shape block range.
[0021] For each shape block $a_i$, the activation values of all pixels within the shape block are pooled. Max pooling is chosen here because, compared with average pooling and other pooling methods, it preserves the most significant features, thus retaining important information in the image and reducing information loss. Specifically, it is expressed as follows:
[0022] Y[i,j,c]=max(X[h:h+ph,w:w+pw,c])
[0023] Where Y is the output feature map after pooling, i and j are the height and width coordinates of the output feature map, c is the channel index, h and w are the starting coordinates in the input feature map, and ph and pw are the height and width of the pooling region. Max pooling is performed on the pixels within the region to obtain the pooled class activation map. Max pooling selects the maximum value in a local region and has a certain degree of invariance to slight translations. Unlike average pooling, max pooling sets the activation value of the entire shape block region to the maximum activation value. Where the class activation map has low activation values at object edges, max pooling compensates by using the edge information of the shape block itself. The pooled class activation map therefore yields an activation region with clear boundaries that can extend to the entire object. After thresholding, the activation region can be used as the seed region for pseudo-label generation.
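Because a shape block is an irregular region rather than a rectangular window, the pooling is easiest to picture with a per-pixel block-id map. The following NumPy sketch assumes such a map (`block_ids`) and a single-class activation map; the threshold value 0.5 is hypothetical.

```python
import numpy as np

def shape_block_max_pool(cam, block_ids):
    """Set every pixel of a shape block to the block's maximum activation.

    cam:       (H, W) activation map for one class.
    block_ids: (H, W) integer map, one id per shape block (0..K-1).
    """
    num_blocks = int(block_ids.max()) + 1
    block_max = np.full(num_blocks, -np.inf)
    # Unbuffered scatter-max: block_max[k] becomes the max of cam over block k.
    np.maximum.at(block_max, block_ids.ravel(), cam.ravel())
    return block_max[block_ids]

cam = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.1, 0.0]])
block_ids = np.array([[0, 0, 1],
                      [0, 1, 1]])
pooled = shape_block_max_pool(cam, block_ids)
seed = pooled > 0.5  # hypothetical threshold for the seed region
```

In this toy case the single strongly activated pixel of block 0 spreads its value to the whole block, so the seed region follows the block outline instead of the blurry activation peak.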
[0024] In a further optimization of this technical solution, step S4 trains a semantic association network based on shape blocks to refine the seed region. The semantic association network uses ResNet50 as its backbone and determines whether shape blocks belong to the same semantic category by calculating the association degree between each shape block and its neighboring shape blocks. During training, the shape block category information obtained in step S3 within the seed region is used as label data. If two shape blocks belong to the same category, the association degree label is 1; otherwise, it is 0. The semantic association network takes the segmented shape blocks as input. After feature extraction, it calculates an association degree value a ∈ [0,1] for every pair of shape blocks. The closer a is to 0, the weaker the association between the two shape blocks; the closer it is to 1, the stronger the association. During training, the model is constrained by a loss function that combines low-level semantic similarity and shape block class consistency. The loss function l is expressed as:
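The construction of the 0/1 association labels can be sketched as below. The use of -1 to mark shape blocks that fall outside the seed region (and therefore have no reliable category) is an assumption added for illustration; the text only specifies the 1 and 0 cases.

```python
import numpy as np

def association_labels(block_classes, ignore=-1):
    """Pairwise targets: 1 for same class, 0 for different, -1 where unknown."""
    c = np.asarray(block_classes)
    same = (c[:, None] == c[None, :]).astype(int)
    # A pair is usable only if both blocks carry a seed-region category.
    known = (c[:, None] != ignore) & (c[None, :] != ignore)
    return np.where(known, same, ignore)

# Four shape blocks: two of class 1, one of class 0, one without a seed label.
labels = association_labels([1, 1, 0, -1])
```

Pairs marked -1 would simply be excluded from the loss, so only confident seed-region blocks supervise the network.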
[0025]
[0026] Here, $q_i$ represents the probability distribution of the predicted class for pixel $i$, $p_i$ represents the probability distribution of the class for pixel $i$ obtained from the partial pseudo-labels, and $N$ represents the number of pixels used to calculate the loss. In the second part of the loss function, $\Phi_{i,j}$ is the constraint representing low-level semantic similarity.
[0027] The loss function consists of two parts: the first part represents the shape block class consistency constraint, and the second part represents the low-level semantic similarity constraint $\Phi_{i,j}$ between pixels $i$ and $j$. Specifically, it is expressed as:
[0028]
[0029] Where $c_i$ and $c_j$ represent the color information of pixel $i$ and pixel $j$, respectively, $f_i$ and $f_j$ represent the positional information of pixels $i$ and $j$, respectively, and $\sigma_1$ and $\sigma_2$ are hyperparameters. Low-level semantic similarity participates in the calculation of correlation by constraining the color and positional information of the pixels. After training, given an input divided into shape blocks, the semantic correlation network outputs the correlation between shape blocks. Shape blocks can be viewed as graph-structured data, where the node feature vectors include positional and RGB information. Edges between graph nodes are represented by an adjacency matrix, and the shape block correlations output by the network are used as the values of this adjacency matrix. The adjacency matrix directly affects the graph convolution operation: in graph convolutional networks, it is used to aggregate information from neighboring nodes. When the adjacency matrix represents node correlation, the graph convolution operation focuses on the semantic correlation between nodes and better captures local structural information in the graph. In summary, the network takes the image divided into shape blocks as input and outputs a correlation matrix, which serves as the adjacency matrix of the graph convolutional network for semantic classification of graph nodes.
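The expression for the low-level similarity term is not reproduced in this text. As an illustration only, the sketch below assumes a Gaussian kernel over color and position differences, a form commonly used for such terms; the functional form, the pairing of `sigma1` with color and `sigma2` with position, and the default values are all assumptions, not taken from the patent.

```python
import numpy as np

def low_level_similarity(c_i, c_j, f_i, f_j, sigma1=10.0, sigma2=5.0):
    """Assumed Gaussian kernel on color (c) and position (f) differences."""
    c_i, c_j = np.asarray(c_i, dtype=float), np.asarray(c_j, dtype=float)
    f_i, f_j = np.asarray(f_i, dtype=float), np.asarray(f_j, dtype=float)
    color_term = np.sum((c_i - c_j) ** 2) / (2.0 * sigma1**2)
    pos_term = np.sum((f_i - f_j) ** 2) / (2.0 * sigma2**2)
    # Similar color and nearby position give a value close to 1.
    return float(np.exp(-color_term - pos_term))

near = low_level_similarity([120, 80, 60], [122, 81, 60], [10, 10], [11, 10])
far = low_level_similarity([120, 80, 60], [20, 200, 30], [10, 10], [90, 70])
```

Under this assumed form, the hyperparameters control how quickly similarity decays with color difference and with spatial distance.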
[0030] In a further optimization of this technical solution, in step S5, the graph convolutional network uses the adjacency matrix output by the semantic association network to perform semantic classification of graph nodes. The adjacency matrix reflects the degree of association between nodes. During the forward propagation of the graph convolutional network, each layer uses the adjacency matrix of the nodes to perform information aggregation and feature extraction, and the graph convolution performs semantic propagation, which is specifically represented as follows:
[0031]
[0032] Where X and X′ represent the input and output of the graph convolution, respectively, and k denotes a learnable convolutional filter of size k, the counterpart for graph-structured data of a square convolutional filter. The output embedding X′ of a node is a weighted sum of these filter outputs. A represents the adjacency matrix, and D is the degree matrix. The graph convolutional layer extracts node features and the strength of relationships between nodes, and thus can classify nodes. During training, the number of shape blocks belonging to the background class may be greater than, or even far greater than, the number belonging to the target object. A subgraph sampling network is used to address this class imbalance in node classification, and it performs well on classes with few samples. Specifically, the graph convolutional network aggregates feature information from only a fixed number of neighboring nodes and then updates iteratively. As the number of iterations increases, the information aggregated at each node extends toward the whole graph, which improves efficiency and enhances the network's generalization ability when training on large-scale graph-structured data. The aggregation uses pooling aggregation: a non-linear transformation is first applied to the feature embedding of each node from the previous layer, and average or max pooling is then performed on the result. This aggregation depends only on the current k-order neighbor nodes and does not need to consider global nodes, so it has inductive learning capability. After feature aggregation, the output layer performs semantic classification on the nodes, and shape blocks with the same category are aggregated to form pseudo-labels.
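The propagation formula itself is not reproduced in this text. The sketch below therefore assumes the widely used symmetric normalization with self-loops, X′ = ReLU(D^(-1/2) (A + I) D^(-1/2) X W), applied to a real-valued adjacency matrix; the function name and the weight matrix `weight` are hypothetical.

```python
import numpy as np

def gcn_layer(x, adj, weight):
    """One propagation step X' = ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    deg = a_hat.sum(axis=1)             # weighted node degrees (diagonal of D)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ weight, 0.0)

# Two fully associated shape blocks with one feature each.
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
x = np.array([[1.0], [3.0]])
out = gcn_layer(x, adj, np.array([[1.0]]))
```

With association values between 0 and 1 in `adj`, strongly associated shape blocks contribute more to each other's updated features than weakly associated ones.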
[0033] In a further optimization of this technical solution, the sub-masks of each category obtained by the graph convolutional network constitute a mask dataset, which is used in step S6 to train the supervised semantic segmentation network. When an image is fed into the semantic segmentation network, the probability of each pixel belonging to each category is obtained. Training is constrained by the cross-entropy loss computed against the one-hot encoding of the labels in the mask dataset. The supervised semantic segmentation network here is DeepLab v2, a deep convolutional neural network built on the fully convolutional network (FCN) design, which can achieve fine semantic segmentation.
[0034] In a further optimization of this technical solution, the graph convolutional network includes GCNConv layers, a ReLU nonlinear layer, and a dropout operation. The GCNConv layer projects the feature vector into a low-dimensional space. After activation by the ReLU layer, dropout is applied to prevent overfitting. The nonlinear transformation that projects the feature vector into the low-dimensional space is implemented by the neurons of the network, and ReLU introduces the nonlinearity that gives the network stronger representation capability. The aggregation function of the graph convolutional network needs to adapt to the number of nodes being aggregated, so pooling aggregation is used: a nonlinear transformation is first applied to the feature embedding of each node from the previous layer, and average or max pooling is then performed on the result. This aggregation depends only on the current k-order neighbor nodes and does not need to consider global nodes, so it has inductive learning capability. Finally, the last GCNConv layer embeds the low-dimensional node representations into the category space, thereby achieving node classification.
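Pooling aggregation over a sampled neighborhood can be sketched as follows. This is a simplified NumPy illustration, assuming ReLU as the nonlinear transformation and an explicit neighbor list per node; `w_pool` and the toy graph are hypothetical.

```python
import numpy as np

def pool_aggregate(features, neighbors, w_pool, mode="max"):
    """Transform each node embedding, then pool element-wise over neighbors."""
    transformed = np.maximum(features @ w_pool, 0.0)  # nonlinear transform (ReLU)
    out = np.zeros((len(neighbors), w_pool.shape[1]))
    for node, nbrs in enumerate(neighbors):
        if len(nbrs) == 0:
            continue  # an isolated node keeps a zero aggregate
        stacked = transformed[list(nbrs)]
        out[node] = stacked.max(axis=0) if mode == "max" else stacked.mean(axis=0)
    return out

features = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
neighbors = [[1, 2], [0], []]  # sampled neighbor ids per node
agg_max = pool_aggregate(features, neighbors, np.eye(2), mode="max")
agg_mean = pool_aggregate(features, neighbors, np.eye(2), mode="mean")
```

Because the result for a node depends only on its listed neighbors, the same function applies unchanged to graphs built from images not seen during training, which is the inductive property mentioned above.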
[0035] In a further optimization of this technical solution, the graph structure data and the adjacency matrix output by the semantic association network are input into a graph convolutional network. The node information in the graph structure data is a 5-dimensional vector containing RGB color information and XY position information. The node association information output by the semantic association network serves as the adjacency matrix A, representing the semantic association between nodes. Because this matrix is real-valued rather than binary, it expresses graded semantic similarity and allows the graph convolution to capture more detailed local information. After multiple layers of graph convolution, the final node embedding vector is input to the output layer for node classification. The softmax function outputs the probability that each node belongs to each class, giving a class prediction for each node. The class of each node's shape block becomes the class label for all pixels in that shape block, so the pseudo-labels provide pixel-level segmentation.
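The final step, turning per-node class predictions into a pixel-level pseudo-label, reduces to an index lookup. The sketch below assumes the node logits are already available and that `block_ids` maps each pixel to its shape block; both names are introduced here for illustration.

```python
import numpy as np

def blocks_to_pseudo_label(node_logits, block_ids):
    """Softmax over node logits, then paint each block with its argmax class."""
    z = node_logits - node_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    node_class = probs.argmax(axis=1)  # one class per shape block
    return node_class[block_ids]       # broadcast to every pixel of the block

node_logits = np.array([[0.1, 2.0, 0.3],   # block 0 -> class 1
                        [1.5, 0.2, 0.1]])  # block 1 -> class 0
block_ids = np.array([[0, 0, 1],
                      [0, 1, 1]])
pseudo_label = blocks_to_pseudo_label(node_logits, block_ids)
```

The pseudo-label inherits its boundaries from the shape block partition, which is why the quality of step S2 directly affects the sharpness of the final masks.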
[0036] The advantages of the above technical solution, which differ from existing technologies, are as follows:
[0037] 1. By utilizing shape blocks with semantic information as the basic unit of semantic segmentation, the problem of blurred boundaries in weakly supervised semantic segmentation methods is improved. By incorporating rich semantic information, the system can accurately understand objects and their boundaries in images, thereby enhancing the performance of semantic segmentation.
[0038] 2. This method enables semantic segmentation to be flexibly applied to scenarios where pixel-level annotations are lacking and only image-level weakly supervised label data is available. It transforms image-level weakly supervised labels into pixel-level fine segmentation when data annotations are scarce, achieving accurate semantic segmentation results even in the absence of detailed annotations.
[0039] 3. This method uses pooling activation maps, employing max pooling. It leverages the shape prior information of the shape block itself to extend the maximum activation value to all pixels within the shape block's range. This allows the activation value of the activation map to extend to the entire target object region, refining the initialization area of the pseudo-label and improving the accuracy of the final pseudo-label.
[0040] 4. Introducing graph convolutional networks to process graph-structured data can improve algorithm efficiency. By representing images as graph-structured data with shape blocks as nodes, the correlation information between shape blocks in the image can be better captured, thereby improving the speed and performance of the algorithm when processing large-scale image data.
Attached Figure Description
[0041] Figure 1 is a flowchart of a weakly supervised semantic segmentation method based on shape block semantic association.
[0042] Figure 2 shows an embodiment of a weakly supervised semantic segmentation framework based on shape block semantic correlation.
[0043] Figure 3 illustrates the impact of shape block pooling on the original class activation map results.
Detailed Implementation
[0044] To explain in detail the technical content, structural features, objectives, and effects of the technical solution, the following description is provided in conjunction with specific embodiments and accompanying drawings.
[0045] This invention proposes a weakly supervised semantic segmentation method based on shape block semantic association. It uses shape block pooling to obtain more complete seed regions in the class activation map (CAM), and calculates the semantic association between shape blocks as the state transition matrix. A graph convolutional network is used to classify the shape blocks, thereby obtaining semantic segmentation pseudo-label data for training the supervised semantic segmentation network in the second stage.
[0046] To address the problem that generated class activation maps in weakly supervised semantic segmentation cannot cover the entire region of the target object, this invention uses pooled class activation maps to generate clear segmentation boundaries and allows activation values to spread to the entire region of the object, resulting in the generation of a more complete seed region.
[0047] To address the issue that using pixels to calculate correlation in a correlation calculation network cannot account for long-distance semantic relationships, this invention uses shape blocks instead of pixel blocks for correlation calculation. This enables the calculation of semantic information over a wider range, resulting in a more accurate correlation matrix and thus generating more accurate pseudo-labels.
[0048] To address the problem that shape-block-based image representations are difficult to process using convolutional neural networks, this invention uses graph convolutional networks for semantic propagation. Furthermore, for high-resolution images, shape-block-based graph convolutional networks are more computationally efficient and have stronger scalability.
[0049] Example 1:
[0050] The key idea behind the invention is:
[0051] 1) Design a weakly supervised semantic segmentation framework based on shape block correlation, and obtain a more complete and accurate activation region in the class activation map through shape block pooling.
[0052] 2) Use the generated complete activation regions as category labels, use shape blocks to calculate the correlation, and constrain the model training process by introducing loss for low-level semantic similarity and shape block class consistency.
[0053] 3) Use shape blocks as nodes of a graph convolutional network and leverage the classification capabilities of the graph convolutional network to obtain the category information of each shape block, thereby obtaining pixel-level segmentation results for the entire image.
[0054] In this weakly supervised semantic segmentation task, the only supervision signal used is the image category label. This invention introduces a shape pooling module and a shape-based correlation calculation module to obtain the correlation information between shape blocks. This information is used by the graph convolution module to realize node semantic classification, and finally realize pseudo-label generation.
[0055] Referring to Figure 1, this invention proposes a weakly supervised semantic segmentation method based on shape blocks, comprising the following steps:
[0056] S1. Input the original image into the classification network to obtain the class activation map.
[0057] Step S1 uses a classification network to obtain a class activation map based on shape block pooling. The classification network uses ResNet50 as the backbone network and is trained using the PASCAL VOC2012 dataset. The training specifically includes inputting the original image into the classification network. The network output is a probability vector of the class. The loss is calculated by using the cross-entropy loss function with the one-hot encoding of the real image class label. The training is completed using gradient backpropagation.
[0058] In this example, the input to the classification network is RGB color aerial images from the training dataset. The classification network uses ResNet50 as its backbone, forming a deep network through stacked residual blocks. Each basic residual block contains two branches: an identity mapping and two additional convolutional layers. The outputs of these two branches are summed. The network contains several 3×3 and 1×1 convolutional kernels: the 1×1 kernels reduce dimensionality, while the 3×3 convolutional layers extract features. Global average pooling is used at the end of the network to convert the feature map of the last convolutional layer into a vector, which serves as the input to the classifier. During classification training, the feature vectors obtained by the backbone network are processed by softmax to obtain the classification probabilities for each category. These probabilities are then used to calculate the cross-entropy loss against the one-hot encoding of the ground truth labels. After the classification network is trained, each RGB color aerial image to be processed is input to the network to obtain the classification probabilities for each category. These probabilities serve as weights for class activation in the spatial representation. The activation weights are multiplied by the corresponding original feature maps and summed to obtain the class activation map.
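The last step, a weighted sum of feature maps, can be sketched in a few lines. The sketch takes the per-channel weights as given, however they are derived; the ReLU and the normalization to [0, 1] are common post-processing choices assumed here, not stated in the text.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps: CAM[h, w] = sum_k weights[k] * F[k, h, w]."""
    cam = np.tensordot(class_weights, feature_maps, axes=([0], [0]))
    cam = np.maximum(cam, 0.0)  # assumed: keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam  # assumed: normalize to [0, 1]

# Two 2x2 feature maps from the last convolutional layer (toy values).
feature_maps = np.array([[[1.0, 0.0], [0.0, 0.0]],
                         [[0.0, 2.0], [0.0, 0.0]]])
class_weights = np.array([1.0, 0.5])
cam = class_activation_map(feature_maps, class_weights)
```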
[0059] S2. Obtain graph structure data with shape blocks as nodes from the original image through the shape segmentation module.
[0060] Step S2: The fully convolutional network divides the input image into shape blocks. On the class activation map obtained by fusing information from each feature layer of the classification network, the activation values within the shape blocks are thresholded and pooled to obtain activation regions with clear boundaries that can be extended to the entire object. After thresholding, the activation regions can be used as seed regions for pseudo-label generation.
[0061] Step S2 uses a fully convolutional network (FCN) to segment shape blocks, constructing a graph structure data with shape blocks as nodes from the image data. Each image yields a graph structure data set, where node information includes category labels, location information, and RGB information; the adjacency matrix represents the correlation information between nodes. This is used for the shape block classification task using graph convolutional networks. Shape block category classification using graph convolutional networks includes: graph structure data generation, graph convolutional network construction, and graph node semantic classification.
[0062] The RGB color aerial image to be processed is input into the shape block segmentation module, which uses a fully convolutional network (FCN) with an encoder-decoder architecture. The network output is the shape block segmentation result, in which pixels belonging to the same shape block carry the same label so that different shape blocks can be distinguished.
[0063] The shape blocks obtained in the shape block segmentation module are used as nodes in the graph structure data. Each input RGB color aerial image is processed to obtain a graph structure data, where the node information includes the shape block's category information, position information, and RGB color information. The position information and color information use the average position of all pixels within the shape block and the average value of each color channel, specifically represented as follows:
[0064]
[0065] Where $C_j$ and $P_j$ represent the color and position information of the $j$-th shape block node, respectively; $R$, $G$, and $B$ represent the color information of the three image channels within the shape block; and $x$ and $y$ represent the position information of the $j$-th shape block. The edge information between nodes is represented by an adjacency matrix to show the degree of association between nodes.
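The averaging described above (mean color per channel and mean pixel position per shape block) can be sketched with `np.bincount`. The function name is hypothetical, and the sketch assumes block ids are contiguous integers starting at 0 so that every id has at least one pixel.

```python
import numpy as np

def block_node_features(image, block_ids):
    """Mean RGB color and mean (x, y) position for every shape block."""
    h, w, _ = image.shape
    ids = block_ids.ravel()
    counts = np.bincount(ids)        # number of pixels per block
    ys, xs = np.mgrid[0:h, 0:w]      # row (y) and column (x) coordinates
    columns = [image[..., ch].ravel() for ch in range(3)] + [xs.ravel(), ys.ravel()]
    sums = np.stack([np.bincount(ids, weights=col) for col in columns], axis=1)
    return sums / counts[:, None]    # shape (K, 5): R, G, B, x, y

image = np.array([[[10, 20, 30], [30, 40, 50], [0, 0, 0]]], dtype=float)
block_ids = np.array([[0, 0, 1]])
features = block_node_features(image, block_ids)
```

Each row of `features` is the 5-dimensional node vector of one shape block.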
[0066] S3. Using the shape block partitioning results in S2, perform shape block pooling on the class activation graph to obtain a pooled class activation graph.
[0067] In step S2, the shape block segmentation results are obtained. For each shape block segmented from the input image, the class activation values of all pixels within each shape block are obtained. These activation values can be obtained from the class activation map in S1. Average pooling is performed on the activation values within the shape blocks. This operation yields a clear boundary and extends the activation values to the entire object region. Subsequently, a thresholding operation is performed on the pooled class activation map. For each shape block, a judgment is made: if the activation value is greater than the threshold, the class label of that shape block is set to 1; otherwise, it is set to 0. This process obtains the class labels of all shape blocks. These class labels serve as important information in the graph structure data for subsequent calculations.
[0068] The fully convolutional network divides the input image into shape blocks. On the class activation map obtained by fusing information from each feature layer of the classification network, thresholding and pooling operations are performed on the activation values within each shape block. For each shape block a_i, the activation values of all pixels within it are pooled, specifically as follows:
[0069] Y[i,j,c]=max(X[h:h+ph,w:w+pw,c])
[0070] Where Y is the output feature map after pooling, i and j represent the height and width coordinates of the output feature map, c represents the channel index, h and w represent the starting coordinates in the input feature map, and ph and pw represent the height and width of the pooling region. Max pooling is performed on the pixels within the region to obtain the pooled class activation map. The pooled class activation map produces activation regions with clear boundaries that can extend to the entire object. After thresholding, the activation regions can be used as seed regions for pseudo-label generation.
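The pooling and thresholding described above can be sketched as follows. This is a minimal NumPy example in which the pooling region is an irregular shape block rather than a rectangular window; the function name `shape_block_pool`, the threshold value 0.5, and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def shape_block_pool(cam, segments, threshold=0.5):
    """Max-pool a class activation map inside each shape block.

    cam:      (H, W, C) activation values, one channel per class.
    segments: (H, W) int array of shape block indices in 0..K-1.
    Returns the pooled map (H, W, C) and binary block labels (K, C).
    """
    k = int(segments.max()) + 1
    c = cam.shape[-1]
    block_max = np.full((k, c), -np.inf)
    # Unbuffered in-place maximum, so repeated block indices accumulate correctly.
    np.maximum.at(block_max, segments.ravel(), cam.reshape(-1, c))
    pooled = block_max[segments]  # write each block value back to all of its pixels
    labels = (block_max > threshold).astype(np.uint8)  # threshold is an assumed value
    return pooled, labels

# Toy input (assumed): one class channel, two shape blocks.
cam = np.array([[0.1, 0.9, 0.2, 0.3],
                [0.0, 0.4, 0.1, 0.2]])[..., None]
segments = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1]])
pooled, labels = shape_block_pool(cam, segments)
```

In this toy case the single strong response in block 0 is spread over the whole block, which is the effect the pooled class activation map is intended to achieve.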
[0071] S4. Use the confidence regions in the pooled class activation map to train the semantic association network.
[0072] Step S4 uses a shape-block-based semantic association network to refine the seed region. The semantic association network uses ResNet50 as the backbone network. It calculates the association degree between each shape block and its surrounding adjacent shape blocks. The category information within the seed region obtained in the previous step is used as label data for training. If two shape blocks belong to the same category, the association degree label is 1; otherwise, it is 0. The association degree matrix output by the network is used as the initial probability matrix for random walk, thus obtaining the refined seed region.
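The random walk refinement mentioned above can be illustrated with a short sketch. The source states only that the association degree matrix initializes the random walk; the element-wise exponent `beta`, the number of steps, and the row normalization used below are assumptions borrowed from common affinity-based refinement practice, and the function name `random_walk_refine` is hypothetical.

```python
import numpy as np

def random_walk_refine(affinity, seed_scores, beta=2, steps=4):
    """Propagate seed scores between shape blocks with a random walk.

    affinity:    (K, K) non-negative association degrees, with non-zero row sums.
    seed_scores: (K, C) initial class scores of the shape blocks.
    """
    powered = affinity ** beta  # assumed sharpening of strong associations
    transition = powered / powered.sum(axis=1, keepdims=True)  # row-stochastic matrix
    scores = seed_scores.astype(float)
    for _ in range(steps):
        scores = transition @ scores
    return scores

# Toy input (assumed): blocks 0 and 1 are strongly associated, block 2 is isolated.
affinity = np.array([[1.0, 0.9, 0.0],
                     [0.9, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
seed_scores = np.array([[1.0], [0.0], [0.0]])
refined = random_walk_refine(affinity, seed_scores)
```

Here the seed in block 0 spreads to the associated block 1, while the unrelated block 2 keeps a score of zero.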
[0073] The training process of the model is constrained by a loss function that combines low-level semantic similarity and shape block class consistency, wherein the loss function is expressed as:
[0074]
[0075] Where q_i represents the predicted class probability distribution for pixel i, p_i represents the class probability distribution for pixel i obtained from the partial pseudo-labels, and N represents the number of pixels used to calculate the loss. The first part of the loss function represents the shape block class consistency constraint. The second part, Φ_{i,j}, represents the low-level semantic similarity constraint between pixels i and j, specifically expressed as follows:
[0076]
[0077] Where c_i represents the color information of pixel i, f_i represents the position information of pixel i, and σ1 and σ2 are hyperparameters. Low-level semantic similarity participates in the calculation of the association degree by constraining the color and position information of pixels. After training, the network takes the image divided into shape blocks as input and outputs an association matrix, which serves as the adjacency matrix of the graph convolutional network for semantic classification of graph nodes.
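The formulas for the loss and for the low-level similarity term are not reproduced in the text above, so the following is only a sketch under stated assumptions rather than the loss of the invention. It assumes a Gaussian kernel over color and position differences with bandwidths σ1 and σ2, a cross-entropy term restricted to pseudo-labeled pixels, and an L1 difference between predictions weighted by the kernel. The function names `low_level_similarity` and `hybrid_loss` are hypothetical.

```python
import numpy as np

def low_level_similarity(colors, positions, sigma1=0.1, sigma2=5.0):
    """Assumed Gaussian kernel: near 1 for pixels with similar color and position."""
    dc = ((colors[:, None, :] - colors[None, :, :]) ** 2).sum(-1)
    dp = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    return np.exp(-dc / (2 * sigma1 ** 2) - dp / (2 * sigma2 ** 2))

def hybrid_loss(q, p, labeled, phi):
    """Assumed form: cross-entropy on labeled pixels plus a kernel-weighted L1 term.

    q: (N, C) predicted probabilities; p: (N, C) pseudo-label distributions;
    labeled: (N,) boolean mask of pixels that carry a pseudo-label; phi: (N, N).
    """
    eps = 1e-12
    ce = -(p[labeled] * np.log(q[labeled] + eps)).sum(axis=1).mean()
    l1 = np.abs(q[:, None, :] - q[None, :, :]).sum(-1)  # pairwise prediction difference
    pairwise = (phi * l1).sum() / (len(q) ** 2)
    return float(ce + pairwise)

# Toy input (assumed): pixels 0 and 1 are similar, pixel 2 has a different color.
colors = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
positions = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
phi = low_level_similarity(colors, positions)
q = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
p = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
labeled = np.array([True, False, True])
loss = hybrid_loss(q, p, labeled, phi)
```

Under these assumptions, the pairwise term penalizes different predictions only for pixels that look alike and lie close together, which is how low-level information could regularize pixels that have no pseudo-label.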
[0078] Step S4 uses a shape-block-based semantic association network to calculate the adjacency matrix. The semantic association network uses ResNet50 as its backbone. During training, the network predicts the association degree between each pair of shape blocks; the closer the predicted value is to 1, the higher the probability that the two shape blocks belong to the same category. The association degree label is calculated from the category information of the shape blocks obtained in S3: if the category information is the same, the association degree label is 1; otherwise, it is 0. The cross-entropy loss between the predicted association degree and the association degree label is used to train the semantic association network. After training, the network processes the input image divided into shape blocks and outputs the semantic association degree between each pair of shape blocks. This information serves as the adjacency matrix between nodes in the graph structure data described in S2.
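The construction of the association degree labels and the cross-entropy between predicted and labeled association degrees can be sketched as follows. The function names and the toy predictions are assumptions made for this example; in the method itself the predictions come from the ResNet50-based association network.

```python
import numpy as np

def affinity_labels(block_classes):
    """Label is 1 where two shape blocks share the same category, else 0."""
    return (block_classes[:, None] == block_classes[None, :]).astype(float)

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted and labeled association degrees."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

# Toy input (assumed): categories of three shape blocks obtained in S3.
block_classes = np.array([1, 1, 0])
target = affinity_labels(block_classes)
good_pred = np.where(target == 1, 0.9, 0.1)
bad_pred = 1.0 - good_pred
good_loss = binary_cross_entropy(good_pred, target)
bad_loss = binary_cross_entropy(bad_pred, target)
```

Predictions close to the labels give a small loss, so minimizing this loss pushes the predicted association degree toward 1 for blocks of the same category and toward 0 otherwise.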
[0079] S5. Graph convolutional networks use the adjacency matrix output by the semantic association network to perform semantic classification of graph nodes and aggregate them into pseudo-labels.
[0080] The graph convolutional network (GCN) uses the adjacency matrix output by the semantic association network to semantically classify graph nodes. The adjacency matrix reflects the degree of association between nodes. During the forward propagation of the GCN, each layer uses the adjacency matrix of the nodes for information aggregation and feature extraction. The graph convolution performs semantic propagation, which is specifically represented as follows:
[0081]
[0082] The output layer of the graph convolutional network performs semantic classification on nodes and aggregates shape blocks of the same category to form pseudo-labels. Here, X and X′ represent the input and output of the graph convolution, respectively, and k denotes a learned convolutional filter of size k, which plays the role of a square convolutional filter on the graph structure data. The output embedding X′ of a node is a weighted sum of the outputs of these filters. A represents the adjacency matrix, and D is the diagonal degree matrix. The graph convolutional layer extracts node features and the strength of relationships between nodes, and therefore can classify nodes. Remote sensing image data often shows foreground-background class imbalance because the target objects are small, so a subgraph sampling network is used to address the class imbalance problem in node classification. Specifically, the graph convolutional network aggregates feature information from only a fixed number of neighboring nodes and then updates it iteratively. As the number of iterations increases, the aggregated node information expands toward the whole graph, which improves efficiency and the generalization ability of the network when training on large-scale graph structure data. For high-resolution images, subgraph sampling also improves network efficiency.
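The propagation formula is not reproduced in the text above. As a hedged illustration of how an adjacency matrix A and a degree matrix D typically enter a graph convolution, the sketch below assumes the widely used symmetric normalization with self-loops, which may differ from the filter-based formula intended by the invention. The function names and the toy graph are assumptions.

```python
import numpy as np

def normalize_adjacency(a):
    """Assumed normalization: D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    a_hat = a + np.eye(a.shape[0])  # self-loops keep each node's own features
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(x, a, w):
    """One propagation step: aggregate neighbor features, then apply the weights."""
    return normalize_adjacency(a) @ x @ w

# Toy graph (assumed): nodes 0 and 1 are associated, node 2 is isolated.
a = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
x = np.array([[1.0], [3.0], [5.0]])
w = np.array([[1.0]])
out = graph_conv(x, a, w)
```

In the toy graph the two associated nodes end up with the average of their features, while the isolated node keeps its own value, which shows how the association degrees control the semantic propagation.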
[0083] The sub-masks of each category obtained using the graph convolutional network in step S5 constitute a mask dataset, which is used for training the supervised semantic segmentation network. The semantic segmentation network outputs, for each pixel, the probability of each category. The training is constrained by the cross-entropy loss between this output and the one-hot encoding of the labels in the mask dataset. The supervised semantic segmentation network here is the DeepLab v2 network.
[0084] The graph convolutional network (GCN) consists of a CGNConv layer, a ReLU nonlinear layer, and a dropout operation. The CGNConv layer projects the feature vector into a low-dimensional space, the result is activated by the ReLU layer, and dropout is then applied to prevent overfitting. The final CGNConv layer embeds the low-dimensional node representation into the category space. During training, the category label obtained in S3 is used as the ground truth for the node category predicted by the GCN, and the cross-entropy loss between the two is used to train the GCN. After the GCN is trained, given an input RGB image divided into shape blocks and the adjacency matrix obtained in S4, the GCN calculates the category information for each shape block, thus completing the category prediction for each node in the graph structure data. The category of the shape block to which a node belongs is the category of all pixels in that shape block. Therefore, the category information of each pixel in the input image can be obtained, yielding an accurately segmented pseudo-label.
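The layer sequence described above (graph convolution, ReLU, dropout, final graph convolution into the category space, cross-entropy on node labels) can be sketched in plain NumPy. This is a minimal forward pass under assumptions: each convolution is modeled as a normalized-adjacency propagation followed by a weight matrix, the weights are random, and the names `gcn_forward` and `node_cross_entropy` are hypothetical. It does not reproduce the CGNConv layer itself.

```python
import numpy as np

def gcn_forward(x, a_norm, w1, w2, drop_rate=0.5, rng=None):
    """Two graph convolutions with ReLU and optional dropout in between.

    a_norm: (K, K) normalized adjacency matrix; pass rng to enable dropout (training).
    """
    hidden = np.maximum(a_norm @ x @ w1, 0.0)  # project to a low-dimensional space + ReLU
    if rng is not None:
        keep = rng.random(hidden.shape) >= drop_rate
        hidden = hidden * keep / (1.0 - drop_rate)  # inverted dropout
    return a_norm @ hidden @ w2  # embed into the category space (logits)

def node_cross_entropy(logits, labels):
    """Mean cross-entropy between node logits and integer node labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy setup (assumed): 4 nodes, 5 input features, 3 hidden units, 2 categories.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))
a_norm = np.eye(4)
w1 = rng.normal(size=(5, 3)) * 0.1
w2 = rng.normal(size=(3, 2)) * 0.1
labels = np.array([1, 0, 0, 1])  # stands in for the block labels from S3
train_logits = gcn_forward(x, a_norm, w1, w2, rng=rng)
loss = node_cross_entropy(train_logits, labels)
node_classes = gcn_forward(x, a_norm, w1, w2).argmax(axis=1)  # inference, no dropout
```

Dropout is applied only when a random generator is passed, which mirrors the usual distinction between training and prediction.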
[0085] S6. Use pseudo-labels to train a semantic segmentation network that receives the original image and outputs the predicted semantic segmentation results.
[0086] The graph structure data and the adjacency matrix information output by the semantic association network are input into the graph convolutional network. The network performs category prediction for each node, and the category of the shape block where each node is located is the category label of all pixels in that shape block, thus achieving accurate segmentation of pseudo-labels.
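Assigning the predicted node category to every pixel of the corresponding shape block reduces to an index lookup. The sketch below assumes that the shape block map stores, for each pixel, the index of its node; the function name `nodes_to_pseudo_label` and the toy values are illustrative.

```python
import numpy as np

def nodes_to_pseudo_label(node_classes, segments):
    """Give every pixel the category predicted for its shape block.

    node_classes: (K,) predicted category per node.
    segments:     (H, W) int array of shape block indices in 0..K-1.
    """
    return np.asarray(node_classes)[segments]

# Toy input (assumed): three shape blocks with predicted categories 0, 2, and 1.
segments = np.array([[0, 0, 1],
                     [2, 2, 1]])
pseudo_label = nodes_to_pseudo_label([0, 2, 1], segments)
```

The resulting array has the same height and width as the image and can be used directly as a pixel-level pseudo-label.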
[0087] The precise pseudo-labels obtained in S5 are used as pixel-level label data for training a supervised semantic segmentation network. The DeepLab v2 network is used as the supervised semantic segmentation network. RGB color aerial images are input into this semantic segmentation network, and supervised learning is performed using the corresponding pseudo-label data. The trained supervised semantic segmentation network can then be applied to semantic segmentation tasks with large amounts of unlabeled data.
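The supervision signal used in S6 can be illustrated by a pixel-wise cross-entropy between the segmentation output and the pseudo-labels. The sketch below operates on raw class scores in NumPy and stands in for the loss only; it does not implement DeepLab v2, and the function name `pixel_cross_entropy` and the toy arrays are assumptions.

```python
import numpy as np

def pixel_cross_entropy(logits, pseudo_label):
    """Mean cross-entropy between per-pixel class scores and integer pseudo-labels.

    logits:       (H, W, C) class scores from a segmentation network.
    pseudo_label: (H, W) int array of categories, equivalent to one-hot targets.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, pseudo_label[..., None], axis=-1)
    return float(-picked.mean())

# Toy input (assumed): a 2x2 image with two categories.
pseudo_label = np.array([[0, 1],
                         [1, 0]])
uniform_loss = pixel_cross_entropy(np.zeros((2, 2, 2)), pseudo_label)
confident = np.where(np.eye(2)[pseudo_label] == 1, 5.0, -5.0)
confident_loss = pixel_cross_entropy(confident, pseudo_label)
```

Selecting the log-probability of the labeled category is equivalent to a cross-entropy against the one-hot encoding of the pseudo-label.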
[0088] The weakly supervised semantic segmentation method based on shape block correlation proposed in this invention includes the following modules:
[0089] 1. Shape block segmentation module: It inputs the original image into the shape block generation network and generates oversegmented shape blocks through the fully convolutional network. This module requires the use of a pre-trained network.
[0090] 2. Shape Block Pooling Module: It performs max pooling on the class activation map within each shape block region obtained above. After pooling, the activation values within the shape block can be extended to the pixels in the entire shape block, thereby obtaining a more complete and accurate activation region when distinguishing confidence regions based on activation value thresholds.
[0091] 3. Shape Block Association Calculation Module: It uses shape blocks as the basic unit for association calculation. Based on the shape block representation, a semantic association matrix is obtained after the image passes through the shape block association network. The training is constrained by a hybrid loss function. The network model parameters of the classification network and the shape block generation network are obtained by pre-training, for which the user can prepare the dataset, and are then kept fixed; only the parameters of the association network are updated.
[0092] 4. Graph Convolution Module: Uses superpixel segmentation results to form shape blocks with semantic information. These shape blocks are used as nodes in a graph convolutional network. The graph convolutional network is trained on the input graph structure data and outputs the category to which each node belongs, as well as the object category to which each shape block belongs, thus obtaining semantic pseudo-labels for training a supervised semantic segmentation network.
[0093] The features and advantages of specific embodiments of the present invention are described in detail below with reference to the accompanying drawings:
[0094] 1) Pooled class activation map
[0095] The class activation map (CAM) generated by the classification network pre-trained on the PASCAL VOC2012 dataset can be used as a seed region for pseudo-label generation. A threshold is then used to distinguish the activation regions. However, the CAM activation regions obtained by the classification network often fail to cover the entire object region. Using shape block constraint pooling can obtain more complete and accurate activation regions, specifically as follows:
[0096] Y[i,j,c]=max(X[h:h+ph,w:w+pw,c])
[0097] Figure 1 shows the CAM, the pooled CAM, and the segmentation mask obtained on PASCAL VOC2012. A mask is generated for each object class in each image in the dataset. The complete activation region is used as the confidence region, and the category label of the confidence region is used for training the shape association network.
[0098] 2) Graph Convolutional Networks
[0099] This invention performs weakly supervised semantic segmentation with image-level labels based on graph convolutional networks and shape block correlation networks. It uses a shape block generation network and a CAM pooling module to optimize the CAM activation region. As shown in Figure 2, representing the image with shape blocks and training the semantic association network with shape blocks as the basic units of the image better captures long-distance dependencies.
[0100] 3) Core Idea
[0101] This invention performs weakly supervised semantic segmentation of image-level labels based on a shape block association network. It uses a shape block generation network and a CAM pooling module to optimize the CAM activation region, uses shape blocks as the basic units of the image to train the semantic association network, and uses a graph convolutional network to classify the shape blocks, forming refined and complete pseudo-label data.
[0102] 4) Model Design
[0103] Based on the above analysis, the present invention designs the process framework shown in Figure 1. This embodiment of the invention mainly consists of the following modules:
[0104] 1. Shape Block Segmentation Module. Here, the Superpixel-FCN pre-trained network is used to segment the original image into shape blocks of the required size. Each shape block has basic semantic information and belongs to the same object category by default.
[0105] 2. Classification Network. This network uses ResNet50 as the backbone network. It obtains class activation maps based on the input image, and then performs thresholding and pooling on the activation values inside the shape blocks to obtain class activation maps with clear boundaries that extend to the entire object region.
[0106] 3. Relevance Calculation Module. Shape blocks are used as the basic unit for relation network calculation to perform semantic relation calculation. The resulting relation matrix is used as the initialization matrix for random walks to refine the seed region.
[0107] 4. Graph Convolution Module. This module uses graph convolutional networks to classify node categories in graph-structured data where shape blocks are nodes, thus obtaining refined pseudo-labels.
[0108] 5. Loss Function: In classification networks, the goal is for the network's predictions to be as close as possible to the true class, so cross-entropy loss is used. Similarly, in graph convolutional networks, the goal is for the network's class information for each node to be as close as possible to the true node's class label, so cross-entropy loss is also used. In correlation networks, the correlation between shape blocks should consider not only high-level semantic correlation but also low-level semantic information such as color and position. Therefore, L1 loss, which represents the low-level semantic difference, is added to the cross-entropy loss as the overall loss function.
[0109] 5) Training Process
[0110] 1. Pre-trained model.
[0111] The classification network is pre-trained on the ImageNet dataset, and it is necessary to ensure that all image categories appearing in the PASCAL VOC2012 dataset also appear in the pre-training dataset.
[0112] 2. Training process.
[0113] During training, the classification network and shape block segmentation network are no longer updated; only the parameters of the correlation network and graph convolutional network are updated. Backpropagation of errors is used for model training.
[0114] 3. Testing process.
[0115] After the correlation network and graph convolutional network are trained, the original image is input for prediction to obtain semantic segmentation pseudo-labels. The testing process only performs the forward process and does not update the network model parameters.
[0116] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Unless otherwise specified, an element defined by the phrase "comprising..." or "including..." does not exclude the presence of additional elements in the process, method, article, or terminal device that includes said element. Additionally, in this document, "greater than," "less than," "exceeding," etc., are understood to exclude the stated number; "above," "below," "within," etc., are understood to include the stated number.
[0117] Although the above embodiments have been described, those skilled in the art, once they understand the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the above descriptions are merely embodiments of the present invention and do not limit the scope of patent protection of the present invention. Any equivalent structural or procedural transformations made using the content of the present invention's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the scope of patent protection of the present invention.
Claims
1. A weakly supervised semantic segmentation method based on shape block semantic association, characterized in that it includes the following steps:
Step S1: Input the original image into the classification network to obtain the class activation map;
Step S2: Obtain graph structure data with shape blocks as nodes from the original image through the shape segmentation module;
Step S3: Use the shape block partitioning results from step S2 to perform shape block pooling on the class activation map to obtain a pooled class activation map. In step S3, the fully convolutional network divides the input image into shape blocks. On the class activation map obtained by fusing information from each feature layer of the classification network, thresholding and pooling operations are performed on the activation values within each shape block. The activation values of all pixels within each shape block are pooled, specifically as follows: Y[i,j,c]=max(X[h:h+ph,w:w+pw,c]), where Y is the output feature map after pooling, i and j represent the height and width coordinates of the output feature map, c represents the channel index, h and w represent the starting coordinates in the input feature map, and ph and pw represent the height and width of the pooling region. Max pooling is performed on the pixels within the region to obtain the pooled class activation map. The pooled class activation map produces an activation region with clear boundaries that can extend to the entire object. After thresholding, the activation region can be used as a seed region for pseudo-label generation;
Step S4: Train the semantic association network using the confidence regions in the pooled class activation map. Step S4 trains a shape-block-based semantic association network to refine the seed region. The semantic association network uses ResNet50 as its backbone. It determines whether shape blocks belong to the same semantic category by calculating the association degree between each shape block and its neighboring shape blocks. During training, the category information within the seed region obtained in S3 is used as label data. If two shape blocks belong to the same category, the association degree label is 1; otherwise, it is 0. The training process is constrained by a loss function on low-level semantic similarity and shape block class consistency, expressed by a formula in which q_i represents the predicted class probability distribution for pixel i, p_i represents the class probability distribution for pixel i obtained from the partial pseudo-labels, and N represents the number of pixels used to calculate the loss. The first part of the loss function represents the shape block class consistency constraint, and the second part, Φ_{i,j}, represents the low-level semantic similarity constraint between pixels i and j, expressed by a formula in which c_i and c_j represent the color information of pixels i and j, f_i and f_j represent the position information of pixels i and j, and σ1 and σ2 are hyperparameters. Low-level semantic similarity participates in the calculation of the association degree by constraining the color and position information of pixels. After the network is trained, it takes the image divided into shape blocks as input and outputs the association degree matrix, which is used as the adjacency matrix of the graph convolutional network for semantic classification of graph nodes;
Step S5: The graph convolutional network uses the adjacency matrix output by the semantic association network to perform semantic classification of graph nodes and aggregate them into pseudo-labels. The adjacency matrix reflects the degree of association between nodes. During the forward propagation of the graph convolutional network, each layer uses the adjacency matrix of the nodes to perform information aggregation and feature extraction. The graph convolution performs semantic propagation, expressed by a formula in which X and X′ represent the input and output of the graph convolution, respectively, k denotes a learned convolutional filter of size k, which plays the role of a square convolutional filter on graph-structured data, the output embedding X′ of each node is a weighted sum of the outputs of these filters, A represents the adjacency matrix, and D is the diagonal degree matrix. The graph convolutional layer extracts node features and the strength of relationships between nodes, and therefore can classify nodes. The output layer of the graph convolutional network performs semantic classification on the nodes and aggregates shape blocks of the same category to form pseudo-labels;
Step S6: Train a supervised semantic segmentation network using pseudo-labels. This network receives the original image and outputs the predicted semantic segmentation results.
2. The weakly supervised semantic segmentation method based on shape block semantic correlation as described in claim 1, characterized in that, Step S1 uses a classification network to obtain a class activation map for generating confidence regions. The classification network uses ResNet50 as the backbone network and is trained using the PASCAL VOC2012 dataset. The training specifically includes inputting the original image into the classification network, the network output being a probability vector of the class, calculating the loss by combining it with the one-hot encoding of the real image class label using the cross-entropy loss function, and completing the training using gradient backpropagation.
3. The weakly supervised semantic segmentation method based on shape block semantic correlation as described in claim 1, characterized in that, Step S2 uses a fully convolutional network to divide the input image into shape blocks and construct graph structure data with shapes as nodes. Specifically, it includes: using shape blocks as nodes, each image yields a graph structure data, and the node information includes category label, location information, and RGB information; the edges between nodes are represented by an adjacency matrix to indicate the degree of association between nodes. The graph structure data is used by the graph convolutional network to perform semantic classification of shape block categories. The steps include: graph structure data generation, graph convolutional network construction, and graph node semantic classification.
4. The weakly supervised semantic segmentation method based on shape block semantic correlation as described in claim 1, characterized in that, in step S6, the sub-masks of each category obtained using the graph convolutional network constitute a mask dataset, which is used for training the supervised semantic segmentation network. The semantic segmentation network outputs, for each pixel, the probability of each category. The training is constrained by the cross-entropy loss between this output and the one-hot encoding of the labels in the mask dataset. Here, the supervised semantic segmentation network is the DeepLab v2 network.
5. The weakly supervised semantic segmentation method based on shape block semantic correlation as described in claim 1, characterized in that, The graph convolutional network includes a CGNConv layer, a ReLU nonlinear layer, and a dropout operation. The CGNConv layer projects the feature vector into a low-dimensional space. After activation by the ReLU layer, it undergoes a dropout operation to prevent overfitting. The final CGNConv layer embeds the low-dimensional node representation into the class space, thereby achieving node classification.
6. The weakly supervised semantic segmentation method based on shape block semantic correlation as described in claim 1, characterized in that, The graph structure data and the adjacency matrix information output by the semantic association network are input into the graph convolutional network. The network performs category prediction for each node, and the category of the shape block where each node is located is the category label of all pixels in that shape block, thus achieving pixel-level accurate segmentation of pseudo-labels.