Weakly supervised semantic segmentation method, system, device and medium based on random combination

Multiple classification networks are trained on original images, image slices, and randomly combined slices, and their prediction results supervise one another. This addresses the incomplete region perception of existing single-network methods and improves the prediction accuracy of the semantic segmentation model.

CN115761234B · Active · Publication Date: 2026-06-26 · YANGTZE DELTA REGION INST OF UNIV OF ELECTRONICS SCI & TECH OF CHINE (HUZHOU)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
YANGTZE DELTA REGION INST OF UNIV OF ELECTRONICS SCI & TECH OF CHINE (HUZHOU)
Filing Date
2022-11-22
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing weakly supervised semantic segmentation methods ignore the perception of different regions in the same image by different networks and different CAM methods, resulting in obtaining only a semantic segmentation training set with partial image semantic annotations, which cannot make full use of the semantic segmentation results of multiple networks and multiple CAM methods.

Method used

Three classification networks (ResNet50, InceptionV3, and DenseNet121) are trained on different inputs: the original images, image slices, and randomly combined slices. Each network is trained with a cross-entropy loss and is additionally supervised by the activation maps of the previously trained networks, which are obtained with CAM, Grad-CAM, and Grad-CAM++. The three prediction results are then combined by a weighted average to form the training dataset for the semantic segmentation model.

Benefits of technology

By using multi-network mutual supervision and random combination of image slices, more prediction regions are activated, improving the prediction accuracy of the semantic segmentation model, overcoming the limitations of single-network methods, and achieving better semantic segmentation results.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN115761234B_ABST
Patent Text Reader

Abstract

The application belongs to the field of computer vision and discloses a weakly supervised semantic segmentation method, system, device and medium based on random combination. The method comprises the following steps: classification networks N1, N2 and N3 are trained using a training data set, a slice training data set and a randomly combined slice training data set, respectively, so that each network extracts different activation regions in the picture, and each network learns from the results of the other two networks through mutual supervision training; finally, the prediction results of the three networks are combined to obtain a combined semantic segmentation result, which is used as a semantic segmentation training data set to train a semantic segmentation model that predicts the final semantic segmentation result. The application effectively utilizes the different regions that the networks perceive in randomly combined slice pictures, and uses different classification networks to perceive the categories of the same picture, thereby improving the semantic segmentation ability and prediction accuracy of the semantic segmentation model through the semantic segmentation data set obtained by combining the results of the three classification networks.

Description

Technical Field

[0001] This invention belongs to the field of computer vision, and in particular relates to a weakly supervised semantic segmentation method, system, device and medium based on random combination.

Background Technology

[0002] Currently, semantic segmentation, as one of the three major tasks in computer vision, has many applications in real-world scenarios, such as medical image processing, autonomous driving, and environmental perception. However, due to the difficulty in collecting its training dataset, semantic segmentation tasks are significantly limited in real-world applications.

[0003] To address the lack of training datasets for semantic segmentation tasks, researchers have proposed weakly supervised semantic segmentation. This method utilizes image-level labeled datasets to obtain pixel-level labeled semantic segmentation datasets, thereby reducing the amount of annotations required for training semantic segmentation models.

[0004] Existing weakly supervised semantic segmentation tasks can be broadly categorized into two types. One type uses the class activation map (CAM) of a classification network as a seed, refining it through various methods to obtain a semantic segmentation dataset. The other type first obtains a semantic segmentation dataset using the methods described above, then trains a semantic segmentation network using the dataset, and finally refines the semantic segmentation dataset through self-learning to obtain the final semantic segmentation network.

[0005] Based on the above analysis, the problems and shortcomings of the existing technology are as follows:

[0006] Most existing methods use a single method or network for weakly supervised semantic segmentation. This type of weakly supervised semantic segmentation ignores the perception of different regions in the same image by different networks and different CAM methods, and always obtains a semantic segmentation training set with only partial image semantic annotations.

Summary of the Invention

[0007] To address the problems existing in the prior art, this invention provides a weakly supervised semantic segmentation method, system, device, and medium based on random combination.

[0008] This invention is implemented as follows: a weakly supervised semantic segmentation method based on random combination, wherein the weakly supervised semantic segmentation method based on random combination includes:

[0009] Classification networks N1, N2, and N3 are trained using a training dataset, a slice training dataset, and a randomly combined slice training dataset, respectively. The prediction results of the three networks are combined to obtain the final semantic segmentation result, which is then used as the semantic segmentation training dataset to train the semantic segmentation model. The semantic segmentation model is then used to predict the semantic segmentation result of the final image.

[0010] Furthermore, the specific process of the weakly supervised semantic segmentation method based on random combination is as follows:

[0011] A classification network N1 is trained on a training dataset with image category labels; a classification network N2 is trained using the classification network N1 and a slice training dataset obtained by slicing the training dataset; a classification network N3 is trained using N1, N2, and a randomly combined slice training dataset; the semantic segmentation results predicted by the classification networks N1, N2, and N3 on the training dataset are combined to obtain a semantic segmentation training dataset; a semantic segmentation model is trained on the semantic segmentation training dataset to predict images, and the prediction results are evaluated using the mean intersection over union (mIoU).

[0012] Furthermore, the classification network N1 is a ResNet50 network with 50 neural network layers, namely 49 convolutional layers and 1 fully connected layer. The input of the classification network N1 is a 224×224 image, and the output is a classification score of 1×1000.

[0013] The classification network N2 uses the InceptionV3 network, which has 48 neural network layers, including 47 convolutional layers and one fully connected layer. The input is a 299×299 image, and the output is a classification score of 1×1000.

[0014] The classification network N3 uses the DenseNet121 network, which has 121 layers, including 120 convolutional layers and one fully connected layer. The input is a 224×224 image, and the output is a classification score of 1×1000.

[0015] The semantic segmentation network model uses the DeepLab V3 network.

[0016] Furthermore, during the training of the classification network N2, the semantic segmentation results of the classification network N1 are used as label supervision for N2 and constitute part of the training loss of N2;

[0017] In the process of training the classification network N3, the semantic segmentation results of the classification network N1 and the semantic segmentation results of the classification network N2 are used as labels to supervise the training of the classification network N3, which constitutes part of the training loss of N3.

[0018] Furthermore, the loss function used when training the classification network is the cross-entropy loss function, and the specific formula is as follows:

[0019] $$L_{CE} = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$

[0020] In the formula, $y_i$ is the true label of class $i$, $\hat{y}_i$ is the predicted label of class $i$, and $C$ is the total number of categories.

[0021] Furthermore, the training of classification network N2 also includes semantic supervision from classification network N1. The CAM semantic segmentation prediction result obtained by classification network N1 is A1, and the Grad-CAM semantic segmentation prediction result obtained by classification network N2 is A2. The squared difference between A1 and A2 is added as a loss to the loss function of classification network N2, and the specific formula is as follows:

[0022] $$L_{N2} = -\sum_{i=1}^{C} y_i \log \hat{y}_i + (A_1 - A_2)^2$$

[0023] Furthermore, the training of classification network N3 also includes semantic supervision from classification networks N1 and N2. The Grad-CAM++ semantic segmentation prediction result obtained by classification network N3 is A3. The squared difference between A1 and A3 and the squared difference between A2 and A3 are added as losses to the loss function of classification network N3, with the specific formula as follows:

[0024] $$L_{N3} = -\sum_{i=1}^{C} y_i \log \hat{y}_i + (A_1 - A_3)^2 + (A_2 - A_3)^2$$
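The mutual-supervision losses described in [0021] and [0023] might be sketched in NumPy as follows. This is only an illustrative sketch: the function names are hypothetical, the activation maps A1, A2 and A3 are assumed to have already been brought to the same spatial shape, and averaging the squared difference over pixels is an assumption, since the text does not state how the per-pixel differences are reduced to a scalar.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label vector and predicted probabilities."""
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0)
    return float(-np.sum(np.asarray(y_true, dtype=np.float64) * np.log(y_pred)))

def squared_difference(a, b):
    """Squared difference between two activation maps.

    Assumption: the per-pixel squared differences are averaged into one scalar.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def loss_n2(y_true, y_pred, a1, a2):
    """Loss for N2: cross-entropy plus supervision from N1's map A1."""
    return cross_entropy(y_true, y_pred) + squared_difference(a1, a2)

def loss_n3(y_true, y_pred, a1, a2, a3):
    """Loss for N3: cross-entropy plus supervision from A1 and A2."""
    return (cross_entropy(y_true, y_pred)
            + squared_difference(a1, a3)
            + squared_difference(a2, a3))

# Toy values: three classes, 2x2 activation maps.
y_true = np.array([0.0, 1.0, 0.0])
y_pred = np.array([0.25, 0.5, 0.25])
a1 = np.zeros((2, 2))
a2 = np.full((2, 2), 0.5)
a3 = np.ones((2, 2))

n2_loss = loss_n2(y_true, y_pred, a1, a2)
n3_loss = loss_n3(y_true, y_pred, a1, a2, a3)
```

In an actual training loop, A1 and A2 would presumably be treated as fixed targets produced by the already trained networks, so that only the network currently being trained receives gradients.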

[0025] Furthermore, the semantic segmentation training dataset is obtained by combining the semantic segmentation prediction results of CAM, Grad-CAM and Grad-CAM++ using a weighted average method. The weight of each semantic segmentation prediction result is calculated from the highest value in that result, and the three results are then combined according to their weights to form the final prediction result.
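One possible reading of this weighted average is sketched below. The text only says that each weight is based on the highest value in the corresponding prediction result, so the exact rule used here (each map's peak value divided by the sum of all peak values) is an assumption, and `fuse_activation_maps` is a hypothetical helper name.

```python
import numpy as np

def fuse_activation_maps(maps):
    """Fuse activation maps of identical shape by a peak-based weighted average.

    Assumption: weight_k = max(A_k) / sum_j max(A_j). The source does not give
    the exact weighting rule, only that it depends on each map's highest value.
    """
    stacked = np.stack([np.asarray(m, dtype=np.float64) for m in maps])
    peaks = stacked.reshape(len(maps), -1).max(axis=1)
    total = peaks.sum()
    if total <= 0:
        # Degenerate case: fall back to a plain average.
        weights = np.full(len(maps), 1.0 / len(maps))
    else:
        weights = peaks / total
    # Weighted sum over the first axis gives one fused map.
    fused = np.tensordot(weights, stacked, axes=1)
    return fused, weights

# Toy 2x2 maps standing in for the CAM, Grad-CAM and Grad-CAM++ results.
a1 = np.array([[0.2, 0.8], [0.0, 0.4]])
a2 = np.array([[0.1, 0.4], [0.4, 0.2]])
a3 = np.array([[0.0, 0.2], [0.8, 0.6]])
fused, weights = fuse_activation_maps([a1, a2, a3])
```

With this rule, a map whose strongest activation is weak contributes less to the fused result, which is one plausible motivation for weighting by the highest value.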

[0026] Another object of the present invention is to provide a weakly supervised semantic segmentation system based on random combination that implements the aforementioned weakly supervised semantic segmentation method based on random combination, the weakly supervised semantic segmentation system based on random combination comprising:

[0027] The preprocessing module is used to slice and randomly combine the training dataset to obtain sliced training datasets and randomly combined sliced training datasets.

[0028] The training module is used to train classification networks N1, N2, and N3, as well as a semantic segmentation model, based on the training dataset, the slice training dataset, and a random combination of slice training datasets.

[0029] The prediction module is used to obtain segmentation prediction results using classification networks and semantic segmentation models;

[0030] The evaluation module is used to evaluate the final prediction results.

[0031] Another object of the present invention is to provide a computer device including a memory and a processor, the memory storing a computer program, which, when executed by the processor, causes the processor to perform the steps of the weakly supervised semantic segmentation method based on random combination.

[0032] Another object of the present invention is to provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the weakly supervised semantic segmentation method based on random combination.

[0033] Based on the above technical solutions and the technical problems solved, the advantages and positive effects of the technical solution to be protected by this invention are as follows:

[0034] First, addressing the technical problems existing in the prior art and the difficulty in solving them, this paper closely analyzes, in conjunction with the technical solution to be protected by this invention and the results and data obtained during the research and development process, how the technical solution of this invention solves the technical problems, and the inventive technical effects brought about by solving these problems. The specific description is as follows:

[0035] This invention trains three image classification networks, N1, N2, and N3, and uses the activation maps of the image categories predicted by the three networks for mutual supervision. The inputs to the three classification networks also differ: the original images, sliced images, and randomly combined sliced images, respectively. Because the training images differ, each network extracts different activation regions in the image, and the mutual supervision training method allows each network to learn from the results of the other two networks. Finally, the prediction results of the three networks are combined to obtain a combined semantic segmentation result, which is used as the semantic segmentation training dataset to train a semantic segmentation model for predicting the final image semantic segmentation result. In this way, the semantic segmentation results of multiple classification networks supervise one another, and the use of randomly combined image slices activates more predicted regions.

[0036] Second, considering the technical solution as a whole or from a product perspective, the technical effects and advantages of the technical solution to be protected by this invention are specifically described as follows:

[0037] This invention effectively utilizes the different regions that the networks perceive in randomly combined sliced images, and employs three different classification networks obtained through three different methods to perceive the category of the same image. By combining the three results, a better image semantic segmentation supervision dataset is obtained. This training dataset improves the semantic segmentation ability of the semantic segmentation model and enhances its prediction accuracy.

[0038] Third, as supplementary evidence of the inventive step of the claims of this invention, it is also reflected in the following important aspects:

[0039] Does the technical solution of this invention solve a technical problem that people have long desired to solve but have never been able to successfully address?

[0040] This invention utilizes the semantic segmentation results of different networks for mutual supervision, and provides a weakly supervised semantic segmentation method based on random combination using sliced image blocks and randomly combined image blocks. This activates more prediction regions, improves prediction accuracy, and overcomes the technical difficulty that ordinary weakly supervised semantic segmentation methods cannot utilize the semantic segmentation results of multiple networks and multiple CAM methods.

Attached Figure Description

[0041] Figure 1 is a flowchart of a weakly supervised semantic segmentation method based on random combination provided in an embodiment of the present invention;

[0042] Figure 2 is a schematic diagram illustrating the specific process of the weakly supervised semantic segmentation method based on random combination provided in this embodiment of the invention;

[0043] Figure 3 shows schematic diagrams of the prediction results provided in embodiments of the present invention: (a) the class activation map of N1, (b) the class activation map of N2, (c) the class activation map of N3, and (d) the final integrated class activation map.

Detailed Implementation

[0044] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0045] To enable those skilled in the art to fully understand how the present invention is specifically implemented, this section provides an explanatory description of the embodiments that expand upon the technical solutions of the claims.

[0046] The weakly supervised semantic segmentation method based on random combination provided in this invention includes:

[0047] Step 1: Obtain the training image dataset with image category annotations;

[0048] Step 2: Train a classification network N1 using the training image dataset from Step 1;

[0049] Step 3: Slice the images from Step 1 and use the sliced image patches to train a classification network N2. During the training of N2, the semantic segmentation results of the classification network N1 obtained in Step 2 are used as label supervision for N2, forming part of the training loss of N2;

[0050] Step 4: Randomly combine the sliced images from Step 3 to obtain new training images. Use these randomly combined images to train a classification network N3. During the training of N3, the semantic segmentation results of classification network N1 obtained in Step 2 and of classification network N2 obtained in Step 3 are used together as label supervision for N3, forming part of the training loss of N3.

[0051] Step 5: Duplicate the training image from Step 1 into three copies. For each copy, use the classification network N1 obtained in Step 2, the classification network N2 obtained in Step 3, and the classification network N3 obtained in Step 4 to obtain the corresponding semantic segmentation result. Then, integrate the three semantic segmentation results using an ensemble algorithm to obtain the final semantic segmentation result of the training image, which will be used as the training data for the semantic segmentation model.

[0052] Step 6: Train a semantic segmentation model using the semantic segmentation training dataset obtained in Step 5;

[0053] Step 7: Use the semantic segmentation model trained in Step 6 to predict the semantic segmentation results of the image;

[0054] Step 8: Evaluate the semantic segmentation results obtained in Step 7.
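The slicing in Step 3 and the random combination in Step 4 could be sketched as below. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the 3×3 grid is taken from the application example later in the text, remainder pixels are simply dropped when the image size is not divisible by the grid, and "random combination" is interpreted as shuffling the patches of a single image and reassembling them in a grid (the text does not say whether patches from different images may be mixed).

```python
import numpy as np

def slice_image(image, grid=3):
    """Cut an (H, W, C) image into grid x grid equal patches, row by row.

    Assumption: rows/columns that do not fit evenly are dropped.
    """
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    return [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            for r in range(grid) for c in range(grid)]

def random_combine(patches, grid=3, rng=None):
    """Shuffle the patches and reassemble them into one new image.

    Assumption: all patches come from the same image and have the same shape.
    Returns the new image and the permutation that was applied.
    """
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(patches))
    rows = [np.concatenate([patches[i] for i in order[r * grid:(r + 1) * grid]],
                           axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0), order

# Toy 6x6 RGB "image" whose pixel values are all distinct.
image = np.arange(6 * 6 * 3).reshape(6, 6, 3)
patches = slice_image(image)                      # input for N2
combined, order = random_combine(patches, rng=np.random.default_rng(0))  # input for N3
```

Keeping the permutation `order` is a design choice of this sketch: it would allow the activation maps of N1 and N2 to be rearranged in the same way before they are compared with the map of N3.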

[0055] Furthermore, in step two, a ResNet50 network was used as the classification network N1, which has 50 layers: 49 convolutional layers and 1 fully connected layer. Its input is a 224×224 image, and its output is a classification score of 1×1000.

[0056] Furthermore, in step three, the InceptionV3 network was used as the classification network N2, which has 48 layers. Its input is a 299×299 image, and its output is a classification score of 1×1000.

[0057] Furthermore, in step four, a DenseNet121 network was used as the classification network N3, which has 121 layers. Its input is a 224×224 image, and its output is a classification score of 1×1000.

[0058] Furthermore, the loss function used when training the classification networks N1, N2, and N3 is the cross-entropy loss function, with the specific formula as follows:

[0059] $$L_{CE} = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$

[0060] Where $y_i$ is the true label of class $i$, $\hat{y}_i$ is the predicted label of class $i$, and $C$ is the total number of categories.

[0061] Furthermore, as a preferred technical solution, in step three, when training the classification network N2, in addition to using the ordinary cross-entropy loss, semantic supervision from the classification network N1 is also added. Let the semantic segmentation prediction result obtained by the classification network N1 be A1, and the semantic segmentation prediction result obtained by the classification network N2 be A2. The squared difference between A1 and A2 is added as a loss to the loss function of the classification network N2. The specific formula is as follows:

[0062] $$L_{N2} = -\sum_{i=1}^{C} y_i \log \hat{y}_i + (A_1 - A_2)^2$$

[0063] Furthermore, in step four, when training the classification network N3, in addition to using the ordinary cross-entropy loss, semantic supervision from classification networks N1 and N2 is also added. Let the semantic segmentation prediction result obtained by classification network N3 be A3. The squared difference between A1 and A3 and the squared difference between A2 and A3 are added as losses to the loss function of classification network N3. The specific formula is as follows:

[0064] $$L_{N3} = -\sum_{i=1}^{C} y_i \log \hat{y}_i + (A_1 - A_3)^2 + (A_2 - A_3)^2$$

[0065] Furthermore, in steps two, three, and four, the classification networks N1, N2, and N3 respectively obtain their semantic segmentation results using the three methods CAM, Grad-CAM, and Grad-CAM++.
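Of the three methods named here, plain CAM is the simplest to illustrate: for a network that ends in global average pooling followed by a fully connected layer, the activation map of a class is the sum of the final convolutional feature maps weighted by that class's fully connected weights. The NumPy sketch below shows only this step; the function name is hypothetical, and the ReLU and peak normalization are common post-processing choices assumed here, not steps stated in the text. Grad-CAM and Grad-CAM++ additionally need gradients from the network and are not shown.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Compute a CAM for one class.

    features:   (K, H, W) feature maps from the last convolutional layer.
    fc_weights: (num_classes, K) weights of the final fully connected layer.
    Assumption: negative responses are clipped and the map is scaled to [0, 1].
    """
    features = np.asarray(features, dtype=np.float64)
    weights = np.asarray(fc_weights, dtype=np.float64)[class_idx]
    cam = np.tensordot(weights, features, axes=1)   # weighted sum over K -> (H, W)
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: K = 2 feature maps of size 2x2 and two classes.
features = np.array([[[1.0, 0.0], [0.0, 2.0]],
                     [[0.0, 1.0], [1.0, 0.0]]])
fc_weights = np.array([[1.0, 0.5],
                       [-1.0, 2.0]])
cam_class0 = class_activation_map(features, fc_weights, 0)
cam_class1 = class_activation_map(features, fc_weights, 1)
```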

[0066] Furthermore, in step five, a weighted average method is used to obtain the final semantic segmentation result. The weight of each semantic segmentation prediction result is calculated based on the highest numerical value, and then the three results are combined according to their weights to obtain the final prediction result.

[0067] Furthermore, in step six, the obtained semantic segmentation training data is used to train a DeepLab V3 network as the semantic segmentation prediction network.

[0068] Furthermore, in step eight, the semantic segmentation results obtained by this method are evaluated using the mean intersection over union (mIoU).
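For reference, the mean intersection over union of a predicted label map against a ground-truth label map can be computed as in the following sketch. The function name is hypothetical, and skipping classes that appear in neither map is an assumption of this sketch; benchmark evaluations typically accumulate the intersections and unions over the whole test set before averaging, which can be imitated here by passing concatenated label arrays.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection over union across classes for integer label maps.

    Assumption: classes absent from both pred and target are skipped.
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class c does not occur in either map
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy 2x2 label maps with classes 0 and 1 (class 2 never occurs).
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=3)
```

In this toy case class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so the mean is 7/12.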

[0069] To demonstrate the inventiveness and technical value of the technical solution of this invention, this section provides specific product or related technology application examples of the technical solution claimed.

[0070] The specific application process of the weakly supervised semantic segmentation method based on random combination provided in this embodiment of the invention includes:

[0071] Step 1: Obtain a training image dataset with image category labels. This invention uses the VOC2012 dataset, which contains 10,582 training images and 1,449 test images.

[0072] Step 2: Train a ResNet50 classification network N1 using the VOC2012 training dataset from Step 1. This network model has already been pre-trained on the ImageNet dataset. The original network consisted of 49 convolutional layers and 1 fully connected layer, with the final fully connected layer outputting a classification score of 1×1000. This invention retains the first 49 convolutional layers and uses a new fully connected layer to output a classification score of 1×20.

[0073] Step 3: Each training image in the VOC2012 training dataset is sliced into 9 patches by dividing it into 3 equal parts along both its length and its width. The patches are then fed into the InceptionV3 classification network N2. This network model has already been pre-trained on the ImageNet dataset. The network has 48 layers, with the last fully connected layer outputting a classification score of 1×1000. This invention retains the preceding 47 convolutional layers and adds a new fully connected layer at the end, outputting a classification score of 1×20. Simultaneously, the CAM results of the ResNet50 classification network N1 from Step 2 are used to supervise the InceptionV3 classification network N2. The loss function of the InceptionV3 classification network N2 is as follows:

[0074] $$L_{N2} = -\sum_{i=1}^{C} y_i \log \hat{y}_i + (A_1 - A_2)^2$$

[0075] Where $y_i$ is the true label of class $i$, $\hat{y}_i$ is the predicted label of class $i$, and $C$ is the total number of classes. The CAM result obtained by classification network N1 is A1, and the Grad-CAM result obtained by classification network N2 is A2.

[0076] Step 4: Randomly combine the sliced images obtained in Step 3 into a new image and input it into the DenseNet121 classification network N3. This network has 121 layers, and the last fully connected layer outputs a classification score of 1×1000. This invention keeps the first 120 convolutional layers unchanged and attaches a new fully connected layer that outputs a classification score of 1×20. Simultaneously, the CAM results of the ResNet50 classification network N1 from Step 2 and the Grad-CAM results of the InceptionV3 classification network N2 from Step 3 are used to supervise the results of the DenseNet121 classification network N3. The loss function of the DenseNet121 classification network N3 is as follows:

[0077] $$L_{N3} = -\sum_{i=1}^{C} y_i \log \hat{y}_i + (A_1 - A_3)^2 + (A_2 - A_3)^2$$

[0078] The Grad-CAM++ result obtained from the DenseNet121 classification network N3 is A3.

[0079] Step 5: After obtaining the three fully trained classification networks, use the training data of the VOC2012 semantic segmentation dataset from Step 1 again. Apply the classification network N1 obtained in Step 2, the classification network N2 obtained in Step 3, and the classification network N3 obtained in Step 4 to obtain the corresponding class activation maps using CAM, Grad-CAM, and Grad-CAM++, respectively. The weighted average of the three results is then used as the pseudo ground-truth semantic segmentation labels for the training dataset. The class activation maps obtained by the three methods and the final integrated class activation map are shown in Figure 3.

[0080] Step 6: Use the pseudo semantic segmentation training data of the VOC2012 training dataset obtained in Step 5 to train a DeepLab V3 semantic segmentation model, and calculate the loss function using binary classification cross-entropy.

[0081] Step 7: Use the DeepLab V3 semantic segmentation model trained in Step 6 to predict the semantic segmentation results of the test images in the test dataset from Step 1, and use the mean intersection over union (mIoU) over the classes as the evaluation criterion for the results.

[0082] It should be noted that embodiments of the present invention can be implemented in hardware, software, or a combination of both. The hardware portion can be implemented using dedicated logic; the software portion can be stored in memory and executed by a suitable instruction execution system, such as a microprocessor or dedicated-design hardware. Those skilled in the art will understand that the above-described devices and methods can be implemented using computer-executable instructions and / or included in processor control code, for example, such code provided on a carrier medium such as a disk, CD, or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The devices and modules of the present invention can be implemented by hardware circuitry such as very large-scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field-programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of the above-described hardware circuitry and software, such as firmware.

[0083] The embodiments of the present invention have achieved some positive results during the research and development or use process, and have indeed great advantages compared with the prior art. The following content describes them in conjunction with the data, charts and other information of the experimental process.

[0084] As shown in Table 1, the method of this invention achieved a mean intersection over union (mIoU) of 71.1 on the VOC2012 test set, which is higher than the 51.3 obtained by CAM and the 66.8 obtained by Puzzle-CAM, two comparable methods. All three classification networks used in this invention are pre-trained on the ImageNet dataset. During the training of each classification network, only the parameters of the last fully connected layer are trained, while the parameters of the other layers remain unchanged. The method obtains semantic segmentation prediction results for an image by using different networks and different methods of acquiring class activation maps, and mutual supervision between the networks lets them learn from one another. Furthermore, because the image is sliced and the image patches are randomly combined, each network learns different knowledge, so that what the networks learn is complementary. Finally, combining the three results yields a better class activation map as training data for semantic segmentation.

[0085] Table 1 Evaluation Results

[0086]

[0087]

[0088] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any modifications, equivalent substitutions, and improvements made by those skilled in the art within the scope of the technology disclosed in the present invention, and within the spirit and principles of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A weakly supervised semantic segmentation method based on random combination, characterized in that, The weakly supervised semantic segmentation method based on random combination includes: Classification networks N1, N2, and N3 are trained using a training dataset, a slice training dataset, and a randomly combined slice training dataset, respectively. Classification network N1 obtains semantic segmentation prediction result A1 through CAM, classification network N2 obtains semantic segmentation prediction result A2 through Grad-CAM, and classification network N3 obtains semantic segmentation prediction result A3 through Grad-CAM++. The semantic segmentation prediction results of the three networks are combined to obtain the final semantic segmentation result, which is used as the semantic segmentation training dataset to train the semantic segmentation model. The semantic segmentation model is then used to predict the semantic segmentation result of the final image. The slice training dataset consists of image patches obtained by slicing the images in the training dataset, and the randomly combined slice training dataset consists of new training images obtained by randomly combining the sliced image patches. Classification network N1 is trained using a training dataset with image category labels. Classification network N2 is trained using the semantic segmentation prediction result A1 from classification network N1 and the slice training dataset, with A1 used as label supervision during the training of N2. Classification network N3 is trained using the semantic segmentation prediction result A1 from classification network N1, the semantic segmentation prediction result A2 from classification network N2, and the randomly combined slice training dataset, with A1 and A2 used as label supervision during the training of N3. The semantic segmentation prediction results of classification networks N1, N2, and N3 based on the training dataset are combined using a weighted average method to obtain a semantic segmentation training dataset. A semantic segmentation model is trained using this dataset to predict images, and the prediction results are evaluated using the mean intersection over union (mIoU).

2. The weakly supervised semantic segmentation method based on random combination as described in claim 1, characterized in that, The classification network N1 is a ResNet50 network with 50 layers, including 49 convolutional layers and 1 fully connected layer. The input of the classification network N1 is a 224×224 image, and the output is a classification score of 1×1000. The cross-entropy loss calculated from the classification scores is then used to train the classification network N1. The classification network N2 uses the InceptionV3 network, which has 48 neural network layers, including 47 convolutional layers and one fully connected layer. The input is a 299×299 image, and the output is a classification score of 1×1000. The cross-entropy loss calculated from the classification scores is then used to train the classification network N2. The classification network N3 uses the DenseNet121 network, which has 121 layers, including 120 convolutional layers and one fully connected layer. The input is a 224×224 image, and the output is a classification score of 1×1000. The cross-entropy loss calculated from the classification scores is then used to train the classification network N3. The semantic segmentation network model uses the DeepLab V3 network.

3. The weakly supervised semantic segmentation method based on random combination as described in claim 1, characterized in that, The loss function used when training classification networks N1, N2, and N3 is the cross-entropy loss function, and the specific formula is as follows: $$L_{CE} = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$ In the formula, $y_i$ is the true label of class $i$, $\hat{y}_i$ is the predicted label of class $i$, and $C$ is the total number of categories.

4. The weakly supervised semantic segmentation method based on random combination as described in claim 1, characterized in that, The training of classification network N2 also includes semantic supervision from the classification network N1. The squared difference between A1 and A2 is added as a loss to the loss function of the classification network N2, as shown in the following formula: $$L_{N2} = -\sum_{i=1}^{C} y_i \log \hat{y}_i + (A_1 - A_2)^2$$

5. The weakly supervised semantic segmentation method based on random combination as described in claim 1, characterized in that, The training of classification network N3 also includes semantic supervision from classification networks N1 and N2. The squared difference between A1 and A3 and the squared difference between A2 and A3 are added as losses to the loss function of classification network N3, with the specific formula as follows: $$L_{N3} = -\sum_{i=1}^{C} y_i \log \hat{y}_i + (A_1 - A_3)^2 + (A_2 - A_3)^2$$

6. A weakly supervised semantic segmentation system based on random combination, implementing the weakly supervised semantic segmentation method based on random combination as described in any one of claims 1-5, characterized in that, The weakly supervised semantic segmentation system based on random combination includes: The preprocessing module is used to slice and randomly combine the training dataset to obtain sliced training datasets and randomly combined sliced training datasets. The training module is used to train classification networks N1, N2, and N3 based on the training dataset, the slice training dataset, and the randomly combined slice training dataset, as well as to train a semantic segmentation model using the semantic segmentation training dataset. The prediction module is used to obtain semantic segmentation prediction results using classification networks and semantic segmentation models; The evaluation module is used to evaluate the final prediction results.

7. A computer device, characterized in that, The computer device includes a memory and a processor. The memory stores a computer program, which, when executed by the processor, causes the processor to perform the steps of the weakly supervised semantic segmentation method based on random combination as described in any one of claims 1-5.

8. A computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the weakly supervised semantic segmentation method based on random combination as described in any one of claims 1-5.