Image segmentation method and device, and training method and device of image segmentation model
By using pre-trained graph convolutional neural network and generative adversarial network models, image features are extracted and transformed, solving the problem of inaccurate segmentation caused by insufficient labeled samples and achieving higher segmentation accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- QIANXUN SPATIAL INTELLIGENCE INC
- Filing Date
- 2021-12-20
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, deep learning segmentation algorithms require a large number of labeled samples to achieve good results, and the image segmentation results are not accurate enough when there are no labeled samples.
The sample image features are extracted by a pre-trained graph convolutional neural network model, and the text labels are converted into semantic feature maps by a generative adversarial network model. The similarity between the visual and semantic feature maps is calculated to update the parameters of the generative adversarial network model, and then the image is segmented.
It improves the segmentation accuracy of images with fewer labeled samples, reduces the dependence on labeled samples, and improves the segmentation effect.
Figure CN116342871B_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of image processing technology, and in particular relates to an image segmentation method, apparatus, and training method and apparatus for an image segmentation model.
Background Technology
[0002] Currently, mainstream deep learning segmentation algorithms require a large number of labeled samples for model training to achieve good segmentation results. This extensive data labeling work is not only costly in terms of manpower and resources but also time-consuming. Furthermore, without labeled samples, current image segmentation methods produce inaccurate results.
Summary of the Invention
[0003] This application provides an image segmentation method, apparatus, and training method and apparatus for an image segmentation model, which can solve the technical problem in related technologies that the image segmentation results for categories with a small number of labeled sample images are not accurate enough.
[0004] In a first aspect, embodiments of this application provide a method for training an image segmentation model, the method comprising:
[0005] By using a pre-trained graph convolutional neural network model in the image segmentation model, features of the first type of object set in the sample image are extracted to obtain multiple visual feature maps; wherein, the first type of object set includes multiple objects in the sample image whose segmentation results are pre-annotated.
[0006] Using a generative adversarial network model in the image segmentation model, the word vectors of the text labels of the first type of object set are converted into multiple semantic feature maps of the first type of object set; where the word vectors of the text labels are word vectors obtained by mapping the text labels.
[0007] The similarity is calculated by mapping multiple visual feature maps and multiple semantic feature maps to the target feature space through the loss function of the generative adversarial network model.
[0008] The parameters of the generative adversarial network model are updated using similarity as a constraint.
[0009] Optionally, similarity is calculated by mapping multiple visual feature maps and multiple semantic feature maps to the target feature space using the loss function of the generative adversarial network model, including:
[0010] The first feature vector is obtained by mapping multiple visual feature maps to the target feature space through the first nonlinear transformation.
[0011] The second nonlinear transformation maps multiple semantic feature maps of the first type of object set to the target feature space to obtain the second feature vector.
[0012] Calculate the similarity between the first feature vector and the second feature vector.
[0013] Optionally, before extracting features of the first class of objects in the sample image using a graph convolutional neural network model pre-trained in the image segmentation model, the method further includes:
[0014] The graph convolutional neural network model is trained by multiple sample images, and the images of multiple objects in the first type of object set are segmented to obtain the pre-trained graph convolutional neural network model.
[0015] Using a pre-trained graph convolutional neural network model, features of the first class of objects in the sample images are extracted, including:
[0016] Remove the fully connected layers from the pre-trained graph convolutional neural network model;
[0017] Use a graph convolutional neural network model with the fully connected layers removed to extract features of the first class of objects in the sample images.
[0018] Optionally, before converting the word vectors of the text labels of the first type of object set into multiple semantic feature maps of the first type of object set through a generative adversarial network model in the image segmentation model, the method further includes:
[0019] Retrieve the text labels of the first type of object set;
[0020] The text labels of the first type of object set are mapped to word vectors using a word embedding model;
[0021] Normalize the word vectors to obtain the word vectors of the text labels.
[0022] Optionally, after updating the parameters of the generative adversarial network model with similarity as a constraint, the method further includes:
[0023] The parameters of the pixel-level classifier in the image segmentation model are updated using multiple visual feature maps and multiple semantic feature maps of the first type of object set in the sample image; wherein, the pixel-level classifier is used to output the segmentation result based on multiple visual feature maps and multiple semantic feature maps.
[0024] In a second aspect, embodiments of this application provide an image segmentation method, which is used to perform image segmentation using an image segmentation model trained by the image segmentation model training method provided in the first aspect and any optional implementation thereof. The image segmentation method includes:
[0025] By using the graph convolutional neural network model in the image segmentation model, features of the first type of object set in the target image are extracted to obtain multiple visual feature maps;
[0026] Multiple semantic feature maps of the second type of object set are obtained through the generative adversarial network model in the image segmentation model; wherein the second type of object set includes the first type of object set and at least one new object outside the first type of object set;
[0027] By combining multiple visual feature maps of the first type of object set and multiple semantic feature maps of the second type of object set, a combined feature map is obtained.
[0028] The target image is classified at the pixel level by a classifier based on the combined feature map to obtain the image segmentation result of the target image; wherein the image segmentation result indicates, for each pixel in the target image, the category to which it belongs in the first type of object set or the second type of object set.
[0029] Optionally, multiple semantic feature maps of the second type of object set are obtained through a generative adversarial network model in the image segmentation model, including:
[0030] The text labels of the second type of object set are mapped to word vectors using a word embedding model;
[0031] Normalize the word vectors to obtain the word vectors of the text labels of the second type of object set;
[0032] By using a generative adversarial network model in the image segmentation model, the word vectors of the text labels of the second type of object set are converted into multiple semantic feature maps of the second type of object set.
[0033] In a third aspect, embodiments of this application provide a training apparatus for an image segmentation model, the apparatus comprising:
[0034] The extraction unit is used to extract features of the first type of object set in the sample image through a pre-trained graph convolutional neural network model in the image segmentation model, and obtain multiple visual feature maps; wherein, the first type of object set includes multiple objects in the sample image whose segmentation results are pre-annotated;
[0035] The transformation unit is used to convert the word vectors of text labels of the first type of object set into multiple semantic feature maps of the first type of object set through the generative adversarial network model in the image segmentation model; wherein, the word vectors of the text labels are word vectors obtained by mapping the text labels;
[0036] The computational unit is used to map multiple visual feature maps and multiple semantic feature maps to the target feature space and calculate the similarity using the loss function of the generative adversarial network model.
[0037] The update unit is used to update the parameters of the generative adversarial network model with similarity as a constraint.
[0038] Optionally, the computing unit includes:
[0039] The first mapping subunit is used to map multiple visual feature maps to the target feature space through a first nonlinear transformation to obtain a first feature vector.
[0040] The second mapping subunit is used to map multiple semantic feature maps of the first type of object set to the target feature space through a second nonlinear transformation to obtain a second feature vector;
[0041] The first computational subunit is used to calculate the similarity between the first feature vector and the second feature vector.
[0042] Optionally, the device further includes:
[0043] The segmentation unit is used to perform image segmentation on multiple objects of the first object set by training the graph convolutional neural network model on multiple sample images before extracting features of the first object set in the sample image through the graph convolutional neural network model pre-trained in the image segmentation model, so as to obtain the pre-trained graph convolutional neural network model.
[0044] The extraction unit includes:
[0045] A deletion subunit, used to remove the fully connected layers from the pre-trained graph convolutional neural network model;
[0046] An extraction subunit, used to extract features of the first type of object set in the sample image using the graph convolutional neural network model with the fully connected layers removed.
[0047] Optionally, the device further includes:
[0048] The acquisition unit is used to acquire the text labels of the first type of object set before converting the word vectors of the text labels of the first type of object set into multiple semantic feature maps of the first type of object set through the generative adversarial network model in the image segmentation model.
[0049] The mapping unit is used to map the text labels of the first type of object set into word vectors through the word embedding model;
[0050] The normalization unit is used to perform a normalization operation on the word vectors to obtain the word vectors of the text labels.
[0051] Optionally, the update unit is also used to update the parameters of the pixel-level classifier in the image segmentation model after updating the parameters of the generative adversarial network model with similarity as a constraint, by using multiple visual feature maps and multiple semantic feature maps of the first class of objects in the sample image; wherein the pixel-level classifier is used to output the segmentation result based on multiple visual feature maps and multiple semantic feature maps.
[0052] In a fourth aspect, embodiments of this application provide an image segmentation apparatus for performing image segmentation using an image segmentation model trained by the training method of the image segmentation model provided in the first aspect and any optional implementation thereof. The image segmentation apparatus includes:
[0053] The extraction unit is used to extract features of the first type of object set in the target image through the graph convolutional neural network model in the image segmentation model, and obtain multiple visual feature maps.
[0054] The acquisition unit is used to acquire multiple semantic feature maps of a second type of object set through a generative adversarial network model in the image segmentation model; wherein the second type of object set includes the first type of object set and at least one new object outside the first type of object set;
[0055] The combination unit is used to combine multiple visual feature maps of the first type of object set and multiple semantic feature maps of the second type of object set to obtain a combined feature map.
[0056] The classification unit is used to perform pixel-level classification of the target image based on the combined feature map by a classifier to obtain the image segmentation result of the target image; wherein the image segmentation result indicates, for each pixel in the target image, the category to which it belongs in the first type of object set or the second type of object set.
[0057] Optionally, the acquisition unit includes:
[0058] The mapping subunit is used to map text labels of the second type of object set to word vectors through a word embedding model;
[0059] The normalization subunit is used to perform a normalization operation on the word vectors to obtain the word vectors of the text labels of the second type of object set;
[0060] The transformation subunit is used to transform the word vectors of text labels of the second type of object set into multiple semantic feature maps of the second type of object set through the generative adversarial network model in the image segmentation model.
[0061] In a fifth aspect, embodiments of this application provide an electronic device, which includes a processor and a memory storing program instructions; the processor executes the program instructions to implement the method described in the first or second aspect.
[0062] In a sixth aspect, embodiments of this application provide a readable storage medium storing program instructions that, when executed by a processor, implement the method described in the first or second aspect.
[0063] In a seventh aspect, embodiments of this application provide a program product in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method described in the first or second aspect.
[0064] The image segmentation method, apparatus, and training method, apparatus, electronic device, readable storage medium, and program product of this application extract features of a first type of object set from a sample image using a pre-trained graph convolutional neural network model in the image segmentation model, obtaining multiple visual feature maps. The first type of object set includes multiple objects in the sample image whose segmentation results are pre-annotated. The generative adversarial network model in the image segmentation model converts the word vectors of the text labels of the first type of object set into multiple semantic feature maps. Then, the loss function of the generative adversarial network model maps the multiple visual feature maps and multiple semantic feature maps to the target feature space to calculate similarity. Thus, the parameters of the generative adversarial network model can be updated using similarity as a constraint. According to the embodiments of this application, the technical problem of inaccurate image segmentation results for categories with few labeled sample images in related technologies can be solved, improving the accuracy of image segmentation for categories with few labeled sample images.
Attached Figure Description
[0065] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0066] Figure 1 is a flowchart illustrating a training method for an image segmentation model provided in an optional embodiment of this application;
[0067] Figure 2 is a flowchart illustrating the training method of an image segmentation model provided in another optional embodiment of this application;
[0068] Figure 3 is a schematic flowchart of an image segmentation method provided in one embodiment of this application;
[0069] Figure 4 is a schematic diagram of the structure of a training device for an image segmentation model provided in one embodiment of this application;
[0070] Figure 5 is a schematic diagram of the structure of an image segmentation apparatus provided in one embodiment of this application;
[0071] Figure 6 is a schematic diagram of the structure of an electronic device provided in another embodiment of this application.
Detailed Implementation
[0072] The features and exemplary embodiments of various aspects of this application will be described in detail below. To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only intended to explain this application and not to limit it. For those skilled in the art, this application can be implemented without some of these specific details. The following description of the embodiments is merely to provide a better understanding of this application by illustrating examples.
[0073] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes said element.
[0074] To address the problems of the prior art, embodiments of this application provide an image segmentation method, apparatus, and a training method, apparatus, electronic device, readable storage medium, and program product for an image segmentation model. The training method for the image segmentation model provided in this application embodiment will be described first below.
[0075] Figure 1 shows a flowchart of a training method for an image segmentation model provided in one embodiment of this application. As shown in Figure 1, the method includes the following steps:
[0076] Step 101: Extract features of the first type of object set in the sample image using the pre-trained graph convolutional neural network model in the image segmentation model to obtain multiple visual feature maps.
[0077] In this embodiment, the image segmentation model includes a graph convolutional neural network model and a generative adversarial network model. Optionally, it may also include a classifier, specifically a pixel-level classifier.
[0078] Graph convolutional neural network (GCNN) models are pre-trained neural network models used to extract features from a first set of objects in an image. This first set includes multiple objects in the sample image that have been pre-labeled with segmentation results. In other words, the first set consists of objects with labeled samples. A trained GCNN model can typically achieve good feature extraction results for objects with labeled samples.
[0079] After an image is processed by a graph convolutional neural network model to extract features, multiple visual feature maps can be obtained. These visual feature maps correspond one-to-one with the features of multiple objects extracted from the image.
[0080] In an optional embodiment of this application, before performing step 101 to extract features of the first class of objects in the sample image using a pre-trained graph convolutional neural network model in the image segmentation model, the following step 1011 may be included:
[0081] Step 1011: Train a graph convolutional neural network model using multiple sample images to perform image segmentation on multiple objects in the first type of object set, thereby obtaining a pre-trained graph convolutional neural network model.
[0082] Accordingly, step 101, which extracts features of the first type of object set in the sample image using a pre-trained graph convolutional neural network model, may include performing the following steps 1012-1013:
[0083] Step 1012: Remove the fully connected layers of the pre-trained graph convolutional neural network model;
[0084] Step 1013: Use a graph convolutional neural network model with the fully connected layers removed to extract features of the first class of objects in the sample images.
[0085] In the above optional implementation, the graph convolutional neural network model may include fully connected layers. After the graph convolutional neural network model is trained, the fully connected layers in the model can be removed. In this way, after the image is input into the graph convolutional neural network model with the fully connected layers removed, the features of the first type of object set in the sample image can be obtained, that is, multiple visual feature maps corresponding to multiple objects in the first type of object set.
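The layer-removal idea in steps 1012-1013 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of this application: the `Layer` record, the `kind` tags, and the functions `strip_fully_connected` and `run` are hypothetical names, and the model is reduced to an ordered list of toy layers.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class Layer:
    name: str
    kind: str  # hypothetical tag: "graph_conv" or "fully_connected"
    forward: Callable[[Vector], Vector]


def strip_fully_connected(layers: List[Layer]) -> List[Layer]:
    """Return a copy of the model without its fully connected layers."""
    return [layer for layer in layers if layer.kind != "fully_connected"]


def run(layers: List[Layer], x: Vector) -> Vector:
    """Apply the layers in order to the input."""
    for layer in layers:
        x = layer.forward(x)
    return x


# Toy stand-in for a pre-trained model: two feature layers plus a
# fully connected head that collapses the features into one score.
pretrained = [
    Layer("gconv1", "graph_conv", lambda v: [max(0.0, 2.0 * a) for a in v]),
    Layer("gconv2", "graph_conv", lambda v: [a + 1.0 for a in v]),
    Layer("fc", "fully_connected", lambda v: [sum(v)]),
]

# With the head removed, the model outputs features instead of a score.
feature_extractor = strip_fully_connected(pretrained)
features = run(feature_extractor, [0.5, -1.0, 2.0])
```

In this sketch the original list is left untouched, so the full pre-trained model remains available if the head is needed later.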
[0086] Step 102: Using the generative adversarial network model in the image segmentation model, the word vectors of the text labels of the first type of object set are converted into multiple semantic feature maps of the first type of object set.
[0087] The word vectors for text labels are obtained by mapping the text labels. Each object in the first type of object set is assigned a corresponding text label, such as sky, grass, trees, houses, etc. Optionally, the text labels can be natural language words of any language; for example, text labels can be English words. After being converted into word vectors, the text labels can be represented by vectors. The word vectors corresponding to different English words have different spatial distances depending on the degree of correlation between the English words. The closer the meanings of the words, the closer the spatial distance between their mapped word vectors. This is obtained by training a mapping model.
[0088] Text tags can be mapped to word vectors based on a pre-trained mapping model. Optionally, the pre-trained mapping model can be a word embedding model, such as word2vec.
[0089] In one example, the word vectors of the text labels can be pre-mapped and saved according to the mapping model, and it is only necessary to find the word vector corresponding to the text label.
[0090] In another example, the word vectors for text labels can be calculated each time based on the text label using a mapping model.
[0091] Optionally, before performing step 102, which uses a generative adversarial network model in the image segmentation model to convert the word vectors of the text labels of the first type of object set into multiple semantic feature maps of the first type of object set, the text labels of the first type of object set can be obtained first. Then, the text labels of the first type of object set can be mapped to word vectors using a word embedding model. Subsequently, a normalization operation is performed on the word vectors to obtain the word vectors of the text labels. The normalization operation ensures that each element in the word vector corresponding to each text label is between 0 and 1.
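The label-to-word-vector preprocessing described above can be sketched as follows. The embedding table `EMBEDDINGS` is a hypothetical stand-in for a trained word embedding model such as word2vec, and min-max scaling is an assumption: the text only requires that each element end up between 0 and 1, without naming the normalization.

```python
from typing import Dict, List

# Hypothetical pre-computed embeddings (stand-in for a word2vec lookup).
EMBEDDINGS: Dict[str, List[float]] = {
    "sky": [0.2, -0.4, 1.0],
    "grass": [-1.0, 0.5, 0.0],
}


def min_max_normalize(vector: List[float]) -> List[float]:
    """Rescale elements into [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        return [0.0 for _ in vector]
    return [(v - lo) / (hi - lo) for v in vector]


def label_to_word_vector(label: str) -> List[float]:
    """Look up the embedding of a text label and normalize it."""
    return min_max_normalize(EMBEDDINGS[label])


sky_vec = label_to_word_vector("sky")
grass_vec = label_to_word_vector("grass")
```

Because the table is pre-computed, this corresponds to the example in which word vectors are mapped once, saved, and then looked up by label.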
[0092] Generative Adversarial Networks (GANs) are deep learning models that learn and produce outputs through an interplay between at least two modules: a generative model and a discriminative model. When converting word vectors of text labels for a first-class object set into multiple semantic feature maps for that object set, this can be achieved through the generative model within a GAN, resulting in multiple semantic feature maps corresponding to multiple objects within the first-class object set.
[0093] Step 103: The similarity is calculated by mapping multiple visual feature maps and multiple semantic feature maps to the target feature space through the loss function of the generative adversarial network model.
[0094] After the generative adversarial network model generates multiple semantic feature maps, a loss function is used to calculate the similarity between the multiple visual feature maps and the multiple semantic feature maps. Specifically, the loss function needs to map the multiple visual feature maps and the multiple semantic feature maps to the target feature space, and then calculate the similarity between the visual feature map of each object and the corresponding semantic feature map in the target feature space.
[0095] In one example, the loss function can be constructed as follows:
[0096]
[0097] Where x′ is the visual feature map, and φ and φ′ represent two different unit linear or nonlinear transformations, respectively. Through the transformations φ and φ′, the visual feature map and the multiple semantic feature maps are mapped to the target feature space, and the similarity L_G(s) is calculated.
[0098] Step 104: Update the parameters of the generative adversarial network model with similarity as a constraint.
[0099] After calculating the similarity, the model parameters in the generative adversarial network (GAN) model are updated using the similarity as a constraint. Specifically, the parameter updates can be achieved using model parameter update algorithms from related technologies. For example, updating the parameters of the GAN model can use forward propagation or backpropagation algorithms from related technologies.
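A parameter update driven by a similarity loss can be sketched as follows. Everything here is an assumption made for illustration: the generator is reduced to a single linear map, the squared distance stands in for the loss function of this application (which is not reproduced here), the mappings to the target feature space are omitted, and no discriminator is modeled. The names `generate`, `similarity_loss`, and `update_step` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def generate(weights: Matrix, word_vector: Vector) -> Vector:
    """Toy generator: a linear map from a word vector to a feature vector."""
    return [sum(w * s for w, s in zip(row, word_vector)) for row in weights]


def similarity_loss(generated: Vector, visual: Vector) -> float:
    """Squared distance; smaller means the two features are more similar."""
    return sum((g - v) ** 2 for g, v in zip(generated, visual))


def update_step(weights: Matrix, word_vector: Vector,
                visual: Vector, lr: float) -> Matrix:
    """One gradient descent step on the similarity loss."""
    residual = [g - v for g, v in zip(generate(weights, word_vector), visual)]
    # d(loss)/d(W[i][j]) = 2 * residual[i] * word_vector[j]
    return [
        [w - lr * 2.0 * r * s for w, s in zip(row, word_vector)]
        for row, r in zip(weights, residual)
    ]


word_vector = [0.6, 0.2, 0.3]   # normalized word vector of one label
visual_target = [0.5, -0.1]     # visual feature of the same object
weights = [[0.0] * 3 for _ in range(2)]

initial_loss = similarity_loss(generate(weights, word_vector), visual_target)
for _ in range(20):
    weights = update_step(weights, word_vector, visual_target, lr=0.5)
final_loss = similarity_loss(generate(weights, word_vector), visual_target)
```

Each step pulls the generated feature toward the visual feature, which is the sense in which the similarity acts as a constraint on the generator parameters in this sketch.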
[0100] Specifically, in an optional implementation, step 103, which maps multiple visual feature maps and multiple semantic feature maps to the target feature space and calculates similarity using the loss function of the generative adversarial network model, may include the following steps:
[0101] Step 1031: Map multiple visual feature maps to the target feature space through a first nonlinear transformation to obtain a first feature vector.
[0102] For example, the visual feature map x′ can be mapped to the target feature space using φ′(x′).
[0103] Step 1032: Map multiple semantic feature maps of the first type of object set to the target feature space through a second nonlinear transformation to obtain the second feature vector.
[0104] For example, the semantic feature maps can be mapped to the target feature space using the transformation φ.
[0105] Step 1033: Calculate the similarity between the first feature vector and the second feature vector.
[0106] The similarity can be expressed through the loss function constructed in the example above.
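Steps 1031-1033 can be sketched as follows. The weight matrices `PHI_PRIME` and `PHI`, the use of `tanh` as the nonlinearity, and cosine similarity as the measure are all assumptions chosen for illustration; the text does not fix the form of the two transformations or of the similarity.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def nonlinear_map(weights: Matrix, features: Vector) -> Vector:
    """Project into the target space, then apply tanh element-wise."""
    return [math.tanh(z) for z in matvec(weights, features)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


# Hypothetical weights: visual features are 4-d, semantic features are 3-d,
# and both are mapped into a shared 2-d target feature space.
PHI_PRIME = [[0.5, -0.2, 0.1, 0.0], [0.3, 0.8, -0.5, 0.2]]
PHI = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]

visual_feature = [0.9, 0.1, 0.4, 0.7]
semantic_feature = [0.6, 0.2, 0.3]

first_vector = nonlinear_map(PHI_PRIME, visual_feature)   # step 1031
second_vector = nonlinear_map(PHI, semantic_feature)      # step 1032
similarity = cosine_similarity(first_vector, second_vector)  # step 1033
```

Using two separate transformations lets features of different dimensionality be compared in one space, which a direct element-wise difference between the feature maps would not allow.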
[0107] The image segmentation model in this embodiment may further include a pixel-level classifier. Accordingly, after performing step 104 to update the parameters of the generative adversarial network model with similarity as a constraint, the parameters of the pixel-level classifier in the image segmentation model may be updated using multiple visual feature maps and multiple semantic feature maps of the first class of objects in the sample image. The pixel-level classifier is used to output the segmentation result based on the multiple visual feature maps and multiple semantic feature maps.
[0108] Here, the multiple visual feature maps are the visual feature maps of the first type of object set. The multiple semantic feature maps are the semantic feature maps of the objects in the second type of object set other than the first type of object set. The second type of object set includes the first type of object set and at least one new object outside the first type of object set, that is, an object whose segmentation result is not labeled in the sample image.
[0109] During the debugging phase of updating the parameters of the pixel-level classifier, multiple visual feature maps of the first type of object set and semantic feature maps of objects other than the first type of object set in the second type of object set can be recombined. The recombined pixel-level feature maps are used as input information for the classifier model. Then, the classifier fine-tunes the model based on the input information, thereby realizing knowledge transfer from the known sample category to the zero sample category.
[0110] Optionally, the pixel-level classifier may include a newly created fully connected layer of the graph convolutional neural network. Specifically, after removing the fully connected layer of the graph convolutional neural network, a new fully connected layer can be created and used as a classifier. Accordingly, the parameters of the classifier (the newly created fully connected layer) can be updated by recombining the pixel-level feature maps, thereby enabling the debugging of the classifier.
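The recombination of feature maps and the subsequent classifier adjustment can be sketched as follows. This is a simplified illustration under stated assumptions: each feature vector stands for one pixel, and a nearest-prototype rule replaces the newly created fully connected layer described above. The names `recombine`, `fit_prototypes`, and `classify_pixel` are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def recombine(visual: Dict[str, List[Vector]],
              semantic: Dict[str, List[Vector]]) -> List[Tuple[Vector, str]]:
    """Pair visual features of known classes with generated semantic
    features of classes that have no visual features."""
    samples = [(f, label) for label, feats in visual.items() for f in feats]
    for label, feats in semantic.items():
        if label not in visual:
            samples.extend((f, label) for f in feats)
    return samples


def fit_prototypes(samples: List[Tuple[Vector, str]]) -> Dict[str, Vector]:
    """Stand-in classifier training: one mean feature vector per class."""
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = {}
    for feature, label in samples:
        acc = sums.setdefault(label, [0.0] * len(feature))
        for i, value in enumerate(feature):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def classify_pixel(feature: Vector, prototypes: Dict[str, Vector]) -> str:
    """Assign the class whose prototype is closest to the pixel feature."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda lab: sq_dist(feature, prototypes[lab]))


# Known classes have visual features; "house" is a new, unlabeled class
# represented only by generated semantic features.
visual = {"sky": [[1.0, 0.0], [0.8, 0.2]], "grass": [[0.0, 1.0]]}
semantic = {"sky": [[0.9, 0.1]], "house": [[1.0, 1.0], [0.8, 0.8]]}

samples = recombine(visual, semantic)
prototypes = fit_prototypes(samples)
new_class_pixel = classify_pixel([0.85, 0.95], prototypes)
known_class_pixel = classify_pixel([0.95, 0.05], prototypes)
```

The generated features for "sky" are discarded because real visual features exist for that class, so generated features are used only where labeled samples are missing.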
[0111] The image segmentation model training method of this application embodiment extracts features of a first type of object set from the sample image using a pre-trained graph convolutional neural network model in the image segmentation model, obtaining multiple visual feature maps. The first type of object set includes multiple objects in the sample image whose segmentation results are pre-annotated. The generative adversarial network model in the image segmentation model converts the word vectors of the text labels of the first type of object set into multiple semantic feature maps. Then, the loss function of the generative adversarial network model maps the multiple visual feature maps and multiple semantic feature maps to the target feature space to calculate similarity. Thus, the parameters of the generative adversarial network model can be updated using similarity as a constraint. According to this application embodiment, the technical problem of inaccurate image segmentation results for categories with few labeled sample images in related technologies can be solved, improving the accuracy of image segmentation for categories with few labeled sample images.
[0112] An optional specific implementation of the training method for the image segmentation model provided in the embodiments of this application is described below with reference to Figure 2.
[0113] The first module is model training. This section focuses on training the GAN model.
[0114] Specifically, firstly, by inputting an image into a pre-trained graph convolutional neural network model that identifies the first type of object set, visual feature maps of the first type of object set (known categories) are obtained.
[0115] Furthermore, the text category information (i.e., text labels) of the first type of object set is converted into word vectors through a word vectorization generator (e.g., a word embedding model), and the word vectors of the first type of object set are input into the GAN network to obtain the semantic feature map of the first type of object set.
[0116] Next, after obtaining the visual feature maps and semantic feature maps of the first set of objects, the GAN network model can be updated to achieve the training of the GAN network model.
[0117] The second module is classifier tuning. In this module, the parameters of the classifier are adjusted based on the recombined feature map.
[0118] Specifically, after the GAN network model has been updated in the first module, updated semantic feature maps are generated with the updated GAN network model. From these updated semantic feature maps, the semantic feature maps of the newly added zero-shot categories, that is, categories that have no corresponding visual feature map, are selected.
[0119] The visual feature maps of the known categories are then recombined with the semantic feature maps of the newly added zero-shot categories, the recombined feature map is input into the classifier for discrimination, and the classifier is updated (adjusted) based on the discrimination results.
[0120] Finally, the classifier can output the image segmentation results.
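The recombination step of the second module can be sketched as follows. This is a minimal illustration only: the function name `recombine`, the dictionary layout, and the tiny two-element feature vectors are assumptions made for the example and do not appear in the application.

```python
def recombine(visual_maps, semantic_maps):
    """Keep the visual maps of known categories and add semantic maps only
    for categories that have no visual map (the zero-shot categories)."""
    combined = dict(visual_maps)
    for label, feature_map in semantic_maps.items():
        if label not in combined:
            combined[label] = feature_map
    return combined


# Hypothetical per-category features, shortened to two values for readability.
visual_maps = {"road": [0.9, 0.1], "building": [0.2, 0.8]}
semantic_maps = {"road": [0.8, 0.2], "building": [0.3, 0.7], "lake": [0.5, 0.5]}

combined = recombine(visual_maps, semantic_maps)
# Categories contributed only by the GAN-generated semantic maps.
new_categories = sorted(set(combined) - set(visual_maps))
```

In this sketch the known categories keep their visual features, and only the newly added category is represented by a generated semantic feature, which matches the selection described in paragraph [0118].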
[0121] The image segmentation model training method provided in this application does not require manual annotation of samples for newly added categories. Furthermore, the loss function metric between the semantic feature maps generated by the GAN network and the visual feature maps generated by the graph convolutional neural network model employs a different mapping strategy than related technologies. Instead of directly calculating the difference between feature maps, it maps these two feature maps to the target feature space using linear or nonlinear transformations for similarity loss calculation. This loss function construction method allows the GAN network feature map generation to converge faster and better. In addition, by introducing the graph convolutional neural network model, the extracted visual feature maps can maintain stable relative positional relationships, and the graph structure allows for more accurate pixel category prediction.
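One possible way to write the similarity constraint described above is given below. It is a sketch under stated assumptions: the application does not name a specific similarity measure, so cosine similarity is assumed here, and the symbols $C$, $V_c$, $S_c$, $w_c$, $G$, $f_v$ and $f_s$ are introduced only for illustration.

```latex
% C   : the first type of object set (known categories)
% V_c : visual feature map of category c from the graph convolutional network
% S_c : semantic feature map of category c, S_c = G(w_c), generated by the
%       GAN generator G from the word vector w_c
% f_v, f_s : linear or nonlinear transformations into the target feature space
\mathcal{L}_{\mathrm{sim}}
  = \frac{1}{|C|} \sum_{c \in C}
    \left( 1 - \frac{\langle f_v(V_c),\, f_s(S_c) \rangle}
                    {\lVert f_v(V_c) \rVert \, \lVert f_s(S_c) \rVert} \right)
```

Under this assumed form, the loss is zero when every transformed semantic feature points in the same direction as its transformed visual counterpart, and the difference is measured in the target feature space rather than directly between the raw feature maps.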
[0122] Figure 3 shows a schematic flowchart of an image segmentation method according to an embodiment of this application. This image segmentation method performs image segmentation using an image segmentation model trained by the image segmentation model training method provided in this application. For any part of the image segmentation method not described in detail here, refer to the relevant descriptions of the image segmentation model training method provided in this application; they are not repeated here.
[0123] As shown in Figure 3, the image segmentation method may specifically include the following steps:
[0124] Step 201: Extract features of the first type of object set in the target image using the graph convolutional neural network model in the image segmentation model to obtain multiple visual feature maps.
[0125] Step 202: Obtain multiple semantic feature maps of the second type of object set through the generative adversarial network model in the image segmentation model.
[0126] The second type of object set includes the first type of object set and at least one new object outside the first type of object set.
[0127] Step 203: Combine multiple visual feature maps of the first type of object set and multiple semantic feature maps of the second type of object set to obtain a combined feature map.
[0128] Step 204: The target image is classified at the pixel level by a classifier based on the combined feature map to obtain the image segmentation result of the target image.
[0129] The image segmentation result indicates, for each pixel in the target image, the category in the first type of object set or the second type of object set to which that pixel belongs.
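Step 204 can be illustrated with a small pixel-level classification sketch. The application does not specify the classifier, so a simple stand-in is assumed here: each pixel feature is scored against one vector per category by dot product and the best-scoring category is chosen. The names `classify_pixels`, `pixel_features`, and `class_prototypes`, and all numeric values, are hypothetical.

```python
def classify_pixels(pixel_features, class_prototypes):
    """Label every pixel with the category whose prototype vector gives the
    highest dot product with that pixel's feature vector."""
    labels = sorted(class_prototypes)
    result = []
    for row in pixel_features:
        out_row = []
        for feat in row:
            scores = [
                sum(a * b for a, b in zip(feat, class_prototypes[label]))
                for label in labels
            ]
            out_row.append(labels[scores.index(max(scores))])
        result.append(out_row)
    return result


# A hypothetical 2x2 image with a 3-dimensional feature per pixel.
pixel_features = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.9, 0.1, 0.0]],
]
# One vector per category of the second type of object set; "lake" stands for
# a newly added category represented only by a generated semantic feature.
class_prototypes = {
    "road": [1.0, 0.0, 0.0],
    "building": [0.0, 1.0, 0.0],
    "lake": [0.0, 0.0, 1.0],
}

segmentation = classify_pixels(pixel_features, class_prototypes)
```

The output has the same height and width as the input, with one category label per pixel, which is the form of segmentation result described for Step 204.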
[0130] Optionally, multiple semantic feature maps of the second type of object set are obtained through a generative adversarial network model in the image segmentation model, including:
[0131] The text labels of the second type of object set are mapped to word vectors using a word embedding model;
[0132] Normalize the word vectors to obtain the word vectors of the text labels of the second type of object set;
[0133] By using a generative adversarial network model in the image segmentation model, the word vectors of the text labels of the second type of object set are converted into multiple semantic feature maps of the second type of object set.
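The mapping and normalization of text labels can be sketched as below. The application only states that a normalization operation is performed, so L2 normalization is assumed here; the lookup table `embedding_table` stands in for a real word embedding model and its values are invented for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Hypothetical stand-in for a word embedding model: label -> raw word vector.
embedding_table = {"road": [3.0, 4.0], "lake": [0.0, 2.0]}

# Normalized word vectors that would then be fed to the GAN generator.
word_vectors = {label: l2_normalize(vec) for label, vec in embedding_table.items()}
```

Normalizing in this way puts the word vectors of all labels on a common scale before they are converted into semantic feature maps.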
[0134] The image segmentation method of this application embodiment can segment images using an image segmentation model trained by an image segmentation model training method. By recombining multiple visual feature maps of a first type of object set and multiple semantic feature maps of a second type of object set to obtain a combined feature map, and performing image segmentation using the combined feature map, more accurate segmentation results can be obtained using the visual feature map of the first type of object. This solves the technical problem in related technologies where the image segmentation results for categories with few labeled sample images are not accurate enough, and improves the accuracy of image segmentation for categories with few labeled sample images.
[0135] Figure 4 shows a schematic structural diagram of a training apparatus for an image segmentation model according to an embodiment of this application. The training apparatus can be used to execute the training method for the image segmentation model provided in the embodiments of this application. For parts of the apparatus embodiments not described in detail, refer to the descriptions in the embodiments of the training method.
[0136] As shown in Figure 4, the training apparatus for the image segmentation model provided in this application embodiment includes an extraction unit 11, a conversion unit 12, a calculation unit 13, and an update unit 14.
[0137] The extraction unit 11 is used to extract features of the first type of object set in the sample image through the pre-trained graph convolutional neural network model in the image segmentation model, and obtain multiple visual feature maps; wherein, the first type of object set includes multiple objects in the sample image whose segmentation results are pre-annotated;
[0138] The conversion unit 12 is used to convert the word vectors of the text labels of the first type of object set into multiple semantic feature maps of the first type of object set through the generative adversarial network model in the image segmentation model; wherein, the word vectors of the text labels are the word vectors obtained by mapping the text labels;
[0139] The calculation unit 13 is used to map the multiple visual feature maps and the multiple semantic feature maps to the target feature space respectively through the loss function of the generative adversarial network model to calculate the similarity;
[0140] The update unit 14 is used to update the parameters of the generative adversarial network model with the similarity as a constraint.
[0141] Optionally, the calculation unit 13 may include:
[0142] The first mapping subunit is used to map multiple visual feature maps to the target feature space through a first nonlinear transformation to obtain a first feature vector.
[0143] The second mapping subunit is used to map multiple semantic feature maps of the first type of object set to the target feature space through a second nonlinear transformation to obtain a second feature vector;
[0144] The first calculation subunit is used to calculate the similarity between the first feature vector and the second feature vector.
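The three subunits above can be sketched together as follows. This is an illustrative assumption rather than the application's implementation: each nonlinear transformation is modeled as a single affine layer followed by tanh, cosine similarity is assumed as the similarity measure, and all weights and feature values are made up.

```python
import math


def affine_tanh(vector, weights, bias):
    """One-layer nonlinear map: tanh(W @ vector + b)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


visual_feature = [0.5, -0.2, 0.1]  # hypothetical flattened visual feature map
semantic_feature = [0.4, 0.3]      # hypothetical flattened semantic feature map

# Hypothetical parameters of the first and second nonlinear transformations;
# both map into the same 2-dimensional target feature space.
W1, b1 = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]], [0.0, 0.1]
W2, b2 = [[0.5, 0.5], [-0.5, 1.0]], [0.0, 0.0]

first_vector = affine_tanh(visual_feature, W1, b1)
second_vector = affine_tanh(semantic_feature, W2, b2)

similarity = cosine_similarity(first_vector, second_vector)
loss = 1.0 - similarity  # smaller when the two vectors are better aligned
```

Because the two inputs have different dimensions, they cannot be compared directly; the two transformations bring them into a shared space where a single similarity value can be computed and used as the constraint.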
[0145] Optionally, the device may further include:
[0146] A segmentation unit, used to train the graph convolutional neural network model on multiple sample images to perform image segmentation on the multiple objects of the first type of object set, so as to obtain the pre-trained graph convolutional neural network model, before the features of the first type of object set in the sample image are extracted through the pre-trained graph convolutional neural network model in the image segmentation model.
[0147] Accordingly, the extraction unit 11 may include:
[0148] A deletion subunit, used to remove the fully connected layers from the pre-trained graph convolutional neural network model;
[0149] An extraction subunit, used to extract features of the first type of object set in the sample image through the graph convolutional neural network model with the fully connected layers removed.
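The idea of removing the fully connected layers so that the remaining network acts as a feature extractor can be sketched as below. The model is represented here as a plain ordered list of `(kind, name)` pairs, which is an assumption for illustration only and not the representation used by any particular deep learning framework.

```python
def strip_fully_connected(layers):
    """Return a copy of the layer list with the trailing fully connected head removed."""
    trimmed = list(layers)
    while trimmed and trimmed[-1][0] == "fc":
        trimmed.pop()
    return trimmed


# Hypothetical pre-trained model: two graph convolution layers and a two-layer head.
pretrained_layers = [
    ("graph_conv", "gc1"),
    ("graph_conv", "gc2"),
    ("fc", "fc1"),
    ("fc", "fc2"),
]

# What remains outputs feature maps instead of class scores.
feature_extractor = strip_fully_connected(pretrained_layers)
```

The original layer list is left unchanged, so the pre-trained model can still be used in full elsewhere.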
[0150] Optionally, the device may further include:
[0151] The acquisition unit is used to acquire the text labels of the first type of object set before converting the word vectors of the text labels of the first type of object set into multiple semantic feature maps of the first type of object set through the generative adversarial network model in the image segmentation model.
[0152] The mapping unit is used to map the text labels of the first type of object set into word vectors through the word embedding model;
[0153] The normalization unit is used to perform a normalization operation on the word vectors to obtain the word vectors of the text labels.
[0154] Optionally, after updating the parameters of the generative adversarial network model with similarity as a constraint, the updating unit 14 can also update the parameters of the pixel-level classifier in the image segmentation model using multiple visual feature maps and multiple semantic feature maps of the first class of objects in the sample image; wherein, the pixel-level classifier is used to output the segmentation result based on multiple visual feature maps and multiple semantic feature maps.
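A parameter update of the pixel-level classifier can be sketched as a single gradient step. The application does not specify the classifier or its loss, so a linear softmax classifier trained with cross-entropy is assumed here; `sgd_step`, the learning rate, and the sample feature are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_step(weights, feature, target_index, lr=0.5):
    """One cross-entropy gradient step for a linear softmax classifier.
    Returns the updated weights and the loss measured before the update."""
    scores = [sum(w * x for w, x in zip(row, feature)) for row in weights]
    probs = softmax(scores)
    updated = []
    for k, row in enumerate(weights):
        grad_scale = probs[k] - (1.0 if k == target_index else 0.0)
        updated.append([w - lr * grad_scale * x for w, x in zip(row, feature)])
    return updated, -math.log(probs[target_index])


# Three categories, two-dimensional features, weights initialized to zero.
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
feature, target = [1.0, 0.0], 0  # hypothetical feature of one pixel and its category

weights, loss_before = sgd_step(weights, feature, target)
_, loss_after = sgd_step(weights, feature, target)
```

In practice the features passed to such a step would come from the visual and semantic feature maps described above, and the step would be repeated over many pixels.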
[0155] The training apparatus for the image segmentation model in this application embodiment extracts features of a first type of object set from a sample image using a pre-trained graph convolutional neural network model in the image segmentation model, obtaining multiple visual feature maps. The first type of object set includes multiple objects in the sample image whose segmentation results are pre-annotated. The generative adversarial network model in the image segmentation model converts the word vectors of the text labels of the first type of object set into multiple semantic feature maps. Then, the loss function of the generative adversarial network model maps the multiple visual feature maps and multiple semantic feature maps to the target feature space to calculate similarity. Thus, the parameters of the generative adversarial network model can be updated using similarity as a constraint. According to this application embodiment, the technical problem of inaccurate image segmentation results for categories with few labeled sample images in related technologies can be solved, improving the accuracy of image segmentation for categories with few labeled sample images.
[0156] Figure 5 shows a schematic structural diagram of an image segmentation apparatus according to an embodiment of this application. The image segmentation apparatus can be used to execute the image segmentation method provided in the embodiments of this application. For parts of the apparatus embodiments not described in detail, refer to the descriptions in the embodiments of the image segmentation method.
[0157] As shown in Figure 5, the image segmentation apparatus provided in this application embodiment includes an extraction unit 21, an acquisition unit 22, a combination unit 23, and a classification unit 24.
[0158] The image segmentation device is used to perform image segmentation using an image segmentation model trained by the image segmentation model training method provided in the embodiments of this application.
[0159] Extraction unit 21 is used to extract features of the first type of object set in the target image through the graph convolutional neural network model in the image segmentation model, and obtain multiple visual feature maps;
[0160] The acquisition unit 22 is used to acquire multiple semantic feature maps of the second type of object set through the generative adversarial network model in the image segmentation model; wherein, the second type of object set includes the first type of object set and at least one new object outside the first type of object set;
[0161] Combination unit 23 is used to combine multiple visual feature maps of the first type of object set and multiple semantic feature maps of the second type of object set to obtain a combined feature map;
[0162] The classification unit 24 is used to perform pixel-level classification of the target image based on the combined feature map by the classifier to obtain the image segmentation result of the target image; wherein, the image segmentation result represents the category of each pixel in the target image belonging to the first object set or the second object set.
[0163] Optionally, the acquisition unit 22 may include:
[0164] The mapping subunit is used to map text labels of the second type of object set to word vectors through a word embedding model;
[0165] The normalization subunit is used to perform a normalization operation on the word vectors to obtain the word vectors of the text labels of the second type of object set;
[0166] The transformation subunit is used to transform the word vectors of text labels of the second type of object set into multiple semantic feature maps of the second type of object set through the generative adversarial network model in the image segmentation model.
[0167] The image segmentation apparatus of this application embodiment can segment images using an image segmentation model trained by an image segmentation model training method. By recombining multiple visual feature maps of a first type of object set and multiple semantic feature maps of a second type of object set to obtain a combined feature map, and performing image segmentation using the combined feature map, more accurate segmentation results can be obtained using the visual feature maps of the first type of object. This solves the technical problem in related technologies where image segmentation results for categories with few labeled sample images are not accurate enough, and improves the accuracy of image segmentation for categories with few labeled sample images.
[0168] Figure 6 shows a schematic diagram of the hardware structure of the electronic device provided in an embodiment of this application.
[0169] The electronic device may include a processor 301 and a memory 302 storing program instructions.
[0170] Specifically, the processor 301 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement the embodiments of this application.
[0171] Memory 302 may include mass storage for data or instructions. By way of example and not limitation, memory 302 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, memory 302 may include removable or non-removable (or fixed) media. Where appropriate, memory 302 may be internal or external to the electronic device. In a particular embodiment, memory 302 is non-volatile solid-state memory.
[0172] In a particular embodiment, memory 302 includes read-only memory (ROM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these.
[0173] Memory may include read-only memory (ROM), random access memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, and electrical, optical, or other physical / tangible memory storage devices. Therefore, typically, memory includes one or more tangible (non-transitory) readable storage media (e.g., memory devices) encoded with software including computer-executable instructions, and when the software is executed (e.g., by one or more processors), it is operable to perform the operations described with reference to the method according to one aspect of this application.
[0174] The processor 301 implements any of the methods described in the above embodiments by reading and executing program instructions stored in the memory 302.
[0175] In one example, the electronic device may also include a communication interface 303 and a bus 310. As shown in Figure 6, the processor 301, the memory 302, and the communication interface 303 are connected through the bus 310 and communicate with each other over it.
[0176] The communication interface 303 is mainly used to realize communication between various modules, devices, units and / or equipment in the embodiments of this application.
[0177] Bus 310 includes hardware, software, or both, that couples components of an electronic device together. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or other suitable buses, or combinations of two or more of these. Where appropriate, bus 310 may include one or more buses. Although specific buses are described and illustrated in embodiments of this application, this application contemplates any suitable bus or interconnect.
[0178] In conjunction with the methods in the above embodiments, this application embodiment can provide a readable storage medium for implementation. This readable storage medium stores program instructions; when executed by a processor, these program instructions implement any of the methods in the above embodiments.
[0179] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.
[0180] The functional blocks shown in the above-described structural diagram can be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. Programs or code segments can be stored on a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried on a carrier wave. "Machine-readable medium" can include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, etc. Code segments can be downloaded via computer networks such as the Internet, intranets, etc.
[0181] It should also be noted that the exemplary embodiments mentioned in this application describe methods or systems based on a series of steps or apparatus. However, this application is not limited to the order of the above steps; that is, the steps can be performed in the order mentioned in the embodiments, or in a different order, or several steps can be performed simultaneously.
[0182] The aspects of this application have been described above with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and program products according to embodiments of this application. It should be understood that each block in the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by program instructions. These program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to create a machine such that these instructions, executable via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions / actions specified in one or more blocks of the flowchart illustrations and / or block diagrams. Such a processor can be, but is not limited to, a general-purpose processor, a special-purpose processor, a special application processor, or a field-programmable logic circuit. It is also understood that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can also be implemented by special-purpose hardware performing the specified functions or actions, or can be implemented by a combination of special-purpose hardware and computer instructions.
[0183] The above description is merely a specific implementation of this application. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, modules, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here. It should be understood that the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the protection scope of this application.
Claims
1. A training method for an image segmentation model, characterized in that it comprises: extracting features of a first type of object set in a sample image through a pre-trained graph convolutional neural network model in the image segmentation model to obtain multiple visual feature maps, wherein the first type of object set includes multiple objects in the sample image whose segmentation results are pre-annotated; converting, through the generative adversarial network model in the image segmentation model, the word vectors of the text labels of the first type of object set into multiple semantic feature maps of the first type of object set, wherein the word vectors of the text labels are word vectors obtained by mapping the text labels; mapping, through the loss function of the generative adversarial network model, the multiple visual feature maps and the multiple semantic feature maps to the target feature space respectively to calculate the similarity; and updating the parameters of the generative adversarial network model using the similarity as a constraint; wherein, after updating the parameters of the generative adversarial network model using the similarity as a constraint, the method further comprises: recombining the multiple visual feature maps of the first type of object set with the semantic feature maps of the objects in a second type of object set other than the first type of object set, and using the recombined pixel-level feature map as input information of a pixel-level classifier model to update the parameters of the pixel-level classifier model, wherein the second type of object set includes the first type of object set and at least one new object outside the first type of object set.
2. The method according to claim 1, characterized in that, The step of mapping the multiple visual feature maps and the multiple semantic feature maps to the target feature space and calculating similarity using the loss function of the generative adversarial network model includes: The first feature vector is obtained by mapping the plurality of visual feature maps to the target feature space through a first nonlinear transformation. The second feature vector is obtained by mapping multiple semantic feature maps of the first type of object set to the target feature space through a second nonlinear transformation. Calculate the similarity between the first feature vector and the second feature vector.
3. The method according to claim 1, characterized in that, Before extracting features of the first class of objects in the sample image using a graph convolutional neural network model pre-trained in an image segmentation model, the method further includes: The graph convolutional neural network model is trained by multiple sample images, and the images of multiple objects in the first type of object set are segmented respectively to obtain the pre-trained graph convolutional neural network model. The process of extracting features of the first class of objects from sample images using a pre-trained graph convolutional neural network model includes: Remove the fully connected layers from the pre-trained graph convolutional neural network model; Features of the first class of objects in the sample images are extracted using a graph convolutional neural network model with the fully connected layers removed.
4. The method according to claim 1, characterized in that, before converting the word vectors of the text labels of the first type of object set into multiple semantic feature maps of the first type of object set through the generative adversarial network model in the image segmentation model, the method further includes: obtaining the text labels of the first type of object set; mapping the text labels of the first type of object set to word vectors using a word embedding model; and performing a normalization operation on the word vectors to obtain the word vectors of the text labels.
5. The method according to claim 1, characterized in that, After updating the parameters of the generative adversarial network model using the aforementioned similarity as a constraint, the method further includes: The parameters of the pixel-level classifier in the image segmentation model are updated using the multiple visual feature maps and multiple semantic feature maps of the first type of object set in the sample image; wherein, the pixel-level classifier is used to output the segmentation result based on the multiple visual feature maps and multiple semantic feature maps.
6. An image segmentation method, characterized in that, The image segmentation method is used to perform image segmentation using an image segmentation model trained by the training method of any one of claims 1-5, and the image segmentation method includes: By using the graph convolutional neural network model in the image segmentation model, features of the first type of object set in the target image are extracted to obtain multiple visual feature maps; The generative adversarial network model in the image segmentation model is used to obtain multiple semantic feature maps of the second type of object set; wherein, the second type of object set includes the first type of object set and at least one new object outside the first type of object set; By combining multiple visual feature maps of the first type of object set and multiple semantic feature maps of the second type of object set, a combined feature map is obtained; The target image is classified at the pixel level by a classifier based on the combined feature map to obtain the image segmentation result of the target image; wherein the image segmentation result represents the category of each pixel in the target image belonging to the first object set or the second object set.
7. The method according to claim 6, characterized in that, The step of obtaining multiple semantic feature maps of the second type of object set through the generative adversarial network model in the image segmentation model includes: The text labels of the second type of object set are mapped to word vectors using a word embedding model; Normalize the word vectors to obtain the word vectors of the text labels of the second type of object set; The generative adversarial network model in the image segmentation model converts the word vectors of the text labels of the second type of object set into multiple semantic feature maps of the second type of object set.
8. A training device for an image segmentation model, characterized in that it comprises: an extraction unit, used to extract features of a first type of object set in a sample image through a pre-trained graph convolutional neural network model in the image segmentation model to obtain multiple visual feature maps, wherein the first type of object set includes multiple objects in the sample image whose segmentation results are pre-annotated; a conversion unit, used to convert the word vectors of the text labels of the first type of object set into multiple semantic feature maps of the first type of object set through the generative adversarial network model in the image segmentation model, wherein the word vectors of the text labels are word vectors obtained by mapping the text labels; a calculation unit, used to map the multiple visual feature maps and the multiple semantic feature maps to the target feature space respectively through the loss function of the generative adversarial network model to calculate the similarity; and an update unit, used to update the parameters of the generative adversarial network model with the similarity as a constraint; wherein the update unit is further configured to recombine the multiple visual feature maps of the first type of object set with the semantic feature maps of the objects in a second type of object set other than the first type of object set, and to use the recombined pixel-level feature map as input information of a pixel-level classifier model to update the parameters of the pixel-level classifier model, wherein the second type of object set includes the first type of object set and at least one new object outside the first type of object set.
9. An image segmentation apparatus, characterized in that, The image segmentation device is used to perform image segmentation using an image segmentation model trained by the training method of any one of claims 1-5, and the image segmentation device comprises: The extraction unit is used to extract features of the first type of object set in the target image through the graph convolutional neural network model in the image segmentation model to obtain multiple visual feature maps; The acquisition unit is used to acquire multiple semantic feature maps of a second type of object set through the generative adversarial network model in the image segmentation model; wherein the second type of object set includes a first type of object set and at least one new object outside the first type of object set; The combination unit is used to combine multiple visual feature maps of the first type of object set and multiple semantic feature maps of the second type of object set to obtain a combined feature map; A classification unit is used to perform pixel-level classification of the target image based on the combined feature map by a classifier to obtain an image segmentation result of the target image; wherein the image segmentation result represents the category of each pixel in the target image belonging to the first object set or the second object set.
10. The apparatus according to claim 9, characterized in that the acquisition unit includes: a mapping subunit, used to map the text labels of the second type of object set into word vectors through a word embedding model; a normalization subunit, used to perform a normalization operation on the word vectors to obtain the word vectors of the text labels of the second type of object set; and a transformation subunit, used to transform the word vectors of the text labels of the second type of object set into multiple semantic feature maps of the second type of object set through the generative adversarial network model in the image segmentation model.
11. An electronic device, characterized in that, The electronic device includes: a processor and a memory storing computer program instructions; When the processor executes the computer program instructions, it implements the method as described in any one of claims 1-7.
12. A readable storage medium, characterized in that, The readable storage medium stores computer program instructions that, when executed by a processor, implement the method as described in any one of claims 1-7.
Citation Information
Patent Citations
Model training method and device, image processing method and device, computer equipment and storage medium
CN112132197A