Attention-based few-shot segmentation method, device, terminal, and medium
By using an attention-based few-shot segmentation method, complementary prototypes are generated and image segmentation is performed using an FPN-structured decoder. This solves the problems of large data requirements and low accuracy in semantic segmentation of deep learning models, and achieves high-precision few-shot segmentation results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PENG CHENG LAB
- Filing Date
- 2022-12-08
- Publication Date
- 2026-06-30
AI Technical Summary
Existing deep learning models require a large amount of labeled data for semantic segmentation, which makes data collection difficult. Furthermore, traditional methods have low accuracy in judging pixel features and prototype categories, especially when the appearance of the category is variable and the pose is different.
We employ a few-shot segmentation method based on an attention mechanism. Features are extracted by a weight-sharing encoder and used to generate complementary prototypes, which are concatenated with the query features and passed to an FPN-structured decoder that predicts foreground and background. Bilinear interpolation then restores the result to the original image size, and the complementary prototypes reduce classification errors caused by prototype bias.
It improves the accuracy of pixel feature and prototype category judgment, reduces classification errors, and can effectively segment images under small sample conditions.
[Figure: CN116258937B_ABST]
Description
Technical Field
[0001] This invention relates to the field of deep learning technology, and in particular to a few-shot segmentation method, device, terminal, and medium based on an attention mechanism.
Background Technology
[0002] With the development of deep learning and the explosive growth of large datasets, deep learning-based methods have demonstrated powerful representation and generalization capabilities, rapidly spreading across many subfields of computer vision and achieving outstanding results. However, deep learning models are data-hungry, and semantic segmentation in particular often requires large amounts of precisely labeled pixel-level data. This makes data collection difficult and limits the application of deep learning models in semantic segmentation and related fields. For example, in computer-aided diagnostic systems, deep network models require extensive labeling, but doctors must dedicate most of their time to patients and lack the resources and manpower for data labeling. Few-shot semantic segmentation methods therefore hold great promise for solving this problem, as they can achieve good segmentation results without requiring a large amount of labeled data.
[0003] Existing mainstream methods often use feature extraction and mask pooling algorithms to allow the model to extract the prototype corresponding to the category, and then use K-nearest neighbor classification or discriminant functions to determine whether the pixel features and the prototype belong to the same category. Traditional methods using a single semantic category are insufficient to represent the implicit semantic information in the foreground objects. When encountering categories with varied appearances, as well as different parts or poses shown in different images, this representation method becomes inadequate, making it impossible to accurately determine whether the pixel features and the prototype belong to the same category.
[0004] Therefore, existing technologies still need improvement.
Summary of the Invention
[0005] In view of the defects of the prior art, the present invention provides a few-shot segmentation method, device, terminal, and medium based on an attention mechanism, to solve the technical problem that existing methods have low accuracy in judging whether pixel features and prototypes belong to the same category.
[0006] The technical solution adopted by this invention to solve the technical problem is as follows:
[0007] In a first aspect, the present invention provides a few-shot segmentation method based on an attention mechanism, comprising:
[0008] inputting a support image, the mask image corresponding to the support image, and a query image to be predicted;
[0009] extracting the features of the query image to be predicted and the features of the support image, respectively, through a weight-sharing encoder;
[0010] inputting the features of the support image and the mask image corresponding to the support image into a prototype generation algorithm to obtain a pair of complementary prototypes;
[0011] expanding the pair of complementary prototypes to the size of the query image to be predicted and concatenating them with the query image to be predicted, then predicting the foreground and background of the concatenated result with an FPN-structured decoder to obtain a segmentation result;
[0012] restoring the segmentation result to the original image size using a bilinear interpolation algorithm to obtain the few-shot segmentation result.
[0013] In one implementation, before inputting the support image, the mask image corresponding to the support image, and the query image to be predicted, the method further includes:
[0014] training according to a meta-learning strategy, sampling tasks from the training set to construct pairs of support samples and query samples;
[0015] simulating the few-shot learning process based on the support samples and query samples.
[0016] In one implementation, the loss function used during training is the cross-entropy loss function:
[0017]
[0018] where $y_i$ represents the mask label of image $i$, and $p_i$ represents the segmentation prediction result for the corresponding query image.
[0019] In one implementation, ResNet50 with pre-trained ImageNet parameters is used as the backbone network during training, and the parameters of ResNet50 are fixed during training.
[0020] In one implementation, ResNet50 includes four convolutional blocks, each outputting features that correspond to a different level of semantic representation, and dilated convolutions are used instead of pooling layers.
[0021] In one implementation, the step of inputting the features of the supporting image and the mask image corresponding to the supporting image into a prototype generation algorithm to obtain a pair of complementary prototypes includes:
[0022] The features of the support image are denoted as $F \in \mathbb{R}^{H \times W \times C}$, and the mask image corresponding to the support image is denoted as $M \in \mathbb{R}^{H \times W}$;
[0023] Based on the features and mask so denoted, a pair of complementary prototypes is extracted to collect feature information of the foreground region;
[0024] The feature information of the foreground region is aggregated into two complementary clusters.
[0025] In one implementation, the step of extracting a pair of complementary prototypes based on the denoted features and collecting feature information of the foreground region includes:
[0026] Using the support image features $F$ and the mask $M$, the support image features that belong to the background region are filtered out, denoted as:
[0027] $$F' = F \odot M$$
[0028] where $\odot$ represents element-wise multiplication, and $F'$ represents the feature after masking.
[0029] Initialize the prototype using masked average pooling (MAP):
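[0030] $$P_0 = \frac{\sum_{i=1}^{H}\sum_{j=1}^{W} F'_{i,j}}{\sum_{i=1}^{H}\sum_{j=1}^{W} M_{i,j}}$$
[0031] where $P_0$ represents the initial prototype, $i$ and $j$ represent the coordinates of each pixel, and $H$ and $W$ represent the width and height of the feature $F'$;
[0032] $\sum_{i,j} M_{i,j}$, with $M_{i,j} \in \{0, 1\}$, represents the area of the foreground region.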
[0033] In one implementation, aggregating the feature information of the foreground region into two complementary clusters includes:
[0034] For each iteration $t$, the cosine distance matrix between the prototype and the target feature $F'$ is calculated:
[0035]
[0036] where $S^t \in \mathbb{R}^{H \times W}$ represents the similarity metric matrix generated in the $t$-th iteration.
[0037] The cosine distance matrix is normalized:
[0038]
[0039] where $i$ and $j$ represent the coordinates of each pixel, and $H$ and $W$ represent the width and height of the feature $F'$, respectively.
[0040] In one implementation, extending the pair of complementary prototypes to the size of the query image to be predicted and concatenating them with the query image to be predicted includes:
[0041] Weights are assigned using a weight fusion algorithm;
[0042] The pair of complementary prototypes are expanded to the size of the query image to be predicted according to the assigned weights, and then stitched together with the query image to be predicted.
[0043] In one implementation, the weight allocation via a weight fusion algorithm includes:
[0044] Given a support image $X_i$, extract the prototype features of the support image $X_i$ as $P_i = \{P_{1,i}, P_{2,i}\}$; calculate the cosine distance between the prototypes of the support image $X_i$ and the prototypes of the other support images, and redistribute the weights based on the cosine distance:
[0045]
[0046]
[0047]
[0048] where $S_i$ represents the similarity between the $i$-th prototype and the prototypes from other sources, $W_i$ represents the weight of the prototype during fusion, estimated from the similarity, and $P_{merge}$ represents the prototype after fusion.
[0049] In a second aspect, the present invention provides a few-shot segmentation device based on an attention mechanism, comprising:
[0050] The input module is used to input the supporting image, the mask image corresponding to the supporting image, and the image to be predicted.
[0051] The extraction module is used to extract features of the query image to be predicted and features of the supporting images respectively through a weight-sharing encoder;
[0052] The prototype generation module is used to input the features of the supporting image and the mask image corresponding to the supporting image into the prototype generation algorithm to obtain a pair of complementary prototypes;
[0053] The segmentation module is used to extend the pair of complementary prototypes to the size of the query image to be predicted, and to stitch them together with the query image to be predicted. The foreground and background of the stitched image are predicted by the decoder of the FPN structure to obtain the segmentation result.
[0054] The size restoration module is used to restore the segmentation result to the original image size using a bilinear interpolation algorithm, thereby obtaining a small sample segmentation result.
[0055] In a third aspect, the present invention provides a terminal, comprising a processor and a memory, wherein the memory stores a few-shot segmentation program based on an attention mechanism, and the program, when executed by the processor, implements the operations of the few-shot segmentation method based on an attention mechanism as described in the first aspect.
[0056] In a fourth aspect, the present invention also provides a medium, which is a computer-readable storage medium storing a few-shot segmentation program based on an attention mechanism, and the program, when executed by a processor, implements the operations of the few-shot segmentation method based on an attention mechanism as described in the first aspect.
[0057] The present invention, by employing the above technical solution, has the following effects:
[0058] This invention extracts features from the query image to be predicted and from the support images using a weight-sharing encoder. It then inputs the features of the support images and their corresponding mask images into a prototype generation algorithm to obtain a pair of complementary prototypes. These complementary prototypes are expanded to the size of the query image to be predicted and concatenated with it. A decoder with an FPN structure predicts the foreground and background of the concatenated result, producing the segmentation result. This invention can identify pixel features that are easily overlooked and those that are easily retained, and constructs a pair of prototypes in a weighted complementary manner. This approach retains as much effective information as possible during prototype generation, reduces classification errors caused by prototype bias, and improves accuracy.
Attached Figure Description
[0059] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the structures shown in these drawings without creative effort.
[0060] Figure 1 is a flowchart of a few-shot segmentation method based on an attention mechanism in one implementation of the present invention.
[0061] Figure 2 is a network structure framework diagram in one implementation of the present invention.
[0062] Figure 3 is a schematic diagram of image feature extraction in one implementation of the present invention.
[0063] Figure 4 is a schematic diagram of a complementary prototype generation algorithm in one implementation of the present invention.
[0064] Figure 5 is a functional schematic diagram of the terminal in one implementation of the present invention.
[0065] The realization of the objective, functional features and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0066] To make the objectives, technical solutions, and advantages of this invention clearer and more explicit, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
[0067] Exemplary methods
[0068] Existing mainstream methods often use feature extraction and mask pooling algorithms to allow the model to extract the prototype corresponding to the category, and then use K-nearest neighbor classification or discriminant functions to determine whether the pixel features and the prototype belong to the same category. Traditional methods using a single semantic category are insufficient to represent the implicit semantic information in the foreground objects. When encountering categories with varied appearances, as well as different parts or poses shown in different images, this representation method becomes inadequate, making it impossible to accurately determine whether the pixel features and the prototype belong to the same category.
[0069] To address the aforementioned technical issues, this embodiment provides a few-sample segmentation method based on an attention mechanism. This method is a multi-prototype generation method based on an attention mechanism. By analyzing the features extracted from the supporting image, it identifies pixel features that are easily ignored and those that are easily retained. A pair of prototypes is constructed in a complementary weighting manner, thereby preserving as much effective information as possible during prototype generation and reducing classification errors caused by prototype bias.
[0070] As shown in Figure 1, this embodiment of the invention provides a few-shot segmentation method based on an attention mechanism, comprising the following steps:
[0071] Step S100: Input the supporting image, the mask image corresponding to the supporting image, and the image to be predicted.
[0072] In this embodiment, the attention-based few-sample segmentation method is applied to a terminal, which includes, but is not limited to, devices such as computers.
[0073] This embodiment primarily employs a discriminant function to determine distance. Such methods first use an encoding network to extract features from both the support and query images. Then, by combining the feature maps of the support images with their corresponding mask labels, different strategies are employed to generate prototypes. This is a crucial step in few-shot semantic segmentation models; therefore, this embodiment proposes a novel algorithm based on the discriminant function principle.
[0074] The method in this embodiment proposes a simple and effective complementary prototype generation network to address the problem of low accuracy in pixel feature and prototype identification. Specifically, the complementary prototype generation network in this embodiment includes a complementary prototype generation (CPG) algorithm that extracts complementary prototypes for each image from a support set, and a weighted fusion (WM) algorithm that fuses the complementary prototypes of multiple images.
[0075] Specifically, in one implementation of this embodiment, the following steps are included before step S100:
[0076] Step S010: Train according to the meta-learning strategy, sample tasks from the training set, and construct support sample and query sample pairs;
[0077] Step S020: Simulate the small sample learning process based on the support samples and query samples.
[0078] In this embodiment, the model used is a semantic segmentation model that can learn quickly. This model can quickly learn to segment a new category from support set images and their corresponding annotations. In this embodiment, the dataset is processed according to the meta-learning task setting, given images from sets C_seen and C_unseen of different categories.
[0079] During model training, the training set $D_{train}$ is constructed from $C_{seen}$, and the test set $D_{test}$ is constructed from $C_{unseen}$. In this embodiment, the segmentation model $M$ is trained on $D_{train}$ and evaluated on $D_{test}$. Both $D_{train}$ and $D_{test}$ consist of multiple meta-tasks. Each meta-task consists of a set of annotated support images $S$ and a set of query images $Q$ (the number of query images is set to 1 in this embodiment), that is, $D_{train} = \{(S_i, Q_i)\}_{i=1}^{N_{train}}$ and $D_{test} = \{(S_i, Q_i)\}_{i=1}^{N_{test}}$, where $N_{train}$ and $N_{test}$ represent the number of meta-tasks for training and testing, respectively.
[0080] Each training or testing meta-task is a C-way K-shot segmentation learning task, where C represents the number of classes in the support images and K represents the number of samples per class. That is, the support set $S_i$ contains C semantic categories, each with K <image, label> pairs, and the query set $Q_i$ typically contains one query image that the model needs to predict. The model is trained on the base categories, learns how to extract effective knowledge from the support set, and then applies that knowledge to the segmentation of the query set.
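The C-way K-shot task construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_episode`, the dictionary layout `images_by_class`, and the choice to draw the query pairs per class are hypothetical and are not taken from the patent text.

```python
import random

def sample_episode(images_by_class, n_way, k_shot, n_query=1, rng=None):
    """Sample one C-way K-shot meta-task: a support set S_i and a query set Q_i."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for cls in classes:
        # Draw K support pairs plus the query pairs, without overlap.
        picks = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(cls, pair) for pair in picks[:k_shot]]
        query += [(cls, pair) for pair in picks[k_shot:]]
    return support, query

# Toy data: each class maps to hypothetical (image_id, mask_id) pairs.
toy = {c: [(f"{c}_img{n}", f"{c}_mask{n}") for n in range(5)]
       for c in ("cat", "dog", "bus")}
support, query = sample_episode(toy, n_way=1, k_shot=2, rng=random.Random(0))
```

With `n_way=1` and `n_query=1`, the episode has K support pairs and a single query image, which matches the setting of one query image per meta-task described in this embodiment.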
[0081] This embodiment proposes an attention-based multi-prototype generation mechanism, which preserves the semantic associations between different parts while compensating for the loss of distinctive semantic information caused by a single prototype. Compared with multi-part prototype generation approaches, the method in this embodiment does not destroy the semantic associations between different parts, and therefore still achieves a certain performance improvement under the 1-shot setting. The overall algorithm framework is shown in Figure 2; the trained network model can be used directly for few-shot semantic segmentation tasks.
[0082] As shown in Figure 2, the network model in this embodiment uses a meta-learning strategy for learning and testing.
[0083] In terms of training method, this embodiment adopts a general meta-learning strategy. First, tasks are sampled from the training set to construct support and query sample pairs that simulate few-shot learning. The loss function used during training is the cross-entropy loss function, which is expressed as:
[0084]
[0085] where $y_i$ represents the mask label of image $i$, and $p_i$ represents the segmentation prediction result for the corresponding query image.
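The loss formula itself is not reproduced in this text. As a rough illustration, the sketch below assumes that the loss is a pixel-wise binary cross-entropy between the mask label y and the predicted foreground probability p; the function name `pixel_cross_entropy` and the averaging over pixels are assumptions, not the patent's stated formula.

```python
import numpy as np

def pixel_cross_entropy(y, p, eps=1e-7):
    """Mean binary cross-entropy between mask labels y and foreground probabilities p."""
    y = np.asarray(y, dtype=np.float64)
    # Clip probabilities so that log() never receives exactly 0 or 1.
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

y = np.array([[1, 0], [0, 1]])             # toy 2x2 mask label
good = np.array([[0.9, 0.1], [0.2, 0.8]])  # prediction close to the label
bad = 1.0 - good                           # prediction far from the label
loss_good = pixel_cross_entropy(y, good)
loss_bad = pixel_cross_entropy(y, bad)
```

A prediction that agrees with the mask yields a lower loss than one that contradicts it, which is the behaviour the training objective relies on.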
[0086] In this embodiment, the algorithm uses a feature extractor to obtain feature representations from different layers of a convolutional neural network for feature matching. This embodiment uses a ResNet50 with pre-trained ImageNet parameters as the backbone network, and its parameters are fixed during training.
[0087] Previous research on CNN feature visualization has shown that low-level output features are often associated with low-level visual features, such as edges and colors, while high-level features are associated with instance-level concepts, such as object categories. In few-shot scenarios, the few-shot model needs to be able to adapt to any unseen object. Therefore, the model may not necessarily learn the semantic representation of new categories during training. To allow the model to effectively utilize pre-trained knowledge, this embodiment uses intermediate-layer features for new category recognition in its feature extraction part. This is because high-level features contain more class-related information than intermediate-level features, and since new categories have not appeared during training, using high-level features leads to weaker generalization of new categories.
[0088] As shown in Figure 3, ResNet is divided into four convolutional blocks, whose output features correspond to different levels of semantic representation. In this embodiment, the algorithm fuses the features from Block-2 and Block-3 for feature comparison. To reduce the resolution loss caused by downsampling, this embodiment uses dilated convolutions instead of pooling layers, so all features after Block-2 retain 1/8 of the input size. The features from Block-2 and Block-3 are then merged together and encoded into a 256-dimensional feature matrix using a 1×1 kernel.
[0089] As shown in Figure 1, in one implementation of this invention, the few-shot segmentation method based on the attention mechanism further includes the following step:
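The merge-and-encode step can be illustrated with NumPy. This is a minimal sketch under assumptions: the merge is taken to be channel-wise concatenation, the channel counts 512 and 1024 are assumed values for the two blocks, and the weights are random placeholders rather than trained parameters.

```python
import numpy as np

def fuse_block_features(f2, f3, weight, bias):
    """Concatenate two (H, W, C) feature maps on the channel axis and apply a 1x1 convolution."""
    merged = np.concatenate([f2, f3], axis=-1)   # (H, W, C2 + C3)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    return merged @ weight + bias                # (H, W, 256)

rng = np.random.default_rng(0)
h, w = 8, 8                                  # e.g. 1/8 of a 64x64 input
f2 = rng.standard_normal((h, w, 512))        # assumed Block-2 channel count
f3 = rng.standard_normal((h, w, 1024))       # assumed Block-3 channel count
weight = rng.standard_normal((512 + 1024, 256)) * 0.01  # placeholder 1x1 kernel
bias = np.zeros(256)
fused = fuse_block_features(f2, f3, weight, bias)
```

Because both blocks keep the same spatial size (1/8 of the input), they can be merged without any resampling.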
[0090] Step S200: Extract the features of the query image to be predicted and the features of the supporting images respectively through a weight-sharing encoder.
[0091] In this embodiment, during the model's inference process, the input includes a support image and its corresponding mask image, as well as a query image to be predicted. A weight-sharing encoder then extracts features from the query image and the support image. After extracting these features, the features of the support image and its corresponding mask label are fed as input into the prototype generation algorithm proposed in this embodiment, resulting in a pair of complementary prototypes.
[0092] As shown in Figure 1, in one implementation of this invention, the few-shot segmentation method based on the attention mechanism further includes the following step:
[0093] Step S300: Input the features of the supporting image and the mask image corresponding to the supporting image into the prototype generation algorithm to obtain a pair of complementary prototypes.
[0094] As shown in Figure 4, in this embodiment, the complementary prototype generation algorithm is used to extract comprehensive category-related semantic information from the support set (i.e., the support images). Through global average pooling with the support masks, the algorithm first extracts an average prototype, then iteratively calculates the cosine distance between the foreground features and the average prototype to obtain their attention weights on the support images, and updates the prototype. In this way, the model can determine which parts of the information the prototype focuses on and which parts it ignores. A pair of prototypes is then generated from this pair of complementary attention weights to represent the regions of focus and neglect. By generating the prototypes from attention weight maps, this embodiment preserves the correlation between the generated prototypes and avoids a certain degree of information loss.
[0095] Specifically, in one implementation of this embodiment, step S300 includes the following steps:
[0096] Step S301: Denote the features of the support image as $F \in \mathbb{R}^{H \times W \times C}$, and denote the mask image corresponding to the support image as $M \in \mathbb{R}^{H \times W}$;
[0097] Step S302: Extract a pair of complementary prototypes based on the features and mask so denoted, and collect feature information of the foreground region.
[0098] In one implementation of this embodiment, step S302 includes the following steps:
[0099] Step S302a: Using the support image features $F$ and the mask $M$, filter out the support image features that belong to the background region. This is denoted as:
[0100] $$F' = F \odot M$$
[0101] where $\odot$ represents element-wise multiplication, and $F'$ represents the feature after masking.
[0102] Step S302b: Initialize the prototype using masked average pooling (MAP):
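[0103] $$P_0 = \frac{\sum_{i=1}^{H}\sum_{j=1}^{W} F'_{i,j}}{\sum_{i=1}^{H}\sum_{j=1}^{W} M_{i,j}}$$
[0104] where $P_0$ represents the initial prototype, $i$ and $j$ represent the coordinates of each pixel, and $H$ and $W$ represent the width and height of the feature $F'$;
[0105] $\sum_{i,j} M_{i,j}$, with $M_{i,j} \in \{0, 1\}$, represents the area of the foreground region.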
[0106] In this embodiment, a complementary prototype generation algorithm based on an attention mechanism is used to capture discriminative features in foreground objects and to reduce the prototype bias caused by representing a category with a single prototype. In the first step, the algorithm takes the feature map of the support image, extracted by the image feature extraction network, together with the corresponding label as input, and extracts a pair of complementary prototypes to collect feature information of the foreground region.
[0107] In the first step, in this embodiment, the supporting image features belonging to the background region are filtered out based on the feature map F of the supporting image and its label M. Then, the prototype is initialized by masked average pooling (MAP) to obtain the initial prototype.
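The masking and masked average pooling of this first step can be sketched as follows. The function name `masked_average_pooling` and the toy arrays are illustrative assumptions; the computation follows the description above (mask the features, sum over all pixels, divide by the foreground area).

```python
import numpy as np

def masked_average_pooling(features, mask):
    """features: (H, W, C); mask: (H, W) with values in {0, 1}. Returns a (C,) prototype."""
    masked = features * mask[..., None]   # F' = F ⊙ M, background pixels become zero
    area = mask.sum()                     # foreground area, the sum of M over all pixels
    if area == 0:
        raise ValueError("mask has no foreground pixels")
    return masked.sum(axis=(0, 1)) / area

features = np.arange(2 * 2 * 3, dtype=np.float64).reshape(2, 2, 3)
mask = np.array([[1, 0], [0, 1]], dtype=np.float64)
p0 = masked_average_pooling(features, mask)   # mean of the two foreground pixels
```

In this toy case the foreground pixels are `[0, 1, 2]` and `[9, 10, 11]`, so the initial prototype is their mean, `[4.5, 5.5, 6.5]`.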
[0108] Specifically, in one implementation of this embodiment, step S300 further includes the following steps:
[0109] Step S303: Aggregate the feature information of the foreground region into two complementary clusters.
[0110] In one implementation of this embodiment, step S303 includes the following steps:
[0111] Step S303a: For each iteration $t$, calculate the cosine distance matrix between the prototype and the target feature $F'$:
[0112]
[0113] where $S^t \in \mathbb{R}^{H \times W}$ represents the similarity metric matrix generated in the $t$-th iteration.
[0114] Step S303b: Normalize the cosine distance matrix:
[0115]
[0116] where $i$ and $j$ represent the coordinates of each pixel, and $H$ and $W$ represent the width and height of the feature $F'$, respectively.
[0117] In this embodiment, in the second step, the foreground features are aggregated into two complementary clusters. Specifically, for each iteration, the cosine distance matrix between the prototype and target features obtained in the previous step is first calculated, and then the obtained cosine distance matrix is normalized.
[0118] Since ReLU is used as the activation function in the decoder in this embodiment, when the similarity between the feature map $F'$ and the prototype is calculated pixel by pixel, the cosine distance is restricted to the range [0, 1]. To compute the weight contribution of the different features to the initial prototype, this embodiment normalizes the matrix $S^t$; the normalization formula is shown above.
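The similarity and normalization formulas are not reproduced in this text, so the following is only a sketch under assumptions: the per-pixel similarity is taken to be the standard cosine similarity, and the normalization is assumed to divide each foreground entry by the sum over all pixels. The names `cosine_map` and `normalize_map` are hypothetical.

```python
import numpy as np

def cosine_map(prototype, masked_features, eps=1e-8):
    """Cosine similarity between a (C,) prototype and every pixel of (H, W, C) features."""
    dots = masked_features @ prototype
    norms = np.linalg.norm(masked_features, axis=-1) * np.linalg.norm(prototype)
    return dots / (norms + eps)   # zeroed background pixels give similarity 0

def normalize_map(sim, mask, eps=1e-8):
    """Rescale foreground similarities so they sum to 1 (assumed sum normalization)."""
    sim = sim * mask
    return sim / (sim.sum() + eps)

rng = np.random.default_rng(0)
features = np.abs(rng.standard_normal((4, 4, 8)))   # non-negative, as after ReLU
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
masked = features * mask[..., None]
prototype = masked.sum(axis=(0, 1)) / mask.sum()     # initial prototype via MAP
sim = cosine_map(prototype, masked)
weights = normalize_map(sim, mask)
```

Because the features are non-negative, every similarity lies in [0, 1], consistent with the range stated in this embodiment, and the normalized map can be read as an attention weight per foreground pixel.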
[0119] In this embodiment, in the third step, after the iteration is completed, based on the weights of the final similarity metric matrix S^t, the prototype-related information that is easily retained and the prototype-related information that is easily lost are estimated. The calculation method is as follows:
[0120]
[0121]
[0122] Here, $P_1$ and $P_2$ represent the pair of complementary prototypes generated by the algorithm.
[0123] It should be noted that, unlike previous work, the prototype generation method in this embodiment does not cluster the foreground support features into multiple parts. One advantage is that the correlation between the prototypes of different parts is not reduced. Instead, the method compensates for the loss of semantic information caused by a single prototype by introducing this complementary similarity-based attention mechanism, thereby alleviating the model's prototype bias problem.
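The three steps (initialization, iterative attention, complementary pair) can be put together in one hedged sketch. The formulas for the update and for the pair of prototypes are not reproduced in this text, so everything marked "assumed" below is an assumption: the prototype is updated as the attention-weighted sum of foreground features, the first prototype uses the normalized similarity as weights, and the second uses the normalized complement (1 minus similarity). The function name and the iteration count are hypothetical.

```python
import numpy as np

def complementary_prototypes(features, mask, n_iter=3, eps=1e-8):
    """Return (p1, p2, attn, comp) for (H, W, C) features and an (H, W) binary mask."""
    if n_iter < 1:
        raise ValueError("n_iter must be at least 1")
    masked = features * mask[..., None]                      # F' = F ⊙ M
    proto = masked.sum(axis=(0, 1)) / (mask.sum() + eps)     # initial prototype via MAP
    for _ in range(n_iter):
        norms = np.linalg.norm(masked, axis=-1) * np.linalg.norm(proto)
        sim = np.clip((masked @ proto) / (norms + eps), 0.0, 1.0) * mask
        attn = sim / (sim.sum() + eps)                       # assumed sum normalization
        proto = (attn[..., None] * masked).sum(axis=(0, 1))  # assumed update rule
    comp = (1.0 - sim) * mask                                # assumed complementary weights
    comp = comp / (comp.sum() + eps)
    p1 = (attn[..., None] * masked).sum(axis=(0, 1))         # information the prototype retains
    p2 = (comp[..., None] * masked).sum(axis=(0, 1))         # information the prototype tends to lose
    return p1, p2, attn, comp

rng = np.random.default_rng(0)
features = np.abs(rng.standard_normal((6, 6, 16)))   # non-negative, as after ReLU
mask = np.zeros((6, 6))
mask[1:5, 2:5] = 1.0
p1, p2, attn, comp = complementary_prototypes(features, mask)
```

Both prototypes are weighted averages over the same foreground pixels, so no pixel is assigned to a separate part; only the weighting differs, which is the property the paragraph above contrasts with part-based clustering.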
[0124] As shown in Figure 1, in one implementation of this invention, the few-shot segmentation method based on the attention mechanism further includes the following step:
[0125] Step S400: Extend the pair of complementary prototypes to the size of the query image to be predicted, and stitch them together with the query image to be predicted. Use the decoder of the FPN structure to predict the foreground and background of the stitched image to obtain the segmentation result.
[0126] In this embodiment, a pair of complementary prototypes are extended to the size of the query image to be predicted by a weighted fusion algorithm (i.e., the prototype fusion module of K-shot), and then stitched together with the query image to be predicted, thereby realizing the fusion process between the two.
[0127] In few-shot scenarios there are often multiple support images, which is known as the K-shot setting. A simple approach is to guide the model to segment the query image with each support image separately, and then combine the different predictions using strategies such as averaging or taking the maximum. However, this method is inefficient because the inference time increases with the number of support images.
[0128] To efficiently and effectively fuse information from different supporting samples, existing methods employ an averaging strategy to fuse prototypes from different sources, namely:
[0129] $$P_1 = \frac{1}{K}\sum_{j=1}^{K} P_{1,j}$$
[0130] $$P_2 = \frac{1}{K}\sum_{j=1}^{K} P_{2,j}$$
[0131] Here, $P_1$ and $P_2$ represent the complementary prototypes after fusion by averaging, $K$ represents the number of support images, and $j$ indexes the $j$-th support image. The fused prototypes are then used to guide the model for segmentation. However, in few-shot scenarios the value of $K$ is relatively small, and direct averaging does not help improve the quality of prototype fusion.
[0132] According to statistical principles, some low-quality prototypes acting as outliers can cause the fused prototype to deviate from the true prototype. Therefore, this embodiment proposes a new prototype fusion algorithm, namely the weighted fusion algorithm.
[0133] Specifically, in one implementation of this embodiment, step S400 includes the following steps:
[0134] Step S401: Assign weights using a weight fusion algorithm.
[0135] In one implementation of this embodiment, step S401 includes the following steps:
[0136] Step S401a: Given a support image $X_i$, extract the prototype features of the support image $X_i$ as $P_i = \{P_{1,i}, P_{2,i}\}$;
[0137] Step S401b: Calculate the cosine distance between the prototypes of the support image $X_i$ and the prototypes of the other support images, and redistribute the weights based on the cosine distance:
[0138]
[0139]
[0140]
[0141] where $S_i$ represents the similarity between the $i$-th prototype and the prototypes from other sources, $W_i$ represents the weight of the prototype during fusion, estimated from the similarity, and $P_{merge}$ represents the prototype after fusion.
[0142] In the weighted fusion algorithm of this embodiment, given a support image $X_i$, its prototype features $P_i = \{P_{1,i}, P_{2,i}\}$ can be extracted. The weights are redistributed by calculating the cosine distance between this prototype and the prototypes of the other support images, as shown above. In this way, the method in this embodiment can make more effective use of the different prototypes to generate higher-quality prototypes.
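The three fusion formulas are not reproduced in this text. The sketch below is one plausible reading, with every choice labeled as an assumption: the similarity of prototype i is taken as its mean cosine similarity to the other K-1 prototypes, the weights are a softmax over those similarities, and the fused prototype is the weighted sum. The function name `weighted_fusion` is hypothetical, and it would be applied separately to the first and second prototypes of each pair.

```python
import numpy as np

def weighted_fusion(prototypes, eps=1e-8):
    """prototypes: (K, C), one per support image. Returns the fused (C,) prototype and (K,) weights."""
    protos = np.asarray(prototypes, dtype=np.float64)
    k = protos.shape[0]
    if k == 1:
        return protos[0], np.ones(1)
    unit = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + eps)
    cos = unit @ unit.T                                # (K, K) pairwise cosine similarity
    s = (cos.sum(axis=1) - np.diag(cos)) / (k - 1)     # assumed S_i: mean similarity to the others
    w = np.exp(s - s.max())
    w /= w.sum()                                       # assumed W_i: softmax over S_i
    return w @ protos, w                               # fused prototype as weighted sum

# Two similar prototypes and one outlier (toy 2-D values).
protos = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
fused, weights = weighted_fusion(protos)
plain_mean = protos.mean(axis=0)
```

In this toy case the outlier receives the smallest weight, so the fused prototype stays closer to the two agreeing prototypes than the plain average does, which is the motivation given above for replacing direct averaging.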
[0143] Specifically, in one implementation of this embodiment, step S400 further includes the following steps:
[0144] Step S402: Expand the pair of complementary prototypes to the size of the query image to be predicted according to the assigned weights, and stitch them together with the query image to be predicted.
[0145] In this embodiment, the weights from the weighted fusion algorithm described above are used to expand the pair of complementary prototypes to the size of the query feature map, after which they are concatenated with it. The concatenated features are then fed into the FPN-structured decoder, which acts as a metric function and provides multi-scale segmentation, to predict the foreground and background of the query image.
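The expansion and concatenation step can be sketched as follows. The channel-first layout (C, H, W), concatenation along the channel axis, and the function name `expand_and_concat` are assumptions for illustration; the FPN-structured decoder that consumes the result is not shown.

```python
import numpy as np

def expand_and_concat(query_feat: np.ndarray,
                      proto_a: np.ndarray,
                      proto_b: np.ndarray) -> np.ndarray:
    """Tile two prototypes over the query feature map and concatenate.

    query_feat: (C, H, W) query features (assumed channel-first layout).
    proto_a, proto_b: (C,) complementary prototypes, already fused.
    Returns an array of shape (3C, H, W).
    """
    c, h, w = query_feat.shape
    # Repeat each prototype vector at every spatial position.
    tiled_a = np.broadcast_to(proto_a[:, None, None], (c, h, w))
    tiled_b = np.broadcast_to(proto_b[:, None, None], (c, h, w))
    return np.concatenate([query_feat, tiled_a, tiled_b], axis=0)

# Hypothetical sizes: 4 channels on a 3 x 5 feature map.
query = np.zeros((4, 3, 5))
pa = np.arange(4, dtype=float)
pb = -np.arange(4, dtype=float)
stacked = expand_and_concat(query, pa, pb)
```

Each spatial location of the result then carries the query feature together with both prototypes, which is what allows a decoder to act as a learned metric between them.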
[0146] As shown in Figure 1, in one implementation of this invention, the few-sample segmentation method based on the attention mechanism further includes the following steps:
[0147] Step S500: The segmentation result is restored to the original image size using a bilinear interpolation algorithm to obtain a small sample segmentation result.
[0148] Finally, in this embodiment, bilinear interpolation is used to process the segmentation results and restore them to the original image size for model evaluation. In this embodiment, a pair of prototypes is constructed in a weighted complementary manner to retain as much effective information as possible during the prototype generation process, reduce classification errors caused by prototype bias, and improve judgment accuracy.
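Restoring a low-resolution score map to the original image size by bilinear interpolation can be sketched as below. The corner-aligned sampling grid and the function name `bilinear_resize` are assumptions; the embodiment does not state which sampling convention it uses, and deep learning frameworks provide equivalent built-in operations.

```python
import numpy as np

def bilinear_resize(score: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize a 2-D score map (h, w) to (out_h, out_w) bilinearly.

    ASSUMPTION: corner-aligned grid, i.e. the corner pixels of the input
    map exactly onto the corner pixels of the output.
    """
    in_h, in_w = score.shape
    ys = np.linspace(0, in_h - 1, out_h)      # fractional source rows
    xs = np.linspace(0, in_w - 1, out_w)      # fractional source columns
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)         # clamp at the border
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                   # vertical blend weights
    wx = (xs - x0)[None, :]                   # horizontal blend weights
    top = score[y0][:, x0] * (1 - wx) + score[y0][:, x1] * wx
    bottom = score[y1][:, x0] * (1 - wx) + score[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

# Hypothetical 2 x 2 foreground score map upsampled to 3 x 3.
small = np.array([[0.0, 1.0],
                  [2.0, 3.0]])
restored = bilinear_resize(small, 3, 3)
```

In practice the interpolation is applied to the decoder's foreground and background scores, and the per-pixel class is taken afterwards so that the mask edges follow the interpolated scores.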
[0149] This embodiment achieves the following technical effects through the above technical solution:
[0150] This embodiment extracts features from the query image to be predicted and features from supporting images using a weight-shared encoder. The features from the supporting images and their corresponding mask images are then input into a prototype generation algorithm to obtain a pair of complementary prototypes. These complementary prototypes are then extended to the size of the query image to be predicted and concatenated with it. A decoder with an FPN structure predicts the foreground and background of the concatenated image, resulting in a segmentation result. This embodiment can analyze pixel features that are easily overlooked and those that are easily retained, constructing a pair of prototypes in a weighted complementary manner. This approach maximizes the retention of effective information during prototype generation, reduces classification errors caused by prototype bias, and improves judgment accuracy.
[0151] Exemplary device
[0152] Based on the above embodiments, the present invention also provides a few-sample segmentation device based on an attention mechanism, comprising:
[0153] The input module is used to input the supporting image, the mask image corresponding to the supporting image, and the image to be predicted.
[0154] The extraction module is used to extract features of the query image to be predicted and features of the supporting images respectively through a weight-sharing encoder;
[0155] The prototype generation module is used to input the features of the supporting image and the mask image corresponding to the supporting image into the prototype generation algorithm to obtain a pair of complementary prototypes;
[0156] The segmentation module is used to extend the pair of complementary prototypes to the size of the query image to be predicted, and to stitch them together with the query image to be predicted. The foreground and background of the stitched image are predicted by the decoder of the FPN structure to obtain the segmentation result.
[0157] The size restoration module is used to restore the segmentation result to the original image size using a bilinear interpolation algorithm, thereby obtaining a small sample segmentation result.
[0158] Based on the above embodiments, the present invention also provides a terminal, a schematic block diagram of which may be as shown in Figure 5.
[0159] The terminal includes: a processor, a memory, an interface, a display screen, and a communication module connected via a system bus; wherein, the processor of the terminal provides computing and control capabilities; the memory of the terminal includes a storage medium and internal memory; the storage medium stores the operating system and computer programs; the internal memory provides an environment for the operation of the operating system and computer programs in the storage medium; the interface is used to connect to external devices, such as mobile terminals and computers; the display screen is used to display relevant information; and the communication module is used to communicate with a cloud server or mobile terminal.
[0160] When executed by the processor, this computer program is used to implement a few-shot segmentation method based on an attention mechanism.
[0161] It will be understood by those skilled in the art that the schematic diagram shown in Figure 5 is merely a partial structural diagram related to the present invention and does not constitute a limitation on the terminal to which the present invention is applied. A specific terminal may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0162] In one embodiment, a terminal is provided, comprising: a processor and a memory, the memory storing an attention-based few-shot segmentation program, which, when executed by the processor, is used to implement the operation of the attention-based few-shot segmentation method as described above.
[0163] In one embodiment, a storage medium is provided, wherein the storage medium stores an attention-based few-shot segmentation program, which, when executed by a processor, is used to implement the operation of the attention-based few-shot segmentation method described above.
[0164] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, database, or other media used in the embodiments provided by this invention can include non-volatile and / or volatile memory.
[0165] In summary, this invention provides a few-shot segmentation method, apparatus, terminal, and medium based on an attention mechanism. The method includes: inputting a support image, a mask image corresponding to the support image, and a query image to be predicted; extracting features from the query image to be predicted and features from the support image using a weight-shared encoder; inputting the features from the support image and the mask image corresponding to the support image into a prototype generation algorithm to obtain a pair of complementary prototypes; expanding the pair of complementary prototypes to the size of the query image to be predicted and concatenating them with the query image to be predicted; predicting the foreground and background of the concatenated image using an FPN-structured decoder to obtain the segmentation result; and restoring the segmentation result to the original image size using a bilinear interpolation algorithm to obtain the few-shot segmentation result. This invention constructs a pair of prototypes in a weighted complementary manner, thereby preserving as much effective information as possible during prototype generation, reducing classification errors caused by prototype bias, and improving judgment accuracy.
[0166] It should be understood that the application of the present invention is not limited to the examples above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.
Claims
1. A few-sample segmentation method based on an attention mechanism, characterized in that it comprises: inputting a supporting image, a mask image corresponding to the supporting image, and a query image to be predicted; extracting features of the query image to be predicted and features of the supporting image respectively through a weight-sharing encoder; inputting the features of the supporting image and the mask image corresponding to the supporting image into a prototype generation algorithm to obtain a pair of complementary prototypes; extending the pair of complementary prototypes to the size of the query image to be predicted and stitching them together with the query image to be predicted; predicting the foreground and background of the stitched image through a decoder with an FPN structure to obtain a segmentation result; and restoring the segmentation result to the original image size using a bilinear interpolation algorithm to obtain a small sample segmentation result; wherein the step of extending the pair of complementary prototypes to the size of the query image to be predicted and stitching them together with the query image to be predicted comprises: assigning weights using a weight fusion algorithm; and expanding the pair of complementary prototypes to the size of the query image to be predicted according to the assigned weights, and stitching them together with the query image to be predicted.
2. The attention-based few-sample segmentation method according to claim 1, characterized in that, before inputting the supporting image, the mask image corresponding to the supporting image, and the query image to be predicted, the method further comprises: training based on a meta-learning strategy, which samples tasks from the training set to construct pairs of support samples and query samples; and simulating the small sample learning process based on the support samples and query samples.
3. The attention-based few-sample segmentation method according to claim 2, characterized in that the loss function used during training is the cross-entropy loss function, computed between the mask label of query image i and the segmentation prediction result for the corresponding query image.
4. The attention-based few-sample segmentation method according to claim 2, characterized in that, during training, ResNet50 with parameters pre-trained on ImageNet is used as the backbone network, and the parameters of ResNet50 are fixed during training.
5. The attention-based few-sample segmentation method according to claim 1, characterized in that, ResNet50 consists of four convolutional blocks, each outputting features corresponding to different levels of semantic representation, and uses dilated convolutions instead of pooling layers.
6. The few-sample segmentation method based on attention mechanism according to claim 1, characterized in that the step of inputting the features of the supporting image and the mask image corresponding to the supporting image into the prototype generation algorithm to obtain a pair of complementary prototypes includes: denoting the features of the supporting image as F and the mask image corresponding to the supporting image as M; extracting a pair of complementary prototypes based on the marked features and collecting feature information of the foreground region; and aggregating the feature information of the foreground region into two complementary clusters.
7. The attention-based few-shot segmentation method according to claim 6, characterized in that the step of extracting a pair of complementary prototypes based on the marked features and collecting feature information of the foreground region includes: filtering the background region out of the supporting image features F using the mask image M, the features after masking being $F \odot M$, where ⊙ denotes element-wise multiplication; and initializing the prototype using masked average pooling (MAP), in which the masked features are summed over all pixel coordinates (i, j) and divided by the sum of M, where i and j represent the coordinates of each pixel, H and W represent the height and width of the features, respectively, and the sum of M represents the area of the foreground region.
8. The attention-based few-shot segmentation method according to claim 6, characterized in that the step of aggregating the feature information of the foreground region into two complementary clusters includes: for each iteration t, calculating the cosine distance matrix between the prototype and the target features, which gives the similarity metric matrix generated in the t-th iteration; and normalizing the cosine distance matrix, where i and j represent the coordinates of each pixel, and H and W represent the height and width of the features, respectively.
9. The attention-based few-sample segmentation method according to claim 1, characterized in that assigning weights using the weight fusion algorithm includes: given a supporting image X_i, extracting the prototype features of the supporting image X_i as $P_i = \{P_{1,i}, P_{2,i}\}$; and calculating the cosine distance between the prototype of the supporting image X_i and the prototypes of the other supporting images, and redistributing the weights based on the cosine distance, where $S_i$ represents the similarity between the i-th prototype and the prototypes from other sources, $W_i$ represents the weight of the prototype during fusion, estimated from that similarity, and $P_{merge}$ represents the prototype after fusion.
10. A few-sample segmentation device based on an attention mechanism, used to implement the few-sample segmentation method based on an attention mechanism as described in any one of claims 1-9, characterized in that it comprises: an input module, used to input the supporting image, the mask image corresponding to the supporting image, and the query image to be predicted; an extraction module, used to extract features of the query image to be predicted and features of the supporting image respectively through a weight-sharing encoder; a prototype generation module, used to input the features of the supporting image and the mask image corresponding to the supporting image into the prototype generation algorithm to obtain a pair of complementary prototypes; a segmentation module, used to extend the pair of complementary prototypes to the size of the query image to be predicted and to stitch them together with the query image to be predicted, the foreground and background of the stitched image being predicted by the decoder of the FPN structure to obtain the segmentation result; and a size restoration module, used to restore the segmentation result to the original image size using a bilinear interpolation algorithm, thereby obtaining a small sample segmentation result.
11. A terminal, characterized in that it comprises: a processor and a memory, wherein the memory stores a few-shot segmentation program based on an attention mechanism, which, when executed by the processor, is used to implement the operation of the few-shot segmentation method based on an attention mechanism as described in any one of claims 1-9.
12. A medium, characterized in that the medium is a computer-readable storage medium that stores a few-shot segmentation program based on an attention mechanism, and the program, when executed by a processor, is used to implement the operation of the few-shot segmentation method based on an attention mechanism as described in any one of claims 1-9.
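For reference, the masked average pooling (MAP) initialization recited in claim 7 can be sketched as follows. The channel-first layout, the small constant `eps` guarding against an empty mask, and the function name `masked_average_pooling` are assumptions for illustration and are not part of the claims.

```python
import numpy as np

def masked_average_pooling(feat: np.ndarray,
                           mask: np.ndarray,
                           eps: float = 1e-8) -> np.ndarray:
    """Compute an initial prototype from support features and a mask.

    feat: (C, H, W) support features F (assumed channel-first layout).
    mask: (H, W) binary foreground mask M at the feature resolution.
    Returns a (C,) prototype: sum of masked features / foreground area.
    """
    masked = feat * mask[None, :, :]          # element-wise product, background zeroed
    area = mask.sum()                         # number of foreground positions
    return masked.sum(axis=(1, 2)) / (area + eps)

# Hypothetical 2-channel features on a 2 x 2 map with two foreground pixels.
feat = np.arange(8, dtype=float).reshape(2, 2, 2)
mask = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
prototype = masked_average_pooling(feat, mask)
# Channel 0 averages 0 and 3, channel 1 averages 4 and 7.
```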