An image segmentation method, apparatus, device and storage medium
By introducing learnable supplementary token vectors and jointly optimizing a main model and a branch (partner) model, the method addresses the sensitivity of image segmentation models to illumination changes and blurring. It improves segmentation accuracy while keeping resource requirements low, making it suitable for image segmentation applications on ordinary devices.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- WUHAN UNIV OF SCI & TECH
- Filing Date
- 2026-03-05
- Publication Date
- 2026-06-19
AI Technical Summary
Existing image segmentation models are sensitive to factors such as illumination changes and blurring, and large models are difficult to deploy on ordinary devices. Traditional distillation methods have high resource requirements and it is difficult to improve segmentation accuracy while maintaining low resource requirements.
The method introduces learnable supplementary token vectors, combines them with a main backbone network and a main decoder, and adopts a collaborative optimization strategy between a main model and a branch model. Features are extracted by the main backbone network and fused with the supplementary token vectors, and a cross-attention mechanism and a multilayer perceptron are used to improve feature representation capability.
The method significantly improves image segmentation accuracy, overcomes the effects of illumination changes and boundary blurring, and reduces resource requirements, achieving high-precision, low-cost image segmentation suitable for ordinary devices.
Figure CN122244436A_ABST
Description
Technical Field
[0001] This application relates to the field of computer vision technology, and in particular to an image segmentation method, apparatus, device and storage medium.
Background Technology
[0002] In the field of computer vision, neural network models can be used to process images and segment them into multiple semantic regions, making it easier to analyze different objects or regions within the image. However, existing models are highly sensitive to factors such as lighting variations and blurring in images. Furthermore, image segmentation requires pixel-by-pixel classification, which makes it susceptible to blurred boundaries between objects and between objects and the background. These factors pose significant challenges to the segmentation accuracy of the models.
[0003] To accurately classify pixel-level attributes, both local texture information and global contextual information need to be considered. Related techniques, such as receptive field expansion, multi-scale feature fusion, attention mechanisms, and emphasizing local information, can acquire richer semantic cues to improve the model's segmentation accuracy.
[0004] In addition to the methods mentioned above, large models with zero-shot generalization capabilities can be employed. By training on extensive segmentation datasets, these large models can learn more diverse and finer-grained features, thereby achieving remarkable segmentation performance. However, the ever-increasing number of parameters raises the hardware requirements for large models, limiting their deployment and practical application on ordinary devices. Although model size can be effectively reduced through network pruning and parameter quantization, the reduced model has difficulty capturing complex and abstract features or fully learning rich semantic information, which may lead to performance degradation and training instability.
[0005] Furthermore, knowledge distillation methods can be used to align the soft labels of teacher and student models. By leveraging output responses, hierarchical feature representations, and inter-sample dependencies, the student model, with fewer parameters, can achieve performance equal to or even surpassing that of the teacher model. However, traditional distillation methods still require a large pre-trained model as a prerequisite, resulting in significant demands on training resources.
[0006] Therefore, there is currently a lack of image segmentation methods that can overcome the above-mentioned shortcomings.
Summary of the Invention
[0007] This application provides an image segmentation method, apparatus, device, and storage medium to address the shortcomings of the aforementioned related technologies. The technical solution is as follows:
In a first aspect, this application provides an image segmentation method, comprising:
obtaining the image to be processed and the supplementary token vector obtained during training;
using the supplementary token vector and the image to be processed as inputs to the image segmentation model, wherein the image segmentation model includes a main backbone network and a main decoder, and the processing procedure of the image segmentation model includes:
performing feature extraction on the image to be processed using the main backbone network, and obtaining a fused feature map by combining the supplementary token vector; and
decoding the fused feature map with the main decoder to obtain the image segmentation result corresponding to the image to be processed.
[0008] In one alternative embodiment of the first aspect, the main backbone network includes N processing layers connected sequentially in a preset order. Each processing layer corresponds to a supplementary token module and includes K operation modules, each of which performs a preset type of operation on its input feature map. The supplementary token vector includes N sub-tokens, and each sub-token includes K-1 vectors. The step of extracting features from the image to be processed through the main backbone network and obtaining a fused feature map by combining the supplementary token vector includes:
inputting the fused feature map of the (i-1)-th processing layer into the i-th processing layer, wherein the fused feature map of the first processing layer is obtained based on the first sub-token of the supplementary token vector and the image to be processed;
in the i-th processing layer, the (j+1)-th operation module takes the output feature map of the j-th operation module as input and outputs the (j+1)-th intermediate feature map, wherein the output feature map of the first operation module of the i-th processing layer is obtained based on the fused feature map of the (i-1)-th processing layer;
the supplementary token module of the i-th processing layer processes the output feature map of the j-th operation module and the j-th vector of the i-th sub-token to obtain the j-th supplementary feature map;
the output feature map of the (j+1)-th operation module is obtained by weighted summation of the (j+1)-th intermediate feature map and the j-th supplementary feature map;
the above is repeated until the output feature map of the K-th operation module is obtained, which is used as the fused feature map of the i-th processing layer; and
the fused feature map of the i-th processing layer is used as the input of the (i+1)-th processing layer until the fused feature map of the N-th processing layer is obtained.
[0009] In one alternative embodiment of the first aspect, the step in which the supplementary token module of the i-th processing layer obtains the j-th supplementary feature map based on the output feature map of the j-th operation module and the j-th vector of the i-th sub-token includes:
obtaining the key vector and the value vector by processing the output feature map of the j-th operation module;
obtaining the query vector by processing the j-th vector of the i-th sub-token;
inputting the key vector, value vector, and query vector into the cross-attention mechanism to obtain the attention feature map; and
fusing the attention feature map and the output feature map of the j-th operation module with the multilayer perceptron to obtain the j-th supplementary feature map.
[0010] In one alternative embodiment of the first aspect, the training process of the image segmentation model specifically includes:
obtaining a pre-constructed sample set, wherein the sample set includes multiple samples, the sample input of each sample is the original image, and the sample label is the true segmentation result obtained based on the annotation of the original image;
performing training in multiple iterations based on the sample set, with main training and branch training performed in each iteration;
in the main training, obtaining the main prediction result produced by the image segmentation model based on the sample input and the supplementary token vector of the current iteration, constructing the loss function of the image segmentation model based on the main prediction result and the sample label, and performing backpropagation to update the model parameters of the image segmentation model;
in the branch training, obtaining the branch prediction result produced by the image segmentation partner model based on the sample input and the supplementary token vector of the current iteration, constructing the loss function of the image segmentation partner model based on the branch prediction result and the sample label, and performing backpropagation to update the supplementary token vector and the model parameters of the image segmentation partner model; and
performing the next iteration of training based on the updated supplementary token vector and model parameters until the model converges, and outputting the trained image segmentation model.
[0011] In one alternative embodiment of the first aspect, the image segmentation partner model includes a partner backbone network, a partner decoder, and N token query modules. Obtaining the branch prediction result produced by the image segmentation partner model based on the sample input and the supplementary token vector of the current iteration includes:
inputting each sub-token of the supplementary token vector in the current iteration into the corresponding token query module to generate the corresponding result query representation;
summing the result query representations of all token query modules to obtain the fused query vector;
performing feature extraction on the sample input through the partner backbone network, and outputting the sample feature map; and
inputting the fused query vector and the sample feature map into the partner decoder to obtain the branch prediction result.
[0012] In one alternative embodiment of the first aspect, the step of inputting each sub-token of the supplementary token vector of the current iteration into the corresponding token query module to generate a corresponding result query representation includes:
inputting the i-th sub-token of the supplementary token vector in the current iteration into the i-th token query module, which performs a linear transformation on each vector of the i-th sub-token to obtain the query vector corresponding to each vector;
calculating the maximum query vector component and the average query vector component based on all query vectors obtained from the i-th token query module; and
concatenating the maximum query vector component, the average query vector component, and the query vector corresponding to the last vector of the i-th sub-token, and linearly transforming the concatenation result to obtain the result query representation of the i-th token query module.
[0013] In one alternative embodiment of the first aspect, the loss function of the image segmentation partner model, constructed based on the branch prediction result and the sample label, is expressed as: $\mathcal{L}_{p} = \ell\big(D_{p}(B_{p}(x),\, T),\, y\big)$; where $x$ indicates the sample input, $D_{p}$ indicates the partner decoder, $B_{p}$ indicates the partner backbone network, $y$ indicates the sample label, $T$ represents the supplementary token vector, and $\ell$ denotes the segmentation loss between a prediction and its label.
[0014] In a second aspect, this application also provides an image segmentation apparatus, comprising:
a data acquisition module, used to acquire the image to be processed and the supplementary token vector obtained during training; and
an image processing module, used to take the supplementary token vector and the image to be processed as input to an image segmentation model, wherein the image segmentation model includes a main backbone network and a main decoder.
The image processing module is also used to extract features from the image to be processed through the main backbone network and obtain a fused feature map by combining the supplementary token vector.
The image processing module is further configured to decode the fused feature map using the main decoder to obtain the image segmentation result corresponding to the image to be processed.
[0015] In a third aspect, this application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method provided by the first aspect of this application or any implementation thereof.
[0016] In a fourth aspect, this application also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method provided by the first aspect of this application or any implementation thereof.
[0017] The beneficial effects of the technical solution provided in this application include at least the following: This application introduces learnable supplementary token vectors and combines the main backbone network with the main decoder. During training, a strategy of co-optimization between the main and branch models is employed, enabling the main model to effectively acquire and fuse rich feature representation patterns from the partner models. This generates a fused feature map with more comprehensive semantic information, significantly improving image segmentation accuracy. This method effectively overcomes the sensitivity of traditional models to factors such as illumination changes, image blurring, and blurred boundaries between the target and background, enabling more accurate identification of different object regions and their edges, resulting in segmentation results that better reflect real-world scenarios. Furthermore, this application does not rely on large pre-trained models or complex knowledge distillation processes; it can continue to improve accuracy after the model reaches its performance bottleneck simply through supplementary learning from the partner models. It demonstrates good universality and bidirectionality, significantly enhancing the model's semantic representation capabilities while maintaining low resource requirements, providing a new and effective approach for achieving high-precision, low-deployment-cost image segmentation.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in this application or related technologies, the drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a schematic flowchart of an image segmentation method provided in an embodiment of this application;
Figure 2 is the first schematic structural diagram of the image segmentation model provided in an embodiment of this application;
Figure 3 is the second schematic structural diagram of the image segmentation model provided in an embodiment of this application;
Figure 4 is the first schematic diagram of the prediction results of an image segmentation method provided in an embodiment of this application;
Figure 5 is the second schematic diagram of the prediction results of an image segmentation method provided in an embodiment of this application;
Figure 6 is a schematic structural diagram of an image segmentation apparatus provided in an embodiment of this application;
Figure 7 is a schematic structural diagram of the electronic device provided in an embodiment of this application.
Detailed Implementation
[0020] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0021] The terms "comprising" and "having," and any variations thereof, in the specification, claims, and accompanying drawings of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or modules is not limited to the steps or modules listed, but may optionally include steps or modules not listed, or may optionally include other steps or modules inherent to such process, method, product, or apparatus.
[0022] It should be noted that the terms "first" and "second" used in this application are merely to distinguish similar objects and do not represent a specific ordering of the objects. It should be understood that the objects distinguished by "first" and "second" can be interchanged where appropriate, so that the embodiments of this application described herein can be implemented in an order other than those described or illustrated herein.
[0023] The present application will now be described in detail with reference to specific embodiments.
[0024] An image segmentation method provided by an embodiment of this application is described below with reference to Figure 1, which shows a flowchart of the method. As shown in Figure 1, the method includes the following steps:
S101, obtaining the image to be processed and the supplementary token vector obtained from training;
S102, using the supplementary token vector and the image to be processed as input to the image segmentation model, wherein the image segmentation model includes a main backbone network and a main decoder, and the processing procedure of the image segmentation model includes:
S103, performing feature extraction on the image to be processed through the main backbone network, and obtaining a fused feature map by combining the supplementary token vector;
S104, decoding the fused feature map with the main decoder to obtain the image segmentation result corresponding to the image to be processed.
[0025] In some embodiments, Figures 2 and 3 show schematic diagrams of the structure of the image segmentation model provided in this application.
[0026] As shown in Figure 2, the main backbone network of the image segmentation model includes N processing layers connected sequentially in a preset order. Each processing layer corresponds to a supplementary token module, and the last processing layer is connected to the main decoder. The supplementary token vector $T$ includes N sub-tokens, represented as $T = \{T_1, T_2, \dots, T_N\}$ with $T_i \in \mathbb{R}^{m \times c}$, where $m = K-1$ is the length of each sub-token and $c$ represents the number of channels. As shown in Figure 3, each processing layer includes K operation modules, each of which performs a preset type of operation on its input feature map, and each sub-token includes K-1 vectors, denoted $T_i = \{t_i^{1}, \dots, t_i^{K-1}\}$.
[0027] It should be noted that the operation types of the operation module include, but are not limited to, convolution, attention mechanisms, etc., and the embodiments of this application do not limit them.
[0028] Specifically, the step of extracting features from the image to be processed through the main backbone network and obtaining a fused feature map by combining the supplementary token vector includes:
S201, the fused feature map $F_{i-1}$ of the (i-1)-th processing layer is input into the i-th processing layer. It should be noted that the fused feature map $F_1$ of the first processing layer is obtained by processing the first sub-token $T_1$ of the supplementary token vector and the image to be processed;
S202, in the i-th processing layer, the (j+1)-th operation module takes the output feature map $F_i^{j}$ of the j-th operation module as input and outputs the (j+1)-th intermediate feature map $\hat{F}_i^{j+1}$. The formula is expressed as: $\hat{F}_i^{j+1} = f_i^{j+1}(F_i^{j})$, $j = 1, \dots, K-1$; where $f_i^{j+1}$ represents the (j+1)-th operation module of the i-th processing layer.
[0029] It should be noted that, as shown in Figure 3, the first operation module of the i-th processing layer performs its operation on the fused feature map of the (i-1)-th processing layer to obtain the first intermediate feature map $\hat{F}_i^{1}$, which is used directly as the output feature map $F_i^{1}$ of the first operation module: $F_i^{1} = \hat{F}_i^{1} = f_i^{1}(F_{i-1})$;
S203, the supplementary token module of the i-th processing layer processes the output feature map $F_i^{j}$ of the j-th operation module and the j-th vector $t_i^{j}$ of the i-th sub-token to obtain the j-th supplementary feature map $G_i^{j}$. The formula is expressed as: $G_i^{j} = S_i(F_i^{j}, t_i^{j})$; where $S_i$ represents the supplementary token module of the i-th processing layer.
[0030] S204, the (j+1)-th intermediate feature map $\hat{F}_i^{j+1}$ and the j-th supplementary feature map $G_i^{j}$ are combined by weighted summation to obtain the output feature map $F_i^{j+1}$ of the (j+1)-th operation module. The formula is expressed as: $F_i^{j+1} = \hat{F}_i^{j+1} + \alpha\, G_i^{j}$; where $\alpha$ represents the weight of the supplementary feature map.
[0031] S205, the above is repeated until the output feature map $F_i^{K}$ of the K-th operation module is obtained, which is used as the fused feature map $F_i$ of the i-th processing layer;
S206, the fused feature map $F_i$ of the i-th processing layer is used as the input to the (i+1)-th processing layer until the fused feature map $F_N$ of the N-th processing layer is obtained.
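As a rough illustration of steps S201 to S206, the following Python sketch chains N processing layers, each holding K operation modules and consuming one sub-token of K-1 vectors. It is only a sketch under stated assumptions: the names `run_processing_layer`, `run_backbone`, `double`, `add_one` and `broadcast_token` are hypothetical, feature maps are reduced to plain lists of channel vectors, the supplementary token module is replaced by a trivial stand-in, and the weighted summation is assumed to use a single scalar weight `alpha`.

```python
from typing import Callable, List, Sequence

Vector = List[float]
FeatureMap = List[Vector]  # toy feature map: one channel vector per spatial position

Operation = Callable[[FeatureMap], FeatureMap]
SuppModule = Callable[[FeatureMap, Vector], FeatureMap]


def run_processing_layer(fused_prev: FeatureMap, ops: Sequence[Operation],
                         sub_token: Sequence[Vector], supp_module: SuppModule,
                         alpha: float) -> FeatureMap:
    """One processing layer: K operation modules fused with K-1 token vectors."""
    assert len(sub_token) == len(ops) - 1, "a sub-token holds K-1 vectors"
    # The first module's intermediate map is used directly as its output.
    out = ops[0](fused_prev)
    for j in range(len(ops) - 1):
        intermediate = ops[j + 1](out)               # (j+1)-th intermediate feature map
        supplement = supp_module(out, sub_token[j])  # j-th supplementary feature map
        # Weighted summation; a single scalar weight alpha is an assumption.
        out = [[a + alpha * b for a, b in zip(row_i, row_s)]
               for row_i, row_s in zip(intermediate, supplement)]
    return out  # output of the K-th module, i.e. the fused feature map of this layer


def run_backbone(image: FeatureMap, layers: Sequence[Sequence[Operation]],
                 token: Sequence[Sequence[Vector]],
                 supp_modules: Sequence[SuppModule], alpha: float) -> FeatureMap:
    """Chain N processing layers; layer i consumes the fused map of layer i-1."""
    fused = image
    for ops, sub_token, supp in zip(layers, token, supp_modules):
        fused = run_processing_layer(fused, ops, sub_token, supp, alpha)
    return fused


# Hypothetical stand-ins for real operation modules and the supplementary token module.
def double(fm: FeatureMap) -> FeatureMap:
    return [[2.0 * v for v in row] for row in fm]


def add_one(fm: FeatureMap) -> FeatureMap:
    return [[v + 1.0 for v in row] for row in fm]


def broadcast_token(fm: FeatureMap, vec: Vector) -> FeatureMap:
    return [list(vec) for _ in fm]


layers = [[double, add_one], [double, add_one]]  # N = 2 layers, K = 2 modules each
token = [[[10.0, 20.0]], [[10.0, 20.0]]]         # N sub-tokens, K-1 = 1 vector, c = 2
fused_map = run_backbone([[1.0, 2.0]], layers, token, [broadcast_token] * 2, alpha=0.5)
print(fused_map)
```

In a real model the operation modules would be convolution or attention blocks and the supplementary token module would be the cross-attention block of step S203; the loop structure is the part this sketch is meant to show.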
[0032] In some embodiments, the process of obtaining the supplementary feature map in step S203 specifically includes:
S301, the output feature map $F_i^{j}$ of the j-th operation module is processed to obtain the key vector and the value vector, expressed by the following formulas: $\mathbf{K}_i^{j} = \phi_k(F_i^{j})$, $\mathbf{V}_i^{j} = \phi_v(F_i^{j})$;
S302, the j-th vector $t_i^{j}$ of the i-th sub-token is processed to obtain the query vector, expressed by the formula: $\mathbf{Q}_i^{j} = \phi_q(t_i^{j})$; where $\phi_k$, $\phi_v$ and $\phi_q$ represent the mapping functions corresponding to the key vector, the value vector and the query vector, respectively.
[0033] Specifically, each of the mapping functions $\phi_k$ and $\phi_v$ is built from a 3×3 depthwise convolution and two 1×1 pointwise convolutions.
[0034] S303, the key vector, value vector, and query vector are input into the cross-attention mechanism to obtain the attention feature map. The formula is expressed as: $A_i^{j} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_i^{j}\,(\mathbf{K}_i^{j})^{\top}}{\sqrt{C}}\right)\mathbf{V}_i^{j}$; where $C$ represents the dimension of the key vector.
[0035] S304, the attention feature map $A_i^{j}$ and the output feature map $F_i^{j}$ of the j-th operation module are fused by the multilayer perceptron to obtain the j-th supplementary feature map. The formula is expressed as: $G_i^{j} = \mathrm{MLP}(A_i^{j}, F_i^{j})$.
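To make steps S301 to S304 more concrete, the sketch below computes a supplementary feature map with a single-query cross-attention in plain Python. Several parts are assumptions rather than details from the application: the key, value and query mappings are reduced to plain linear maps (`w_k`, `w_v`, `w_q`) instead of depthwise and pointwise convolutions, the fusion in S304 is assumed to add the attended vector at every position before a per-position MLP, and all function names are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Apply a linear map (given by the rows of w) to vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(feature_map: Matrix, token_vec: Vector,
           w_k: Matrix, w_v: Matrix, w_q: Matrix) -> Vector:
    """Cross-attention: the token vector is the query, positions give keys and values."""
    keys = [matvec(w_k, pos) for pos in feature_map]      # S301: key vectors
    values = [matvec(w_v, pos) for pos in feature_map]    # S301: value vectors
    query = matvec(w_q, token_vec)                        # S302: query vector
    scale = math.sqrt(len(query))                         # sqrt(C), C = key dimension
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    # S303: attention output is the weight-averaged value vector.
    return [sum(w * val[c] for w, val in zip(weights, values))
            for c in range(len(values[0]))]


def supplementary_feature(feature_map: Matrix, token_vec: Vector,
                          w_k: Matrix, w_v: Matrix, w_q: Matrix,
                          mlp: Callable[[Vector], Vector]) -> Matrix:
    """S304 under an assumed fusion rule: add the attended vector, then apply the MLP."""
    attended = attend(feature_map, token_vec, w_k, w_v, w_q)
    return [mlp([f + a for f, a in zip(pos, attended)]) for pos in feature_map]


def relu(v: Vector) -> Vector:
    """Stand-in for a multilayer perceptron (hypothetical)."""
    return [max(0.0, x) for x in v]


identity = [[1.0, 0.0], [0.0, 1.0]]
feature_map = [[1.0, 0.0], [0.0, 1.0]]   # two positions, c = 2 channels
token_vec = [1.0, 0.0]                   # one vector of a sub-token
attended = attend(feature_map, token_vec, identity, identity, identity)
supp = supplementary_feature(feature_map, token_vec, identity, identity, identity, relu)
print(attended, supp)
```

Because the query here is a single token vector, the attention output is one channel vector that summarizes the positions most similar to the token; how that vector is spread back over the feature map is exactly the part the fusion assumption fills in.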
[0036] In some embodiments, the training process of the image segmentation model specifically includes: S401, Obtain a pre-constructed sample set; wherein, the sample set includes multiple samples, the sample input of each sample is the original image, and the sample label is the real segmentation result obtained based on the annotation of the original image.
[0037] It should be noted that the sample set can be constructed based on the Cityscapes dataset, which is one of the most representative datasets in the field of urban street view semantic segmentation. This application does not limit this.
[0038] S402, training is performed in multiple iterations based on the sample set, with main training and branch training performed in each iteration.
In the main training, the main model is trained, including:
S4021, obtaining the main prediction result produced by the image segmentation model based on the sample input and the supplementary token vector of the current iteration. It should be noted that the supplementary token vector in the first iteration is randomly initialized.
[0039] S4022, constructing the loss function of the image segmentation model based on the main prediction result and the sample label, and performing backpropagation to update the model parameters of the image segmentation model.
In the branch training, another model is trained as a partner model to the main model, including:
S4023, obtaining the branch prediction result produced by the image segmentation partner model based on the sample input and the supplementary token vector of the current iteration;
S4024, constructing the loss function of the image segmentation partner model based on the branch prediction result and the sample label, and performing backpropagation to update the supplementary token vector and the model parameters of the image segmentation partner model.
S403, the next iteration of training is performed based on the updated supplementary token vector and model parameters until the model converges, and the trained image segmentation model is output.
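The alternating schedule of S4021 to S4024 can be illustrated with a deliberately tiny toy problem. In the sketch below both "models" are scalar linear predictors trained with a squared-error loss and hand-written gradients; this is an assumption made purely so the example runs without a deep learning framework, and `State`, `main_step`, `branch_step` and `train` are hypothetical names. What it is meant to show is only which step updates which quantity: the main step updates main-model parameters and leaves the token untouched, while the branch step updates the partner parameters and the supplementary token.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class State:
    w: float = 0.0  # main-model weight (toy stand-in for main model parameters)
    g: float = 0.0  # main-model gate on the token (stand-in for the supplementary token module)
    v: float = 0.0  # partner-model weight
    t: float = 0.1  # supplementary token, arbitrary initial value


def main_step(s: State, x: float, y: float, lr: float) -> float:
    """Main training: the prediction uses the token, but only w and g are updated."""
    err = (s.w * x + s.g * s.t) - y
    s.w -= lr * err * x
    s.g -= lr * err * s.t
    return 0.5 * err * err


def branch_step(s: State, x: float, y: float, lr: float) -> float:
    """Branch training: updates the partner weight v and the supplementary token t."""
    err = (s.v * x + s.t) - y
    s.v -= lr * err * x
    s.t -= lr * err
    return 0.5 * err * err


def train(samples: Sequence[Tuple[float, float]], epochs: int,
          lr: float) -> Tuple[State, List[Tuple[float, float]]]:
    s = State()
    history = []
    for _ in range(epochs):
        main_loss = branch_loss = 0.0
        for x, y in samples:
            main_loss += main_step(s, x, y, lr)      # S4021 and S4022
            branch_loss += branch_step(s, x, y, lr)  # S4023 and S4024
        history.append((main_loss, branch_loss))
    return s, history


state, history = train([(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)], epochs=200, lr=0.05)
print(history[0], history[-1])
```

The token learned in the branch step is read by the main step in the following iteration, which mirrors how the updated supplementary token vector is carried into the next round in S403.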
[0040] Understandably, the structure of the main line image segmentation model is the same as that in the previous embodiment. The specific calculation process can be referred to the description in steps S201-S206 and S301-S304, which will not be repeated here.
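[0041] In S4022, the loss function of the main image segmentation model can be expressed as: $\mathcal{L}_{m} = \ell\big(D_{m}(B_{m}(x)),\, y\big)$; where $x$ represents the sample input, $y$ represents the sample label, $D_{m}$ indicates the main decoder, $B_{m}$ indicates the main backbone network, which also receives the supplementary token vector of the current iteration, and $\ell$ denotes the segmentation loss between a prediction and its label.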
[0042] Understandably, the values of the above loss function can be used for backpropagation to update the parameters of other models in the image segmentation model besides the main backbone network, including the parameters of the main decoder and the parameters of the supplementary token module.
[0043] In some embodiments, the image segmentation partner model includes a partner backbone network, a partner decoder, and N token query modules. The process of obtaining the branch prediction result in S4023 specifically includes:
S501, each sub-token $T_i$ of the supplementary token vector in the current iteration is input into the corresponding token query module $\mathrm{T2QB}_i$ to generate the corresponding result query representation $R_i$, where $q$ indicates the number of dimensions of the fused query;
S502, the result query representations $R_i$ of all token query modules are summed to obtain the fused query vector. The formula is expressed as: $R = \sum_{i=1}^{N} R_i$;
S503, features are extracted from the sample input $x$ through the partner backbone network, and the sample feature map is output;
S504, the fused query vector $R$ and the sample feature map are input into the partner decoder to obtain the branch prediction result.
[0044] It should be noted that by integrating the supplementary token vector into the learnable query vector and then transmitting it to the partner decoder of the partner network, the supplementary token vector can learn the feature extraction and processing patterns of the partner network. The supplementary token vector can be updated through backpropagation of the loss function. In the next iteration and during the actual inference process of the model, processing can continue based on the updated supplementary token vector. This allows the supplementary token vector, which has learned the feature representation of the partner network, to be integrated into the main image segmentation model through the supplementary token module. This enables the generation of supplementary features based on the supplementary token vector to enhance the representation of the image segmentation model.
[0045] Understandably, the image segmentation partner model is only used during training and is discarded during inference to avoid increasing the inference time of the main image segmentation model, thereby improving the model's efficiency while increasing its segmentation accuracy.
[0046] It should be noted that the partner decoder of the image segmentation partner model adopts Mask2Former. In the Mask2Former decoding process, the query vector, as a set of learnable instance-level representations, interacts layer by layer with the multi-scale feature maps from the encoder through a cross-attention mechanism. As the decoding layers iterate, the query vector can continuously absorb semantic and boundary information from different scales, and its internal representation gradually evolves from the initial abstract embedding to high-level features with explicit instance semantics.
[0047] In some embodiments, the process of generating the corresponding result query representation in S501 specifically includes:
the i-th sub-token $T_i$ of the supplementary token vector for the current iteration is input into the i-th token query module, which performs a linear transformation on each vector $t_i^{j}$ of the i-th sub-token to obtain the corresponding query vector $z_i^{j}$;
the maximum query vector component $z_i^{\max}$ and the average query vector component $z_i^{\mathrm{avg}}$ are calculated over all query vectors obtained by the i-th token query module;
the maximum query vector component $z_i^{\max}$, the average query vector component $z_i^{\mathrm{avg}}$, and the query vector $z_i^{K-1}$ corresponding to the last vector $t_i^{K-1}$ of the i-th sub-token are concatenated, and the concatenation result is linearly transformed and mapped to obtain the result query representation $R_i$ of the i-th token query module. The formula is expressed as: $R_i = \mathrm{MLP}\big(\mathrm{Concat}(z_i^{\max},\, z_i^{\mathrm{avg}},\, z_i^{K-1})\big)$; where $\mathrm{Concat}$ represents the concatenation function, and MLP represents the multilayer perceptron.
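The token query module described above, together with the summation of S502, can be sketched in a few lines of plain Python. This is an illustrative sketch only: `token_query` and `fused_query` are hypothetical names, the maximum is assumed to be taken element-wise over the query vectors, and the final MLP is reduced to a single linear layer (`w_out`, `b_out`) to keep the example short.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[Vector]
Params = Tuple[Matrix, Vector, Matrix, Vector]  # (w_in, b_in, w_out, b_out)


def linear(w: Matrix, b: Vector, v: Vector) -> Vector:
    """Affine map: one output per row of w."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]


def token_query(sub_token: Sequence[Vector], w_in: Matrix, b_in: Vector,
                w_out: Matrix, b_out: Vector) -> Vector:
    """Map the K-1 vectors of one sub-token to a single result query representation."""
    queries = [linear(w_in, b_in, vec) for vec in sub_token]  # one query per vector
    q_max = [max(col) for col in zip(*queries)]               # element-wise maximum (assumed)
    q_avg = [sum(col) / len(col) for col in zip(*queries)]    # element-wise average
    concat = q_max + q_avg + queries[-1]                      # max, average, last query
    return linear(w_out, b_out, concat)                       # single layer stands in for the MLP


def fused_query(sub_tokens: Sequence[Sequence[Vector]],
                modules: Sequence[Params]) -> Vector:
    """Sum the result query representations of all N token query modules."""
    results = [token_query(tok, *params) for tok, params in zip(sub_tokens, modules)]
    return [sum(col) for col in zip(*results)]


identity = [[1.0, 0.0], [0.0, 1.0]]
params: Params = (identity, [0.0, 0.0], [[1.0] * 6], [0.0])  # toy weights, output size 1
sub_token = [[1.0, 2.0], [3.0, 0.0]]                         # K-1 = 2 vectors, c = 2

single = token_query(sub_token, *params)
fused = fused_query([sub_token, sub_token], [params, params])  # N = 2 modules
print(single, fused)
```

In the application the fused query is passed to the partner decoder together with the partner backbone's feature map; that decoder is outside the scope of this sketch.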
[0048] In S4024, the loss function of the image segmentation partner model, constructed based on the branch prediction result and the sample label, can be expressed as: $\mathcal{L}_{p} = \ell\big(D_{p}(B_{p}(x),\, T),\, y\big)$; where $x$ indicates the sample input, $D_{p}$ indicates the partner decoder, $B_{p}$ indicates the partner backbone network, $y$ indicates the sample label, $T$ represents the supplementary token vector, and $\ell$ denotes the segmentation loss between a prediction and its label.
[0049] Understandably, the values of the above loss function can be used for backpropagation to update the parameters of other models in the image segmentation partner model besides the partner backbone, including the parameters of the partner decoder and the token query module, thereby updating the supplementary token vector $T$ for the current iteration.
[0050] In some specific embodiments, the following four types of models can be selected for processing to obtain the prediction results of each model for the same input image, including: Model A, a ResNet-18 model trained alone; Model B, a Convnext-Small model trained alone; Model C, a ResNet-18 model supplemented by Convnext-Small trained with Convnext-Small as a partner model and ResNet-18 as the main model; and Model D, a Convnext-Small model supplemented by ResNet-18 trained with ResNet-18 as a partner model and Convnext-Small as the main model.
[0051] Two input images, F1 and F2, are input into the four-class model AD, respectively. The comparison results obtained based on input image F1 are as follows: Figure 4 As shown, the comparison results obtained based on the input image F2 are as follows: Figure 5 As shown, Figure 4 , Figure 5The first column is the input image, the second column is the prediction result of model A, the third column is the prediction result of model C, the fourth column is the prediction result of model B, the fifth column is the prediction result of model D, and the sixth column is the true segmentation result (label) corresponding to the input image.
[0052] As shown in Figure 4, comparing the segmentation results of Model A and Model C shows that the ResNet-18 model supplemented by Convnext-Small segments the interiors of the airplane and the grass more accurately; comparing the segmentation results of Model B and Model D shows that the Convnext-Small model supplemented by ResNet-18 makes the boundaries between people and vehicles clearer. Similarly, as shown in Figure 5, comparing the segmentation results of Model A and Model C shows that the ResNet-18 model supplemented by Convnext-Small segments the flower pots more realistically and segments the bathtub closer to the true label; comparing the segmentation results of Model B and Model D shows that the Convnext-Small model supplemented by ResNet-18 makes the segmentation boundary of the towels clearer and removes some incorrect segmentations.
[0053] Therefore, the mainline model trained according to this application with the support of a partner network uses learnable supplementary token vectors to obtain feature representation patterns from the partner network of the branch and integrates them into the mainline model. This enriches the overall semantic representation of the mainline model and improves semantic segmentation performance, yielding better segmentation results than models trained alone and allowing accuracy to improve further even when the model has reached a performance bottleneck. The segmentation results are more consistent with the actual situation, and the boundaries between different targets and between targets and the background are identified more accurately.
[0054] The following are embodiments of the apparatus for overcoming performance bottlenecks according to this application, which can be used to execute the method embodiments of this application. For details not disclosed in the apparatus embodiments of this application, please refer to the method embodiments of this application.
[0055] Please refer to Figure 6, which is a schematic diagram of an exemplary embodiment of the image segmentation apparatus provided in this application. The apparatus includes: a data acquisition module, used to acquire the image to be processed and the supplementary token vector obtained during training; and an image processing module, used to take the supplementary token vector and the image to be processed as input to an image segmentation model; wherein the image segmentation model includes a main backbone network and a main decoder, and the processing procedure of the image segmentation model includes: the image processing module is also used to extract features from the image to be processed through the main backbone network and obtain a fused feature map by combining the supplementary token vector; and the image processing module is further configured to decode the fused feature map using the main decoder to obtain the image segmentation result corresponding to the image to be processed.
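The module division described above can be sketched as two small Python classes. This is only one possible arrangement for illustration: the class names, the `acquire` and `segment` methods, the nested-list image type, and the toy backbone and decoder used in the example are hypothetical and do not come from this application.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Image = List[List[float]]           # hypothetical: grayscale image as nested lists
Tokens = Sequence[Sequence[float]]  # hypothetical: supplementary token vector
LabelMap = List[List[int]]


@dataclass
class DataAcquisitionModule:
    tokens: Tokens  # supplementary token vector obtained during training

    def acquire(self, image: Image):
        """Return the image to be processed together with the stored tokens."""
        return image, self.tokens


@dataclass
class ImageProcessingModule:
    backbone: Callable[[Image, Tokens], Image]  # image + tokens -> fused feature map
    decoder: Callable[[Image], LabelMap]        # fused feature map -> segmentation result

    def segment(self, image: Image, tokens: Tokens) -> LabelMap:
        fused = self.backbone(image, tokens)
        return self.decoder(fused)


# Toy stand-ins: the "backbone" adds the first token value to every pixel,
# and the "decoder" thresholds the fused map at 0.5.
def toy_backbone(image, tokens):
    return [[px + tokens[0][0] for px in row] for row in image]


def toy_decoder(fused):
    return [[1 if px > 0.5 else 0 for px in row] for row in fused]


acquisition = DataAcquisitionModule(tokens=[[0.2]])
processing = ImageProcessingModule(backbone=toy_backbone, decoder=toy_decoder)
image, tokens = acquisition.acquire([[0.2, 0.9], [0.4, 0.1]])
print(processing.segment(image, tokens))  # [[0, 1], [1, 0]]
```

Passing the backbone and decoder in as callables keeps the sketch independent of any particular network, which matches the statement that the functions may be assigned to different functional modules as needed.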
[0056] It should be noted that the apparatus provided in the above embodiments is only illustrated by the division of the above functional modules when performing an image segmentation method. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus and method embodiments provided in the above embodiments belong to the same concept, and the implementation process can be found in the method embodiments, which will not be repeated here.
[0057] This application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of any of the methods described above.
[0058] Please refer to Figure 7, which is a structural block diagram of an electronic device provided in an embodiment of this application.
[0059] As shown in Figure 7, the electronic device includes a processor and a memory.
[0060] In this embodiment, the processor is the control center of the computer system, and can be a processor of a physical machine or a processor of a virtual machine. The processor may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc. The processor can be implemented using at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
[0061] A processor can also include a main processor and a coprocessor. The main processor is used to process data in the wake-up state and is also called the CPU (Central Processing Unit). The coprocessor is a low-power processor used to process data in the standby state.
[0062] The memory may include one or more computer-readable storage media, which may be non-transitory. The memory may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In some embodiments of this application, the non-transitory computer-readable storage media in the memory are used to store at least one instruction, which is executed by a processor to implement the methods in the embodiments of this application.
[0063] In some embodiments, the electronic device further includes a peripheral device interface and at least one peripheral device. The processor, memory, and peripheral device interface are connected via a bus or signal line. Each peripheral device is connected to the peripheral device interface via a bus, signal line, or circuit board. Specifically, the peripheral device includes: a display screen, a camera, and audio circuitry. The peripheral device interface can be used to connect at least one I/O (Input/Output) related peripheral device to the processor and memory.
[0064] In some embodiments of this application, the processor, memory, and peripheral device interfaces are integrated on the same chip or circuit board; in other embodiments of this application, any one or two of the processor, memory, and peripheral device interfaces can be implemented on separate chips or circuit boards. This application does not specifically limit the implementation in this regard.
[0065] The electronic device structural block diagrams shown in the embodiments of this application do not constitute a limitation on the electronic device. The electronic device may include more or fewer components than shown, or combine certain components, or use different component arrangements.
[0066] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the methods in any of the foregoing embodiments. The computer-readable storage medium may include, but is not limited to, any type of disk, including floppy disks, optical disks, DVDs, CD-ROMs, microdrives, as well as magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic cards or optical cards, nanosystems (including molecular memory ICs), or any type of medium or device suitable for storing instructions and/or data.
[0067] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the parts that contribute to the related technology, can be embodied in the form of software products. This computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0068] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. An image segmentation method, characterized in that it comprises: obtaining the image to be processed and the supplementary token vector obtained during training; using the supplementary token vector and the image to be processed as inputs to the image segmentation model; wherein the image segmentation model includes a main backbone network and a main decoder, and the processing procedure of the image segmentation model includes: performing feature extraction on the image to be processed using the main backbone network, and obtaining a fused feature map by combining the supplementary token vector; and decoding the fused feature map with the main decoder to obtain the image segmentation result corresponding to the image to be processed.
2. The image segmentation method according to claim 1, characterized in that the main backbone network includes N processing layers, which are connected sequentially in a preset order; each processing layer corresponds to a supplementary token module, and each processing layer includes K operation modules; each operation module is used to perform a preset type of operation on the input feature map; the supplementary token vector includes N sub-tokens, and each sub-token includes K-1 vectors; the step of extracting features from the image to be processed through the main backbone network and obtaining a fused feature map by combining the supplementary token vector includes: inputting the fused feature map of the (i-1)-th processing layer into the i-th processing layer; wherein the fused feature map of the first processing layer is obtained based on the first sub-token of the supplementary token vector and the image to be processed; in the i-th processing layer, the (j+1)-th operation module of the i-th processing layer takes the output feature map of the j-th operation module as input and outputs the (j+1)-th intermediate feature map; wherein the output feature map of the first operation module of the i-th processing layer is obtained based on the fused feature map of the (i-1)-th processing layer; the supplementary token module of the i-th processing layer processes the output feature map of the j-th operation module and the j-th vector of the i-th sub-token to obtain the j-th supplementary feature map; the output feature map of the (j+1)-th operation module is obtained by weighted summation of the (j+1)-th intermediate feature map and the j-th supplementary feature map; the process continues until the output feature map of the K-th operation module is obtained, which is then used as the fused feature map of the i-th processing layer; and the fused feature map of the i-th processing layer is used as the input of the (i+1)-th processing layer until the fused feature map of the N-th processing layer is obtained.
3. The image segmentation method according to claim 2, characterized in that the supplementary token module of the i-th processing layer obtaining the j-th supplementary feature map based on the output feature map of the j-th operation module and the j-th vector of the i-th sub-token includes: obtaining the key vector and value vector by processing the output feature map of the j-th operation module; obtaining the query vector by processing the j-th vector of the i-th sub-token; inputting the key vector, value vector, and query vector into the cross-attention mechanism to obtain the attention feature map; and obtaining the j-th supplementary feature map by fusing the attention feature map and the output feature map of the j-th operation module based on the multilayer perceptron.
4. The image segmentation method according to claim 3, characterized in that the training process of the image segmentation model specifically includes: obtaining a pre-constructed sample set; wherein the sample set includes multiple samples, the sample input of each sample is the original image, and the sample label is the true segmentation result obtained based on the annotation of the original image; performing training in multiple iteration rounds based on the sample set, with mainline training and branch training performed in each iteration round; in the mainline training, acquiring the mainline prediction result obtained by the image segmentation model based on the sample input and the supplementary token vector of the current iteration round; constructing the loss function of the image segmentation model based on the mainline prediction result and the sample labels, and performing backpropagation to update the model parameters of the image segmentation model; in the branch training, acquiring the branch prediction results obtained by the image segmentation partner model based on the sample input and the supplementary token vector of the current iteration round; constructing the loss function of the image segmentation partner model based on the branch prediction segmentation results and sample labels, and performing backpropagation to update the supplementary token vector and the model parameters of the image segmentation partner model; and performing the next iteration round of training based on the updated supplementary token vector and model parameters until the model converges, and outputting the trained image segmentation model.
5. The image segmentation method according to claim 4, characterized in that the image segmentation partner model includes a partner backbone network, a partner decoder, and N token query modules; the acquiring of the branch prediction results obtained by the image segmentation partner model based on the sample input and the supplementary token vector of the current iteration round includes: inputting each sub-token of the supplementary token vector in the current iteration round into the corresponding token query module to generate the corresponding result query representation; obtaining the fused query vector by summing the result query representations from each token query module; performing feature extraction on the sample input through the partner backbone network, and outputting sample feature maps; and inputting the fused query vector and the sample feature map into the partner decoder to obtain the branch prediction result.
6. The image segmentation method according to claim 5, characterized in that the step of inputting each sub-token of the supplementary token vector in the current iteration round into the corresponding token query module to generate the corresponding result query representation includes: inputting the i-th sub-token of the supplementary token vector in the current iteration round into the i-th token query module; the i-th token query module performs a linear transformation on each vector of the i-th sub-token to obtain the query vector corresponding to each vector; the maximum query vector component and the average query vector component are calculated based on all query vectors obtained from the i-th token query module; and the maximum query vector component, the average query vector component, and the query vector corresponding to the last vector of the i-th sub-token are concatenated, and the concatenation result is linearly transformed to obtain the result query representation of the i-th token query module.
7. The image segmentation method according to claim 6, characterized in that the loss function of the image segmentation partner model, constructed based on the branch prediction segmentation results and sample labels, is expressed as follows: $\mathcal{L}_{p} = \ell\big(D_p(B_p(x),\, T),\, y\big)$; where $x$ indicates the sample input, $D_p$ indicates the partner decoder, $B_p$ indicates the partner backbone network, $y$ indicates the sample label, $T$ represents the supplementary token vector, and $\ell$ denotes the segmentation loss between the branch prediction and the label.
8. An image segmentation apparatus, characterized in that it comprises: a data acquisition module, used to acquire the image to be processed and the supplementary token vector obtained during training; and an image processing module, used to take the supplementary token vector and the image to be processed as input to an image segmentation model; wherein the image segmentation model includes a main backbone network and a main decoder, and the processing procedure of the image segmentation model includes: the image processing module is also used to extract features from the image to be processed through the main backbone network and obtain a fused feature map by combining the supplementary token vector; and the image processing module is further configured to decode the fused feature map using the main decoder to obtain the image segmentation result corresponding to the image to be processed.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the method as described in any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 7.