An image segmentation method, device, vehicle and storage medium
By employing an attention mechanism in image segmentation to enhance and fuse instance features and semantic features bidirectionally, the problem of insufficient image segmentation accuracy and robustness in existing technologies is solved, achieving higher segmentation accuracy and driving safety.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NINGBO LOTUS ROBOTICS CO LTD
- Filing Date
- 2023-09-04
- Publication Date
- 2026-06-12
AI Technical Summary
Existing image segmentation methods fail to selectively enhance useful information when fusing semantic segmentation and instance segmentation information, resulting in poor image segmentation accuracy and robustness, which affects driving safety.
An attention mechanism is used to enhance and fuse instance features and semantic features in both directions. Multiple attention mechanisms are used to selectively enhance feature information at different stages, including using first and second attention mechanisms to complementarily enhance instance features and semantic features, and optimizing candidate box information and target category through a region candidate box network.
It improves the accuracy and robustness of image segmentation, enhancing driving safety, especially in panoramic segmentation accuracy and robustness under complex scenes and harsh conditions.
Figure CN117274586B_ABST
Description
Technical Field
[0001] This application belongs to the field of vehicle technology, and in particular relates to an image segmentation method, apparatus, vehicle, and storage medium.
Background Technology
[0002] With the rapid development of intelligent driving technology, image segmentation has become a major research hotspot in the field of vehicle technology. Image segmentation includes semantic segmentation, instance segmentation, and panoptic segmentation. Semantic segmentation primarily predicts the semantic category of each pixel in an image, achieving pixel-level segmentation, focusing on the segmentation of the image background. Instance segmentation mainly performs pixel-level segmentation of instance objects in an image, focusing on the segmentation of foreground instances. Panoptic segmentation predicts the semantic category of each pixel in an image and assigns instance identification codes to pixels belonging to instance objects, enabling a more comprehensive scene understanding.
[0003] Semantic segmentation and instance segmentation are complementary and related. To improve the accuracy of image segmentation, semantic segmentation information and instance segmentation information can be fused. However, most current image segmentation methods perform one-way enhancement fusion of semantic segmentation information and instance segmentation information, without specifically selecting useful information from semantic segmentation information to enhance instance segmentation information, or specifically selecting useful information from instance segmentation information to enhance semantic segmentation information. This results in poor accuracy and robustness of image segmentation, affecting driving safety.
Summary of the Invention
[0004] To address the aforementioned technical problems, this application provides an image segmentation method, apparatus, vehicle, and storage medium, which can improve the accuracy and robustness of image segmentation and enhance driving safety.
[0005] This application provides an image segmentation method, comprising: extracting image features of an image to be segmented to determine instance features and semantic features of the image to be segmented; fusing the instance features and semantic features based on an attention mechanism to determine a fusion result in which the instance features and semantic features mutually enhance each other; and segmenting the image to be segmented according to the fusion result in which the instance features and semantic features mutually enhance each other to determine the segmentation result of the image to be segmented.
[0006] In one embodiment, the step of fusing the instance features and the semantic features based on an attention mechanism to determine a fusion result in which the instance features and the semantic features mutually enhance each other includes: fusing a first instance feature and a first semantic feature of the image to be segmented based on a first attention mechanism to enhance the first instance feature through the first semantic feature and determine a second instance feature; fusing the first semantic feature and the second instance feature based on a second attention mechanism to enhance the first semantic feature through the second instance feature and determine a second semantic feature; and obtaining the fusion result.
[0007] In one embodiment, the step of fusing a first instance feature and a first semantic feature of the image to be segmented based on a first attention mechanism to enhance the first instance feature through the first semantic feature and determine a second instance feature includes: fusing a first instance feature and a first semantic feature of the image to be segmented based on a first attention mechanism to enhance the first instance feature through the first semantic feature and determine a third instance feature; extracting candidate box information and target category corresponding to the third instance feature based on a region candidate box network to determine a fourth instance feature; performing convolution processing on the first instance feature and the fourth instance feature to optimize the candidate box information and the target category and determine the second instance feature; wherein, the fourth instance feature includes the third instance feature, the candidate box information corresponding to the third instance feature, and the target category.
[0008] In one embodiment, the second attention mechanism includes a third attention mechanism and a fourth attention mechanism; the step of fusing the first semantic feature and the second instance feature based on the second attention mechanism to enhance the first semantic feature through the second instance feature and determine the second semantic feature includes: obtaining the fourth instance feature; fusing the first semantic feature and the fourth instance feature based on the third attention mechanism to enhance the first semantic feature through the fourth instance feature and determine the instance-enhanced first semantic feature; and fusing the instance-enhanced first semantic feature and the second instance feature based on the fourth attention mechanism to enhance the instance-enhanced first semantic feature through the second instance feature and determine the second semantic feature.
[0009] In one embodiment, the candidate box information includes the candidate box size; before performing convolution processing on the first instance feature and the fourth instance feature, the method includes: normalizing the candidate box size; before fusing the instance-enhanced first semantic feature and the second instance feature based on the fourth attention mechanism, the method includes: upsampling the second instance feature to restore the candidate box size to the size before normalization.
[0010] In one embodiment, the step of segmenting the image to be segmented based on the fusion result of the instance features and the semantic features to determine the segmentation result of the image to be segmented includes: performing semantic segmentation on the image to be segmented based on the second semantic features to determine the semantic segmentation result of the image to be segmented.
[0011] In one embodiment, the step of segmenting the image to be segmented based on the fusion result of the instance features and the semantic features to determine the segmentation result of the image to be segmented includes: performing instance segmentation on the image to be segmented based on the second instance features to determine the instance segmentation result of the image to be segmented; performing semantic segmentation on the image to be segmented based on the second semantic features to determine the semantic segmentation result of the image to be segmented; and fusing the semantic segmentation result and the instance segmentation result to determine the panoramic segmentation result of the image to be segmented.
[0012] This application also provides an image segmentation apparatus, which includes a feature extraction module, a feature fusion module, and an image segmentation module. The feature extraction module is used to extract image features of an image to be segmented to determine instance features and semantic features of the image to be segmented. The feature fusion module is used to fuse the instance features and the semantic features based on an attention mechanism to determine a fusion result in which the instance features and the semantic features mutually enhance each other. The image segmentation module is used to segment the image to be segmented based on the fusion result in which the instance features and the semantic features mutually enhance each other, and to determine the segmentation result of the image to be segmented.
[0013] This application also provides a vehicle that includes the above-described image segmentation apparatus.
[0014] This application also provides a storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described image segmentation method.
[0015] This application provides an image segmentation method, apparatus, vehicle, and storage medium that enhances and fuses instance features and semantic features based on an attention mechanism, and segments the image to be segmented according to the fusion result of the mutual enhancement of instance features and semantic features. This can improve the accuracy and robustness of image segmentation and enhance driving safety.
Attached Figure Description
[0016] Figure 1 is a flowchart illustrating the image segmentation method provided in Embodiment 1 of this application;
[0017] Figure 2 is a schematic diagram illustrating the principle of the first attention mechanism provided in Embodiment 1 of this application;
[0018] Figure 3 is a schematic diagram illustrating the principle of the third attention mechanism provided in Embodiment 1 of this application;
[0019] Figure 4 is a cropping diagram of the image to be segmented provided in Embodiment 1 of this application;
[0020] Figure 5 is a schematic flowchart of the image segmentation method provided in Embodiment 1 of this application;
[0021] Figure 6 is a schematic diagram of the image segmentation result of the image to be segmented provided in Embodiment 1 of this application;
[0022] Figure 7 is a schematic diagram of the image segmentation device provided in Embodiment 2 of this application.
Detailed Implementation
[0023] The technical solutions of this application will be further described in detail below with reference to the accompanying drawings and specific embodiments. Unless otherwise defined, all technical and scientific terms used in this application have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this application. The word "and / or" as used herein includes any and all combinations of one or more of the associated listed items.
[0024] Figure 1 is a schematic flowchart of the image segmentation method provided in Embodiment 1 of this application. As shown in Figure 1, the image segmentation method of this application may include the following steps:
[0025] Step S10: Extract image features of the image to be segmented to determine the instance features and semantic features of the image to be segmented;
[0026] Step S20: Based on the attention mechanism, fuse instance features and semantic features to determine the fusion result in which instance features and semantic features mutually enhance each other;
[0027] Step S30: Based on the fusion result of the mutual enhancement of instance features and semantic features, segment the image to be segmented and determine the segmentation result of the image to be segmented.
[0028] The image segmentation method provided in Embodiment 1 of this application uses an attention mechanism to enhance and fuse instance features and semantic features, and segments the image to be segmented based on the fusion result of the mutual enhancement of instance features and semantic features. This can improve the accuracy and robustness of image segmentation and enhance driving safety.
[0029] Optionally, multi-scale image features of the image to be segmented are extracted using a feature extraction network. These multi-scale image features are then duplicated, with one copy defined as the first instance feature. The other copy undergoes convolution processing to enrich the background information of the multi-scale image features, resulting in the first semantic feature. The feature extraction network includes ResNet, EfficientNet, Feature Pyramid Network (FPN), and dual-path FPN, among others.
[0030] In one embodiment, step S20: Based on an attention mechanism, instance features and semantic features are fused to determine the fusion result in which instance features and semantic features mutually enhance each other, including:
[0031] Based on the first attention mechanism, the first instance features and the first semantic features of the image to be segmented are fused to enhance the first instance features through the first semantic features and determine the second instance features;
[0032] Based on the second attention mechanism, the first semantic feature and the second instance feature are fused to enhance the first semantic feature through the second instance feature and determine the second semantic feature;
[0033] The fusion result is obtained.
[0034] It is worth mentioning that, based on the first attention mechanism, the first semantic feature is used to enhance and fuse the first instance feature, and based on the second attention mechanism, the result of the enhancement and fusion of the first instance feature using the first semantic feature (i.e., the second instance feature) is used to enhance and fuse the first semantic feature. This can achieve bidirectional enhancement and fusion of instance features and semantic features, resulting in a fusion result in which instance features and semantic features enhance each other.
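The order of the two fusion steps can be illustrated with a minimal sketch. The gate used here (a sigmoid over the channel-averaged source feature) is only a placeholder assumption, not the attention mechanisms described in this application, and the names `enhance` and `bidirectional_fusion` are hypothetical. The point is the data flow: semantic features enhance instance features first, and the enhanced instance features then enhance the semantic features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def enhance(target, source):
    """Placeholder attention (assumption): gate `target` with a (0,1) map derived from `source`.

    Both inputs are (C, H, W) arrays with the same spatial size.
    """
    weights = sigmoid(source.mean(axis=0, keepdims=True))  # (1, H, W)
    return target * weights

def bidirectional_fusion(first_instance, first_semantic):
    # Step 1: semantic -> instance (role of the first attention mechanism).
    second_instance = enhance(first_instance, first_semantic)
    # Step 2: instance -> semantic (role of the second attention mechanism),
    # using the already enhanced instance feature.
    second_semantic = enhance(first_semantic, second_instance)
    return second_instance, second_semantic
```

Because step 2 consumes the output of step 1, the second semantic feature carries information that has passed through both directions of enhancement.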
[0035] In one embodiment, based on a first attention mechanism, a first instance feature and a first semantic feature of the image to be segmented are fused to enhance the first instance feature through the first semantic feature, thereby determining a second instance feature, including:
[0036] Based on the first attention mechanism, the first instance features and the first semantic features of the image to be segmented are fused to enhance the first instance features through the first semantic features and determine the third instance features;
[0037] Based on the region candidate box network, candidate box information and target category corresponding to the third instance features are extracted to determine the fourth instance features;
[0038] Convolution processing is performed on the first instance features and the fourth instance features to optimize the candidate box information and target category, and to determine the second instance features;
[0039] The fourth instance feature includes the third instance feature, the candidate box information corresponding to the third instance feature, and the target category.
[0040] Optionally, the first attention mechanism is a location-based attention mechanism.
[0041] As shown in Figure 2, the first semantic feature and the first instance feature are first normalized (Norm), and positional attention is used to generate the attention weight map of the first semantic feature. Then, the scales of the first semantic feature and the first instance feature are adjusted by a convolution operation so that the two are consistent. The attention weight map of the first semantic feature is multiplied with the first instance feature (Multiple), and the result is fused by a 3*3 convolution (Conv(3*3)) to obtain the third instance feature.
[0042] It is worth mentioning that, based on the fact that pixels with different semantics must belong to different instances, positional attention is employed. By calculating the attention weights of information at each position in the corresponding image within the first semantic feature, an attention weight map of the first semantic feature is generated. Furthermore, based on the attention weight map of the first semantic feature, the first instance feature and the first semantic feature are fused. This allows for focusing on the information of specific regions in the corresponding image within the first semantic feature to enhance and fuse the first instance feature, improving the effectiveness of semantic feature-to-instance feature enhancement and fusion, and increasing instance segmentation accuracy.
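The sequence Norm, positional attention, multiplication, and Conv(3*3) can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation of this application: the normalization is a plain zero-mean, unit-variance rescaling, the positional attention is a softmax over channel-averaged responses, both features are assumed to already share the same scale (so the scale-adjusting convolution is omitted), and all function names are hypothetical.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """Stand-in for Norm (assumption): zero mean, unit variance over the whole (C, H, W) feature."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def position_attention_map(semantic):
    """One weight per spatial position, softmax-normalized over all H*W positions."""
    scores = semantic.mean(axis=0)          # (H, W) channel-averaged response
    scores = scores - scores.max()          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def conv3x3(x, kernel):
    """Zero-padded 3x3 cross-correlation; x is (C_in, H, W), kernel is (C_out, C_in, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernel.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]
            out += np.einsum("oc,chw->ohw", kernel[:, :, dy, dx], patch)
    return out

def sti_fuse(first_instance, first_semantic, kernel):
    inst = normalize(first_instance)
    sem = normalize(first_semantic)
    weights = position_attention_map(sem)    # attention weight map of the semantic feature
    weighted = inst * weights[None, :, :]    # multiply, broadcast over channels
    return conv3x3(weighted, kernel)         # fused result: third instance feature
```

In a trained network the 3*3 kernel would be learned; here it is passed in so the sketch stays self-contained.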
[0043] After obtaining the third instance feature, a Region Proposal Network (RPN) is used to extract the candidate box information (such as candidate box size, coordinates, etc.) and target category (such as the target belonging to a specific category such as pedestrian or vehicle) corresponding to the third instance feature, resulting in a fourth instance feature including the third instance feature, the candidate box information corresponding to the third instance feature, and the target category. Then, by performing convolution processing on the first instance feature and the fourth instance feature, the candidate box information and target category are optimized and corrected to obtain the second instance feature.
[0044] Optionally, the second attention mechanism includes the third attention mechanism and the fourth attention mechanism.
[0045] In one embodiment, based on a second attention mechanism, a first semantic feature and a second instance feature are fused to enhance the first semantic feature through the second instance feature, thereby determining the second semantic feature, including:
[0046] Obtain the fourth instance feature;
[0047] Based on the third attention mechanism, the first semantic feature and the fourth instance feature are fused to enhance the first semantic feature through the fourth instance feature and determine the instance-enhanced first semantic feature.
[0048] Based on the fourth attention mechanism, the instance-enhanced first semantic feature and the second instance feature are fused to enhance the instance-enhanced first semantic feature through the second instance feature and determine the second semantic feature.
[0049] As shown in Figure 3, first, the instance candidate box features corresponding to the fourth instance feature are fused using a 1*1 convolution (Conv(1*1)) and a rectified linear unit activation function (ReLU). Second, the number of channels is compressed to 1 using Conv(1*1). Then, a first attention weight map of the instance candidate box features, with values in (0,1), is predicted using a sigmoid activation function. The first attention weight map and the first semantic feature are combined using a Hadamard product to obtain the first semantic feature after the first instance enhancement. Next, this feature is processed by global average pooling (GAP), Conv(1*1), and group normalization (GN) in sequence. Finally, a second attention weight map of the instance candidate box features, with values in (0,1), is predicted using a sigmoid activation function. The second attention weight map and the first semantic feature after the first instance enhancement are combined using a Hadamard product to obtain the first semantic feature after the second instance enhancement.
[0050] It is worth mentioning that, based on the semantic similarity of pixels belonging to the same instance, by learning the attention weights of instance candidate box features at various scales, and obtaining the (0,1) attention weight map of instance candidate box features based on two predictions using the Sigmoid activation function, the mining of non-target pixel features within the candidate box can be enhanced, and invalid or interfering information within the candidate box can be suppressed. Furthermore, based on the (0,1) attention weight map of instance candidate box features, the first semantic feature and the fourth instance feature are fused. This can fully utilize the non-target pixel features within the candidate box in the fourth instance feature to enhance the fusion of the first semantic feature, improve the effect of instance features enhancing semantic features, and improve the semantic segmentation accuracy.
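The two-stage gating of the third attention mechanism might look roughly like the sketch below. It assumes the instance candidate box features have already been placed on the same H*W grid as the first semantic feature, uses group normalization without learned scale or shift, and takes the convolution weights as arguments; the names `its_fuse`, `w_mix`, `w_squeeze`, and `w_channel` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight):
    """1x1 convolution as channel mixing: weight is (C_out, C_in), x is (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def group_norm(x, groups, eps=1e-5):
    """Group normalization without learned parameters; C must be divisible by `groups`."""
    c, h, w = x.shape
    g = x.reshape(groups, -1)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(c, h, w)

def its_fuse(box_features, first_semantic, w_mix, w_squeeze, w_channel, groups=2):
    # Stage 1: Conv(1*1) + ReLU, compress to one channel, sigmoid -> spatial gate in (0, 1).
    mixed = np.maximum(conv1x1(box_features, w_mix), 0.0)
    spatial_gate = sigmoid(conv1x1(mixed, w_squeeze))          # (1, H, W)
    stage1 = first_semantic * spatial_gate                     # Hadamard product
    # Stage 2: GAP -> Conv(1*1) -> GN -> sigmoid -> per-channel gate in (0, 1).
    pooled = stage1.mean(axis=(1, 2), keepdims=True)           # (C, 1, 1)
    channel_gate = sigmoid(group_norm(conv1x1(pooled, w_channel), groups))
    return stage1 * channel_gate                               # Hadamard product
```

Since both gates lie in (0,1), the module can only attenuate responses of the semantic feature, which matches the described goal of suppressing invalid or interfering information within the candidate box.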
[0051] Optionally, the fourth attention mechanism can be the same as or a different attention mechanism from the third attention mechanism. After enhancing and fusing the first semantic feature with the fourth instance feature based on the third attention mechanism to obtain the instance-enhanced first semantic feature, the fourth attention mechanism is then used to enhance and fuse the instance-enhanced first semantic feature with the second instance feature to determine the second semantic feature. This is equivalent to using useful information (such as non-target pixel features within the candidate box) from instance features at different stages to enhance the first semantic feature multiple times, further improving the effect of instance features enhancing and fusing semantic features and increasing semantic segmentation accuracy, such as the segmentation accuracy of foreground object edge details. Furthermore, before enhancing and fusing the instance-enhanced first semantic feature with the second instance feature based on the fourth attention mechanism, convolution processing can be performed on the instance-enhanced first semantic feature to further enrich the background information in the instance-enhanced first semantic feature and improve semantic segmentation accuracy.
[0052] Optionally, the candidate box information includes the candidate box size. In one embodiment, before performing convolution processing on the first instance features and the fourth instance features, the candidate box size is normalized.
[0053] In one embodiment, before fusing the instance-enhanced first semantic feature and the second instance feature based on the fourth attention mechanism, the method includes: upsampling the second instance feature to restore the candidate box size to the size before normalization.
[0054] Specifically, to improve the efficiency of convolution processing of the first instance feature and the fourth instance feature, the candidate box size can be normalized before convolution processing, such as unifying the candidate box size to 28*28. Correspondingly, before fusing the instance-enhanced first semantic feature and the second instance feature based on the fourth attention mechanism, the second instance feature needs to be upsampled to restore the candidate box size to its pre-normalization size. This ensures size alignment between the instance-enhanced first semantic feature and the second instance feature, guaranteeing the accuracy of their fusion.
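The size normalization and its later reversal can be illustrated with a small sketch. Nearest-neighbour resampling is used here purely for simplicity and is an assumption; the text names Roi-Align and RoiUpsample but does not specify their interpolation. The box format `(x0, y0, x1, y1)` in integer feature-map coordinates and the function names are also hypothetical.

```python
import numpy as np

def resize_nearest(feature, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) array to (C, out_h, out_w)."""
    _, h, w = feature.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feature[:, rows[:, None], cols[None, :]]

def normalize_box(feature_map, box, size=28):
    """Crop the candidate box and bring it to a fixed size (28*28 by default)."""
    x0, y0, x1, y1 = box
    crop = feature_map[:, y0:y1, x0:x1]
    return resize_nearest(crop, size, size)

def restore_box(roi_feature, box):
    """Resample a fixed-size box feature back to the box's original size."""
    x0, y0, x1, y1 = box
    return resize_nearest(roi_feature, y1 - y0, x1 - x0)
```

With every box at the same 28*28 size, box features can be stacked and convolved as one batch; restoring the original size afterwards lets each box be aligned with the corresponding region of the semantic feature.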
[0055] In one embodiment, step S30: based on the fusion result of instance features and semantic features, segment the image to be segmented to determine the segmentation result of the image to be segmented, including:
[0056] Based on the second semantic feature, semantic segmentation is performed on the image to be segmented to determine the semantic segmentation result of the image to be segmented.
[0057] In one embodiment, step S30: based on the fusion result of instance features and semantic features, segment the image to be segmented to determine the segmentation result of the image to be segmented, including:
[0058] Based on the second instance features, instance segmentation is performed on the image to be segmented to determine the instance segmentation result of the image to be segmented;
[0059] Based on the second semantic feature, semantic segmentation is performed on the image to be segmented to determine the semantic segmentation result of the image to be segmented;
[0060] The semantic segmentation results and instance segmentation results are fused to determine the panoramic segmentation result of the image to be segmented.
[0061] It is worth mentioning that, after the above-mentioned fusion process of instance features and semantic features, which enhances the semantic features and instance features, semantic segmentation of the image to be segmented is performed based on the second semantic feature, one of the mutually enhanced fusion results, to obtain an accurate semantic segmentation result of the image to be segmented. Similarly, instance segmentation of the image to be segmented is performed based on the second instance feature, another of the mutually enhanced fusion results, to obtain an accurate instance segmentation result of the image to be segmented. Furthermore, by fusing the semantic segmentation result and the instance segmentation result of the image to be segmented, an accurate panoramic segmentation result of the image to be segmented can be obtained.
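The text does not specify how the semantic and instance segmentation results are fused. One common heuristic, shown here only as an assumption, pastes instance masks onto the semantic map in descending score order so that higher-scoring instances keep contested pixels. The function name and the `(mask, class_id, score)` tuple layout are hypothetical.

```python
import numpy as np

def panoptic_fuse(semantic, instances):
    """Merge a semantic class map with instance masks.

    semantic:  (H, W) integer array of per-pixel class ids.
    instances: list of (mask, class_id, score), where mask is an (H, W) boolean array.
    Returns (classes, instance_ids); instance id 0 means "no instance".
    """
    classes = semantic.copy()
    instance_ids = np.zeros_like(semantic)
    ordered = sorted(instances, key=lambda item: item[2], reverse=True)
    for next_id, (mask, class_id, _score) in enumerate(ordered, start=1):
        free = mask & (instance_ids == 0)   # pixels not yet claimed by another instance
        classes[free] = class_id
        instance_ids[free] = next_id
    return classes, instance_ids
```

Each output pixel then has both a semantic class and, for foreground objects, an instance identification code, which is the form of result that panoramic segmentation requires.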
[0062] Optionally, the image segmentation method provided in this application can be implemented based on a trained image segmentation network or other technologies.
[0063] Optionally, the image segmentation model includes a feature extraction network, multiple attention mechanisms, multiple convolutional layers, an RPN network, a region alignment block (Roi-Align), an upsampling layer (RoiUpsample), a semantic segmentation head, an instance segmentation head, and a panoptic fusion block. The feature extraction network comprises a backbone network and a bidirectional FPN network. The backbone network can be an EfficientNet network. The bidirectional FPN network consists of two FPN paths: one branch operates from top to bottom, and the other from bottom to top. The outputs of both the top-down and bottom-up branches are added accordingly for each resolution level.
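The two-path FPN described above can be sketched as follows. The sketch omits the convolutions a real FPN applies at each level, uses nearest-neighbour upsampling and 2*2 average pooling as stand-ins (both assumptions), and requires each level to be exactly half the resolution of the previous one. The function names are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    """2x2 average pooling of a (C, H, W) array with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def two_way_fpn(levels):
    """levels: list of (C, H, W) arrays ordered from fine to coarse resolution."""
    # Top-down path: propagate coarse information to finer levels.
    top_down = [None] * len(levels)
    top_down[-1] = levels[-1]
    for i in range(len(levels) - 2, -1, -1):
        top_down[i] = levels[i] + upsample2x(top_down[i + 1])
    # Bottom-up path: propagate fine information to coarser levels.
    bottom_up = [levels[0]]
    for i in range(1, len(levels)):
        bottom_up.append(levels[i] + downsample2x(bottom_up[i - 1]))
    # Add the outputs of both paths at each resolution level.
    return [td + bu for td, bu in zip(top_down, bottom_up)]
```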
[0064] Optionally, the Lotus dataset is selected, which includes RGB images captured using vehicle-mounted far-view and wide-angle cameras, and the RGB images have dense panoramic segmentation labels. The image segmentation model is trained using image data from the Lotus dataset.
[0065] Optionally, multiple loss functions can be used, such as the initial bounding box (candidate box) regression loss function for instance segmentation, the binary cross-entropy class loss (whether there is a target in the bounding box), the optimized bounding box regression loss function, the cross-entropy class loss function (predicting whether the target in the bounding box is "person", "vehicle", etc.), and the binary cross-entropy loss and cross-entropy loss function for predicting the mask, to supervise the training process of the image segmentation model.
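Two of the loss terms named above, binary cross-entropy and multi-class cross-entropy, can be written out in a few lines. This is a generic sketch of the standard definitions, not the training code of this application; the bounding box regression losses are left out because the text does not state their form.

```python
import numpy as np

def binary_cross_entropy(prob, target, eps=1e-12):
    """Mean binary cross-entropy; prob and target are arrays of the same shape."""
    prob = np.clip(prob, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

def cross_entropy(logits, labels):
    """Mean cross-entropy; logits is (N, K), labels is (N,) with integer class ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))
```

The binary form fits the "is there a target in the box" decision and per-pixel mask prediction, while the multi-class form fits the prediction of which category the target belongs to.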
[0066] Optionally, the SGD optimizer is used during training, with an initial learning rate of 0.007, a learning rate decay of 0.001, and a momentum of 0.9. Training is conducted for a total of 60 epochs, and the optimal model parameters learned during training are saved to obtain the trained image segmentation model. Here, an epoch denotes one full pass of training in which all of the training data has been used once.
[0067] Before training the image segmentation model, the image data in the Lotus dataset needs to be preprocessed: the RGB images captured by the camera are cropped, and the required dataset JSON files and Stuff Thing Map images are generated according to the COCO dataset processing format. Specifically, the resolution of the RGB images captured by the camera is 1024*768. Since the RGB images have black borders after correction at the far view but are normal at the wide view, they can be processed as shown in Figure 4: the RGB images acquired from the far view are cropped from the top left corner (excluding the black border) to unify their resolution to 768*512. Correspondingly, the RGB images acquired from the wide view are cropped by extracting the middle area of the image to ensure image quality and model generalization.
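The two cropping rules can be sketched with array slicing. The width of the black border is not given in the text, so the `left` and `top` offsets are hypothetical parameters, and the 768*512 output size for the wide-view crop is an assumption made so that both views end up at the same resolution.

```python
import numpy as np

def crop_far_view(image, left=0, top=0, out_w=768, out_h=512):
    """Crop from the top-left corner, starting just inside the black border.

    image is an (H, W, 3) array; `left` and `top` are the assumed border widths.
    """
    return image[top:top + out_h, left:left + out_w]

def crop_wide_view(image, out_w=768, out_h=512):
    """Extract the middle area of the image."""
    h, w = image.shape[:2]
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return image[top:top + out_h, left:left + out_w]
```

For a 1024*768 input, the centered crop starts at row 128 and column 128.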
[0068] It is worth mentioning that during the training of the image segmentation model, data processing techniques such as random flipping, slight scaling at multiple scales while maintaining the original proportions, and padding at arbitrary positions are used to ensure the diversity of data between each epoch.
[0069] Optionally, the three cropped images of different scenes in Figure 4 are input into the trained image segmentation model as the images to be segmented. As shown in Figure 5, the image segmentation model processes the image to be segmented as follows:
[0070] First, the Backbone network performs a first-stage image feature extraction on the image to be segmented, and the bidirectional FPN network performs a second-stage image feature extraction, obtaining multi-scale image features. These multi-scale image features are then duplicated into two copies: one copy serves as the first instance feature, and the other copy, after convolution, yields the first semantic feature. Next, the first instance feature and the first semantic feature are input into the first attention module (STI-Module). The STI-Module fuses the first instance feature and the first semantic feature to enhance the first instance feature through the first semantic feature, resulting in the third instance feature. The third instance feature is then input into the RPN network. The RPN network extracts the candidate bounding box information (bbox) and target class corresponding to the third instance feature, resulting in the fourth instance feature, which includes the third instance feature, the corresponding candidate bounding box information, and the target class. Next, the fourth instance feature is input into Roi-Align and the third attention mechanism module (ITS-Module1). Roi-Align normalizes the candidate bounding box size in the fourth instance feature, and multiple convolutional layers then perform convolution processing on the first instance feature and the fourth instance feature to optimize the candidate bounding box information and target class, resulting in the second instance feature. Simultaneously, ITS-Module1 fuses the first semantic feature and the fourth instance feature to enhance the first semantic feature through the fourth instance feature, resulting in the instance-enhanced first semantic feature. Then, the second instance feature is input into RoiUpsample, which upsamples it and restores the candidate box size to its pre-normalization size. The second instance feature is then input into the fourth attention mechanism module (ITS-Module2). Simultaneously, the instance-enhanced first semantic feature, after undergoing multiple convolutions, is also input into ITS-Module2. ITS-Module2 fuses the instance-enhanced first semantic feature and the second instance feature to enhance the instance-enhanced first semantic feature through the second instance feature, resulting in the second semantic feature. Finally, the second semantic feature can be input into the semantic segmentation head to obtain a semantic mask, and the semantic segmentation result corresponding to the semantic mask is output (as shown in (a), (b), and (c) of Figure 6). Further, the second instance feature can be input into the instance segmentation head to obtain the instance mask. The instance mask and semantic mask are then input into the panoramic fusion block for fusion, and the panoramic segmentation result is output (as shown in (d), (e), and (f) of Figure 6). Whether to output the semantic segmentation result and / or the panoramic segmentation result can be selected according to actual needs.
[0071] As shown in Figure 6 (a), (b), and (c), the trained image segmentation model successfully performed semantic segmentation of vehicles, pedestrians, sky, streetlights, road signs, lane lines, and other semantic elements in images to be segmented from different scenes. As shown in Figure 6 (d), (e), and (f), building on this semantic segmentation, the trained model also successfully performed instance segmentation of the individual vehicles and pedestrians in those images. That is, the trained image segmentation model successfully performed panoptic segmentation of the images to be segmented in different scenes, with high segmentation accuracy on the edge details of foreground objects such as vehicles, pedestrians, and streetlights.
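As an illustration of how an instance mask and a semantic mask might be combined into a single panoptic segmentation result such as those in Figure 6 (d), (e), and (f), consider the following minimal Python sketch. This section of the application does not specify the fusion rule; the score-ordered pasting rule, the function name fuse_panoptic, and the data layout are all assumptions made for illustration.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of per-pixel integers


def fuse_panoptic(
    semantic: Mask, instances: List[Tuple[int, float, Mask]]
) -> List[List[Tuple[int, int]]]:
    """Combine a semantic mask and instance masks into (class_id, instance_id) pairs.

    `instances` holds (class_id, score, binary mask) triples. Instance ids follow
    the input order and start at 1. Pixels not covered by any instance keep their
    semantic class with instance id 0.
    """
    height, width = len(semantic), len(semantic[0])
    panoptic = [[(semantic[y][x], 0) for x in range(width)] for y in range(height)]

    # Assumed rule: higher-scoring instances claim contested pixels first.
    ranked = sorted(enumerate(instances, start=1), key=lambda item: -item[1][1])
    for instance_id, (class_id, _score, mask) in ranked:
        for y in range(height):
            for x in range(width):
                if mask[y][x] and panoptic[y][x][1] == 0:
                    panoptic[y][x] = (class_id, instance_id)
    return panoptic
```

Each output pixel carries both a semantic category and an instance identifier, which matches the description of panoptic segmentation in paragraph [0002].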
[0072] The image segmentation method provided in Embodiment 1 of this application is based on a top-down (i.e., detection before segmentation) image segmentation framework. It employs multiple attention mechanisms across multiple stages to selectively enhance instance features with useful information from semantic features, and semantic features with useful information from instance features. This bidirectional enhancement and fusion of semantic and instance features can simultaneously improve the accuracy of semantic segmentation and instance segmentation, thereby improving the accuracy and robustness of panoptic segmentation, especially in complex scenes (such as urban scenes with many targets) and adverse conditions (such as night or fog), and enhancing driving safety.
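The bidirectional enhancement summarized above can be illustrated with a small numeric sketch. This section does not give the internal form of the attention modules, so the gate used here (a sigmoid of the scaled dot product between the two features at each position) is an assumed stand-in, and the names attention_enhance and bidirectional_fusion are hypothetical. The sketch only shows the direction of information flow: semantic features first enhance instance features, and the enhanced instance features then enhance semantic features.

```python
import math
from typing import List, Tuple

Feature = List[List[float]]  # one channel vector per spatial position


def attention_enhance(target: Feature, source: Feature) -> Feature:
    """Enhance `target` with `source`, weighted by a per-position gate in (0, 1)."""
    enhanced: Feature = []
    for target_vec, source_vec in zip(target, source):
        # Assumed attention score: scaled dot product of the two channel vectors.
        score = sum(t * s for t, s in zip(target_vec, source_vec))
        score /= math.sqrt(len(target_vec))
        # Numerically stable sigmoid written with tanh.
        gate = 0.5 * (1.0 + math.tanh(score / 2.0))
        # Residual fusion: keep the target and add the gated source.
        enhanced.append([t + gate * s for t, s in zip(target_vec, source_vec)])
    return enhanced


def bidirectional_fusion(instance: Feature, semantic: Feature) -> Tuple[Feature, Feature]:
    """Semantic-to-instance enhancement first, then instance-to-semantic."""
    enhanced_instance = attention_enhance(instance, semantic)
    enhanced_semantic = attention_enhance(semantic, enhanced_instance)
    return enhanced_instance, enhanced_semantic
```

Because the gate depends on how well the two features agree at each position, positions where they agree receive more of the source feature, which is one simple way to realize selective enhancement.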
[0073] Figure 7 shows an image segmentation apparatus provided in Embodiment 2 of this application. The image segmentation apparatus includes a feature extraction module, a feature fusion module, and an image segmentation module.
[0074] The feature extraction module is used to extract image features of the image to be segmented in order to determine the instance features and semantic features of the image to be segmented.
[0075] The feature fusion module is used to fuse instance features and semantic features based on an attention mechanism to determine the fusion result in which instance features and semantic features mutually enhance each other.
[0076] The image segmentation module is used to segment the image to be segmented based on the fusion result of the mutual enhancement of instance features and semantic features, and to determine the segmentation result of the image to be segmented.
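The cooperation of the three modules can be sketched as a simple composition. The following Python sketch is illustrative only; the class name ImageSegmentationApparatus, its field names, and the callable signatures are assumptions and are not defined in this application.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class ImageSegmentationApparatus:
    # Feature extraction module: image -> (instance features, semantic features).
    extract: Callable[[Any], Tuple[Any, Any]]
    # Feature fusion module: returns the mutually enhanced feature pair.
    fuse: Callable[[Any, Any], Tuple[Any, Any]]
    # Image segmentation module: enhanced pair -> segmentation result.
    segment: Callable[[Any, Any], Any]

    def run(self, image: Any) -> Any:
        """Pass the image through the three modules in order."""
        instance, semantic = self.extract(image)
        enhanced_instance, enhanced_semantic = self.fuse(instance, semantic)
        return self.segment(enhanced_instance, enhanced_semantic)
```

Keeping the modules as interchangeable callables mirrors the modular description of Embodiment 2 without fixing any particular network implementation.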
[0077] The specific implementation principle of this embodiment is the same as that in Embodiment 1, and will not be repeated here.
[0078] The image segmentation apparatus provided in Embodiment 2 of this application achieves mutual enhancement and fusion of instance features and semantic features based on an attention mechanism through the interaction between the feature extraction module, the feature fusion module, and the image segmentation module. Based on the fusion result of the mutual enhancement of instance features and semantic features, the apparatus segments the image to be segmented, thereby improving the accuracy and robustness of image segmentation and enhancing driving safety.
[0079] This application also provides a vehicle that includes the above-described image segmentation apparatus.
[0080] This application also provides a storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described image segmentation method.
[0081] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0082] In this document, the terms “comprising,” “including,” or any other variations thereof are intended to cover non-exclusive inclusion, which includes not only the elements listed but also other elements not expressly listed.
[0083] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. An image segmentation method, characterized in that it comprises: extracting image features from the image to be segmented to determine the instance features and semantic features of the image to be segmented; fusing, based on an attention mechanism, the instance features and the semantic features to determine the fusion result in which the instance features and the semantic features mutually enhance each other; and segmenting, based on the fusion result of the mutual enhancement between the instance features and the semantic features, the image to be segmented to determine the segmentation result of the image to be segmented; wherein the step of fusing the instance features and the semantic features based on the attention mechanism to determine the fusion result in which the instance features and the semantic features mutually enhance each other includes: fusing, based on a first attention mechanism, the first instance feature and the first semantic feature of the image to be segmented, so as to enhance the first instance feature through the first semantic feature and determine the second instance feature; fusing, based on a second attention mechanism, the first semantic feature and the second instance feature, so as to enhance the first semantic feature through the second instance feature and determine the second semantic feature; and obtaining the fusion result.
2. The method as described in claim 1, characterized in that the step of fusing the first instance feature and the first semantic feature of the image to be segmented based on the first attention mechanism, so as to enhance the first instance feature through the first semantic feature and determine the second instance feature, includes: fusing, based on the first attention mechanism, the first instance feature and the first semantic feature of the image to be segmented, so as to enhance the first instance feature through the first semantic feature and determine the third instance feature; extracting, based on the region candidate box network, the candidate box information and target category corresponding to the third instance feature to determine the fourth instance feature; and performing convolution processing on the first instance feature and the fourth instance feature to optimize the candidate box information and the target category and determine the second instance feature; wherein the fourth instance feature includes the third instance feature and the candidate box information and target category corresponding to the third instance feature.
3. The method as described in claim 2, characterized in that the second attention mechanism includes a third attention mechanism and a fourth attention mechanism; and the step of fusing the first semantic feature and the second instance feature based on the second attention mechanism, so as to enhance the first semantic feature through the second instance feature and determine the second semantic feature, includes: obtaining the fourth instance feature; fusing, based on the third attention mechanism, the first semantic feature and the fourth instance feature, so as to enhance the first semantic feature through the fourth instance feature and determine the instance-enhanced first semantic feature; and fusing, based on the fourth attention mechanism, the instance-enhanced first semantic feature and the second instance feature, so as to enhance the instance-enhanced first semantic feature through the second instance feature and determine the second semantic feature.
4. The method as described in claim 3, characterized in that the candidate box information includes the candidate box size; before the convolution processing is performed on the first instance feature and the fourth instance feature, the method includes: normalizing the candidate box size; and before the instance-enhanced first semantic feature and the second instance feature are fused based on the fourth attention mechanism, the method includes: upsampling the second instance feature to restore the candidate box size to its size before normalization.
5. The method according to any one of claims 1-4, characterized in that the step of segmenting the image to be segmented based on the fusion result of the instance features and the semantic features to determine the segmentation result of the image to be segmented includes: performing, based on the second semantic feature, semantic segmentation on the image to be segmented to determine the semantic segmentation result of the image to be segmented.
6. The method according to any one of claims 1-4, characterized in that the step of segmenting the image to be segmented based on the fusion result of the instance features and the semantic features to determine the segmentation result of the image to be segmented includes: performing, based on the second instance feature, instance segmentation on the image to be segmented to determine the instance segmentation result of the image to be segmented; performing, based on the second semantic feature, semantic segmentation on the image to be segmented to determine the semantic segmentation result of the image to be segmented; and fusing the semantic segmentation result and the instance segmentation result to determine the panoptic segmentation result of the image to be segmented.
7. An image segmentation apparatus, characterized in that the image segmentation apparatus includes a feature extraction module, a feature fusion module, and an image segmentation module; the feature extraction module is used to extract image features of the image to be segmented in order to determine the instance features and semantic features of the image to be segmented; the feature fusion module is used to fuse the instance features and the semantic features based on an attention mechanism to determine the fusion result in which the instance features and the semantic features mutually enhance each other, which includes: fusing, based on a first attention mechanism, the first instance feature and the first semantic feature of the image to be segmented, so as to enhance the first instance feature through the first semantic feature and determine the second instance feature; fusing, based on a second attention mechanism, the first semantic feature and the second instance feature, so as to enhance the first semantic feature through the second instance feature and determine the second semantic feature; and obtaining the fusion result; and the image segmentation module is used to segment the image to be segmented based on the fusion result of the mutual enhancement of the instance features and the semantic features, and to determine the segmentation result of the image to be segmented.
8. A vehicle, characterized in that the vehicle includes the image segmentation apparatus as described in claim 7.
9. A storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 6.