Image segmentation model training, image segmentation method, device and electronic equipment
By using a lightweight feature extractor and transposed convolution to restore feature map resolution, this method solves the problems of low efficiency and poor accuracy in existing image segmentation methods, and achieves efficient and high-precision image segmentation on computing-limited devices.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- AGRICULTURAL BANK OF CHINA
- Filing Date
- 2022-09-30
- Publication Date
- 2026-06-23
AI Technical Summary
Existing image segmentation methods are inefficient and inaccurate, especially on embedded devices with limited computing resources, where real-time and high-precision image segmentation is difficult to achieve.
A lightweight feature extractor is used to extract features from the sample images. M groups of semantic segmentation feature maps are fused through a feature fusion sub-model. The resolution of the fused feature maps is restored to the original image resolution by using a transposed convolutional layer, thereby reducing information loss.
It improves the efficiency and accuracy of image segmentation, is suitable for devices with limited computing resources, and achieves higher real-time performance and accuracy.
Figure CN115471661B_ABST
Abstract
Description
Technical Field
[0001] This application relates to computer vision technology, and more particularly to an image segmentation model training method, an image segmentation method, an apparatus, and an electronic device.
Background Technology
[0002] Image semantic segmentation technology is of great significance to computer vision fields such as autonomous driving, wearable devices, and intelligent security inspection. Existing image semantic segmentation methods mainly work as follows: first, a complex deep convolutional neural network structure extracts semantic features from the image to obtain a low-resolution feature image; then, an interpolation method restores the size of this low-resolution feature image to obtain a segmentation result with the same size as the original image. However, existing image segmentation methods suffer from low efficiency and poor accuracy.
Summary of the Invention
[0003] This application provides an image segmentation model training method, an image segmentation method, an apparatus, and an electronic device to improve the efficiency and accuracy of image segmentation.
[0004] In a first aspect, this application provides an image segmentation model training method, the method comprising:
[0005] Obtain a sample image set; the sample image set includes: at least one sample image, and a label corresponding to each sample image; the label is used to characterize the object to be segmented in the sample image;
[0006] Based on the sample images in the sample image set and the label corresponding to each sample image, the neural network model is trained for N rounds to obtain a trained image segmentation model; the neural network model includes: a lightweight feature extraction sub-model, a feature fusion sub-model, and a feature restoration sub-model, wherein the i-th round of training includes:
[0007] Features are extracted from the sample image through M lightweight feature extractors in the lightweight feature extraction sub-model to obtain M sets of semantic segmentation feature maps; where N≥1 and M≥2;
[0008] The feature fusion sub-model is used to fuse the M groups of semantic segmentation feature maps to obtain a fused feature map; the resolution of the fused feature map is smaller than the resolution of the sample image;
[0009] Through at least one transposed convolutional layer in the feature restoration sub-model, the resolution of the fused feature map is increased to the resolution of the sample image to obtain the predicted segmentation result;
[0010] Based on the predicted segmentation results and the labels corresponding to the sample images, the neural network model is trained for the (i+1)th round.
[0011] Optionally, the lightweight feature extractor includes: at least one multi-channel convolutional layer, a first preset number of convolutional layers, and a second preset number of dilated convolutional layers; the first preset number is greater than the second preset number; the second preset number of dilated convolutional layers are interspersed among the first preset number of convolutional layers.
[0012] The step of extracting features from the sample image using M lightweight feature extractors in the lightweight feature extraction sub-model to obtain M sets of semantic segmentation feature maps includes:
[0013] For the j-th lightweight feature extractor, features are extracted from multiple color channels of the (j-1)-th group of semantic segmentation feature maps through the at least one multi-channel convolutional layer to obtain a sub-semantic segmentation feature map; where j is greater than 1 and less than or equal to M; the input of the at least one multi-channel convolutional layer of the first lightweight feature extractor is the sample image;
[0014] Features are extracted from the sub-semantic segmentation feature map by the convolutional layers and the dilated convolutional layers to obtain the j-th group of semantic segmentation feature maps.
[0015] Optionally, M equals 3, the resolution of the first group of semantic segmentation feature maps is greater than the resolution of the second group of semantic segmentation feature maps, and the resolution of the second group of semantic segmentation feature maps is greater than the resolution of the third group of semantic segmentation feature maps. The step of fusing the M groups of semantic segmentation feature maps using the feature fusion sub-model to obtain a fused feature map includes:
[0016] A transposed convolution is applied to the third group of semantic segmentation feature maps to obtain the intermediate feature map corresponding to the third group; the resolution of the intermediate feature map corresponding to the third group is the same as the resolution of the second group of semantic segmentation feature maps;
[0017] The intermediate feature map corresponding to the third group is fused with the second group of semantic segmentation feature maps to obtain a first fused feature map; the resolution of the first fused feature map is the same as the resolution of the second group of semantic segmentation feature maps;
[0018] A transposed convolution is applied to the first fused feature map to obtain a resolution-increased first fused feature map, whose resolution is the same as the resolution of the first group of semantic segmentation feature maps;
[0019] The resolution-increased first fused feature map is fused with the first group of semantic segmentation feature maps to obtain the fused feature map.
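The M = 3 fusion order described above can be illustrated with a minimal sketch on single-channel feature maps stored as nested lists. Several choices here are assumptions made only for illustration and are not stated in this application: each group is taken to have half the resolution of the previous one, the transposed convolution uses a fixed 2*2 kernel with stride 2 (in a trained model these weights would be learned), and element-wise addition stands in for the fusion operation. The names `upsample2x`, `add_maps`, and `fuse_three` are hypothetical.

```python
def upsample2x(fmap, kernel=((1.0, 1.0), (1.0, 1.0))):
    """2-D transposed convolution with stride 2 and a 2*2 kernel.

    The kernel is fixed here for illustration; it would be learned in practice.
    """
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in range(h):
        for c in range(w):
            for i in range(2):
                for j in range(2):
                    out[2 * r + i][2 * c + j] += fmap[r][c] * kernel[i][j]
    return out


def add_maps(a, b):
    """Element-wise addition, used here as an assumed stand-in for fusion."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fuse_three(group1, group2, group3):
    """Fuse three feature maps ordered from highest to lowest resolution."""
    intermediate = upsample2x(group3)             # now at group2's resolution
    first_fused = add_maps(intermediate, group2)  # first fused feature map
    first_fused_up = upsample2x(first_fused)      # now at group1's resolution
    return add_maps(first_fused_up, group1)       # fused feature map


if __name__ == "__main__":
    g3 = [[1.0]]
    g2 = [[1.0, 1.0], [1.0, 1.0]]
    g1 = [[0.0] * 4 for _ in range(4)]
    fused = fuse_three(g1, g2, g3)
    print(len(fused), len(fused[0]))  # resolution of the fused map
```

The fused map ends up at the resolution of the first group, which is still lower than the sample image resolution, consistent with the statement that a further transposed convolution is needed to restore the original size.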
[0020] Optionally, the step of fusing the resolution-increased first fused feature map with the first group of semantic segmentation feature maps to obtain the fused feature map includes:
[0021] The resolution-increased first fused feature map is fused with the first group of semantic segmentation feature maps to obtain an initial fused feature map;
[0022] Spatial dimension convolution is performed on the initial fused feature map to obtain an initial fused feature map with changed spatial dimensions;
[0023] Semantic features are extracted from the initial fused feature map with changed spatial dimensions to obtain the corresponding image semantic feature map;
[0024] Spatial dimension convolution and semantic feature extraction are performed on that image semantic feature map to obtain the fused feature map.
[0025] In a second aspect, this application provides an image segmentation method, the method comprising:
[0026] Receive a target image; the target image includes the target object to be segmented.
[0027] The target image is input into a trained image segmentation model to obtain an image segmentation result corresponding to the target image; the trained image segmentation model is trained using the method described in any implementation of the first aspect; in the image segmentation result, the pixel value corresponding to the target object is different from the pixel value of the region other than the target object.
[0028] In a third aspect, this application provides an image segmentation model training apparatus, the apparatus comprising:
[0029] An acquisition module is used to acquire a sample image set; the sample image set includes: at least one sample image, and a label corresponding to each sample image; the label is used to characterize the object to be segmented in the sample image;
[0030] The training module is used to train the neural network model for N rounds based on the sample images in the sample image set and the label corresponding to each sample image, to obtain a trained image segmentation model; the neural network model includes: a lightweight feature extraction sub-model, a feature fusion sub-model, and a feature restoration sub-model, wherein, in the i-th round of training, the training module is specifically used for:
[0031] Extracting features from the sample image through M lightweight feature extractors in the lightweight feature extraction sub-model to obtain M sets of semantic segmentation feature maps; where N≥1 and M≥2;
[0032] Fusing the M groups of semantic segmentation feature maps using the feature fusion sub-model to obtain a fused feature map; the resolution of the fused feature map is smaller than the resolution of the sample image;
[0033] Increasing the resolution of the fused feature map to the resolution of the sample image through at least one transposed convolutional layer in the feature restoration sub-model to obtain the predicted segmentation result;
[0034] Based on the predicted segmentation results and the labels corresponding to the sample images, the neural network model is trained for the (i+1)th round.
[0035] In a fourth aspect, this application provides an image segmentation apparatus, the apparatus comprising:
[0036] A receiving module is used to receive a target image; the target image includes the target object to be segmented.
[0037] The processing module is used to input the target image into a trained image segmentation model to obtain an image segmentation result corresponding to the target image; the trained image segmentation model is trained using the method described in any implementation of the first aspect; in the image segmentation result, the pixel value corresponding to the target object is different from the pixel value of the region other than the target object.
[0038] In a fifth aspect, this application provides an electronic device, which includes a memory and a processor;
[0039] The memory stores a computer program;
[0040] The processor is configured to perform the method described in either the first or second aspect via the computer program.
[0041] In a sixth aspect, this application provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the method described in either the first or second aspect.
[0042] In a seventh aspect, this application provides a computer program product, including a computer program that, when executed by a processor, implements the method described in either the first or second aspect.
[0043] The image segmentation model training method, image segmentation method, apparatus, and electronic device provided in this application extract features from sample images using lightweight feature extractors, which reduces the complexity of the feature extraction network and thus improves the efficiency of training on the sample images and, in turn, the efficiency of image segmentation based on the trained image segmentation model. The feature fusion sub-model fuses the M sets of semantic segmentation feature maps, so the fused feature map incorporates all M sets. Compared with existing methods that use only the last semantic segmentation feature map for feature restoration, this fused feature map includes semantic segmentation features from more levels, which improves the accuracy of training the image segmentation model on this fused feature map and, consequently, the accuracy of image segmentation based on the trained image segmentation model. At least one transposed convolutional layer increases the resolution of the fused feature map to the resolution of the sample image. Compared with traditional interpolation methods, the parameters of the transposed convolution are learned iteratively from gradients and can capture the texture features of the pixels neighboring the pixel to be restored, thereby generating upsampling results with better semantic continuity. This reduces the information loss incurred in the image feature extraction stage, further improving the accuracy of image segmentation model training and of image segmentation based on the trained image segmentation model.
Attached Figure Description
[0044] To more clearly illustrate the technical solutions in this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0045] Figure 1 is a schematic diagram of image semantic segmentation;
[0046] Figure 2 is a flowchart of an image segmentation model training method provided in this application;
[0047] Figure 3 is a schematic diagram of the architecture of the neural network model provided in this application;
[0048] Figure 4 is a schematic diagram of the architecture of a feature extractor provided in this application;
[0049] Figure 5 is a schematic diagram of the architecture of a feature fusion device provided in this application;
[0050] Figure 6 is a schematic diagram of the architecture of a feature restorer provided in this application;
[0051] Figure 7 is a schematic diagram of the architecture of a deep learning trainer provided in this application;
[0052] Figure 8 is a flowchart of an image segmentation method provided in this application;
[0053] Figure 9 is a schematic diagram of the architecture of a trained image segmentation model provided in this application;
[0054] Figure 10 is a schematic diagram of the architecture of a semantic segmenter provided in this application;
[0055] Figure 11 is a schematic diagram of a semantic segmentation result image provided in this application;
[0056] Figure 12 is a schematic diagram of the structure of an image segmentation model training device provided in this application;
[0057] Figure 13 is a schematic diagram of the structure of an image segmentation device provided in this application;
[0058] Figure 14 is a schematic diagram of the structure of an electronic device provided in this application.
[0059] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0060] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0061] The following is a brief explanation of some of the terms and concepts used in this application:
[0062] Image semantic segmentation: Image semantic segmentation is a technique and process that divides an image into several specific regions with unique properties. For example, Figure 1 is a schematic diagram of image semantic segmentation. As shown in Figure 1, semantic segmentation of the original image on the left separates the dog and the cat from the other objects in the image, yielding the image segmentation result on the right.
[0063] Image classification networks: Image classification networks are predictive networks implemented using deep learning neural networks for image classification. Their input is the original image data, and their output is the target category corresponding to that image. The difference between image classification networks and semantic segmentation is that semantic segmentation outputs the target category pixel-by-pixel in the original image, while image classification networks output the category to which the entire image belongs.
[0064] Convolution (standard convolution): In deep learning, convolution can be viewed as a process in which a k*k convolution kernel slides across an f*f input image at a given stride, performing convolution calculations to generate an output feature map G. Typically, the size of the generated feature map G is smaller than that of the input image. It should be understood that image size and image resolution are the same concept in this application.
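The size relationship in the convolution definition can be checked with the standard output-size formula, out = floor((f + 2p - k) / s) + 1. The following is a minimal sketch for a square single-channel input stored as nested lists; the function names and the `padding` parameter are illustrative assumptions and do not appear in this application.

```python
def conv_output_size(f: int, k: int, stride: int = 1, padding: int = 0) -> int:
    """Side length of the feature map produced by a k*k kernel on an f*f input."""
    if f + 2 * padding < k:
        raise ValueError("kernel does not fit inside the (padded) input")
    return (f + 2 * padding - k) // stride + 1


def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding and return feature map G."""
    f, k = len(image), len(kernel)
    out = conv_output_size(f, k, stride)
    return [
        [
            sum(
                image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            for c in range(out)
        ]
        for r in range(out)
    ]


if __name__ == "__main__":
    image = [[1] * 4 for _ in range(4)]
    kernel = [[1] * 3 for _ in range(3)]
    print(conv2d_valid(image, kernel))
```

For instance, a 3*3 kernel over a 7*7 input with stride 1 and no padding gives a 5*5 feature map, which is smaller than the input as stated above.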
[0065] Dilated convolution: refers to the technique of filling the convolution kernel of standard convolution with zero elements to make the effective convolution kernel larger, thereby making the feature map produced after the convolution operation smaller.
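The zero-filling in the dilated convolution definition can be made concrete with a short sketch. A k*k kernel with dilation rate d covers an effective window of k + (k - 1)(d - 1) pixels per side. The names `effective_kernel_size` and `dilate_kernel` and the dilation-rate parameter are assumptions introduced for illustration.

```python
def effective_kernel_size(k: int, dilation: int) -> int:
    """Side length of the window covered by a k*k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def dilate_kernel(kernel, dilation):
    """Insert (dilation - 1) zeros between neighbouring kernel elements."""
    k = len(kernel)
    size = effective_kernel_size(k, dilation)
    dilated = [[0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            dilated[i * dilation][j * dilation] = kernel[i][j]
    return dilated


if __name__ == "__main__":
    for row in dilate_kernel([[1, 2], [3, 4]], dilation=2):
        print(row)
```

Without padding, a 3*3 kernel with dilation 2 behaves like a 5*5 kernel, so a 7*7 input yields a 3*3 output instead of the 5*5 output of the undilated kernel.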
[0066] Transposed convolution: This is a type of convolution operation that differs from ordinary convolution in that the size of its input feature map is smaller than the size of its output feature map.
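Why the output of a transposed convolution is larger than its input can be seen in one dimension: each input value scatters a scaled copy of the kernel into the output, and the copies are placed `stride` positions apart. The sketch below is illustrative only; the function names are hypothetical, and the kernel weights are fixed here although they would be learned in a trained model.

```python
def transposed_conv_output_size(n: int, k: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a transposed convolution on an input of length n."""
    return (n - 1) * stride - 2 * padding + k


def transposed_conv1d(signal, kernel, stride=1):
    """1-D transposed convolution without padding."""
    out = [0.0] * transposed_conv_output_size(len(signal), len(kernel), stride)
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            # Each input value adds a scaled copy of the kernel to the output.
            out[i * stride + j] += value * weight
    return out


if __name__ == "__main__":
    print(transposed_conv1d([1, 2], [1, 1], stride=2))
```

With a kernel of length 2 and stride 2, the output is exactly twice as long as the input, which is the doubling behaviour commonly used for upsampling.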
[0067] Receptive field: In biology, the receptive field refers to the region of stimulation to which a neuron responds when a sensory organ is stimulated. This region transmits nerve impulses (various sensory information) to the higher-level central nervous system via afferent neurons. The size and nature of the receptive field vary depending on the type of sensation. In deep learning, the receptive field can be understood as the size of the region of the input image that each pixel in the output feature map maps to. The larger this region, the more pixels of the original image contribute to each pixel of the output feature map, and the better that pixel represents the original image's information.
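The deep-learning notion of receptive field can be computed layer by layer with the usual recurrence: each layer enlarges the receptive field by (effective kernel size - 1) times the accumulated stride of the preceding layers. The sketch below assumes layers described by (kernel size, stride, dilation) tuples; this representation and the function name `receptive_field` are illustrative assumptions.

```python
def receptive_field(layers):
    """Receptive field (per side, in input pixels) of one output pixel.

    `layers` is a sequence of (kernel_size, stride, dilation) tuples listed
    from the input towards the output.
    """
    rf, jump = 1, 1  # jump: distance in input pixels between adjacent outputs
    for k, stride, dilation in layers:
        k_eff = k + (k - 1) * (dilation - 1)
        rf += (k_eff - 1) * jump
        jump *= stride
    return rf


if __name__ == "__main__":
    print(receptive_field([(3, 1, 1), (3, 1, 1)]))  # two standard 3*3 layers
    print(receptive_field([(3, 1, 1), (3, 1, 2)]))  # second layer dilated by 2
```

Two stacked 3*3 standard convolutions see a 5*5 input region; giving the second layer a dilation of 2 enlarges this to 7*7 without adding parameters.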
[0068] Image semantic segmentation technology is of great significance to computer vision fields such as autonomous driving and wearable devices. Currently, common image semantic segmentation methods are based on convolutional neural networks.
[0069] In image semantic segmentation based on convolutional neural networks (CNNs), deep CNNs can be used to extract semantic features from the original image, resulting in a low-resolution feature image. This low-resolution image is then resized to obtain a segmentation result with the same size as the original image. Currently, complex feature extraction network structures are used to extract features from the original image, thus ensuring the accuracy of the network's image semantic segmentation. Furthermore, when resizing the low-resolution image, interpolation methods are currently used to restore the low-resolution feature image to the original image size.
[0070] However, complex feature extraction network structures result in high computational costs, low image segmentation efficiency, and an inability to achieve real-time performance. Therefore, they are unsuitable for real-time semantic segmentation applications with limited computing resources, such as embedded devices. Furthermore, the interpolation process described above, which directly uses the values of adjacent pixels to fill the image, fails to effectively compensate for the information loss caused during the image feature extraction stage, leading to poor image segmentation accuracy. Therefore, existing image segmentation methods suffer from both low efficiency and poor accuracy.
[0071] Considering the aforementioned problems with existing image segmentation methods, this application proposes a method that extracts features from the original image using a lightweight convolutional neural network (the different wordings for "lightweight" used in this application refer to the same concept) and restores the feature image to the original image size using transposed convolution. Using a lightweight convolutional neural network for feature extraction eliminates the need for complex network structures, thus improving image segmentation efficiency. Restoring the feature image to the original image size using transposed convolution reduces the information loss incurred during the feature extraction stage, thereby improving image segmentation accuracy.
[0072] This application first proposes an image segmentation model training method to obtain a trained image segmentation model. Then, the trained image segmentation model is used to segment the original image.
[0073] The image segmentation model training method provided in this application will be described in detail below with reference to specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The executing entity of this image segmentation model training method can be any electronic device with processing capabilities, such as a terminal or server.
[0074] Figure 2 is a flowchart of an image segmentation model training method provided in this application. As shown in Figure 2, the method includes the following steps:
[0075] S101. Obtain the sample image set.
[0076] The aforementioned sample image set may include: at least one sample image, and a label corresponding to each sample image. The label is used to characterize the object to be segmented in the sample image.
[0077] The objects to be segmented can be related to the function of the image segmentation model. For example, if the image segmentation model is used to segment cats and dogs in an image, and the original image shown in Figure 1 is taken as the sample image, the label corresponding to this sample image can be used to characterize the cat and dog in the sample image (that is, the cat and dog are the objects to be segmented in the sample image). If the image segmentation model is used to segment vehicles on a road, the objects to be segmented are the vehicles in the sample image.
[0078] It should be understood that this application does not limit the form of the aforementioned label. For example, the label may be a set of contour points that include the object to be segmented.
[0079] Optionally, the electronic device may receive the aforementioned sample image set input by the user through an application programming interface (API) or a graphical user interface (GUI).
[0080] S102. Based on the sample images in the sample image set and the label corresponding to each sample image, train the neural network model for N rounds (N≥1) to obtain a trained image segmentation model. This neural network model may include: a lightweight feature extraction sub-model, a feature fusion sub-model, and a feature restoration sub-model.
[0081] The trained image segmentation model described above can be used to perform semantic segmentation on the input target image to obtain the corresponding image segmentation result. In this image segmentation result, the pixel values corresponding to the target object are different from the pixel values of the regions other than the target object.
[0082] The training process for the i-th round (1 ≤ i ≤ N) includes the following steps: For any sample image, the electronic device first extracts features from the sample image using the M (M ≥ 2) lightweight feature extractors in the aforementioned lightweight feature extraction sub-model, obtaining M sets of semantic segmentation feature maps. For example, the electronic device can first extract features from the sample image using lightweight feature extractor 1, obtaining the first set of semantic segmentation feature maps. In general, the electronic device can input the j-th set of semantic segmentation feature maps into lightweight feature extractor j+1 and extract features from it using lightweight feature extractor j+1, obtaining the (j+1)-th set of semantic segmentation feature maps. Proceeding in this way, the electronic device obtains M sets of semantic segmentation feature maps using the M lightweight feature extractors.
[0083] After obtaining M sets of semantic segmentation feature maps, the electronic device can perform feature fusion on the M sets of semantic segmentation feature maps using the aforementioned feature fusion sub-model to obtain a fused feature map. The resolution of this fused feature map is smaller than the resolution of the sample image.
[0084] Then, the electronic device can increase the resolution of the fused feature map to the resolution of the sample image through at least one transposed convolutional layer in the aforementioned feature restoration sub-model to obtain the predicted segmentation result. As mentioned above, the size of the output feature map of a transposed convolution is larger than the size of its input feature map; therefore, the resolution of the fused feature map can be increased through this transposed convolutional layer.
[0085] For example, by adjusting the kernel size of the transposed convolution, or the number of transposed convolution layers, the resolution of the fused feature map can be increased to the resolution of the sample image. The electronic device can then use this fused feature map, with a resolution equal to that of the sample image, as the predicted segmentation result.
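The trade-off between kernel size and number of transposed convolutional layers can be sketched as a simple size calculation. The example assumes unpadded transposed convolutions whose output side length is (n - 1) * stride + kernel, and that every layer uses the same kernel and stride; these settings and the function name `upsampling_plan` are illustrative assumptions, not values specified in this application.

```python
def upsampling_plan(fused_size: int, target_size: int, kernel: int = 2, stride: int = 2):
    """Side lengths after each identical transposed convolutional layer.

    Returns the list of sizes from the fused feature map up to the target
    resolution, or raises ValueError if the target cannot be hit exactly.
    """
    sizes = [fused_size]
    while sizes[-1] < target_size:
        next_size = (sizes[-1] - 1) * stride + kernel
        if next_size <= sizes[-1]:
            raise ValueError("this kernel/stride combination does not enlarge the map")
        sizes.append(next_size)
    if sizes[-1] != target_size:
        raise ValueError("target resolution is not reachable with these settings")
    return sizes


if __name__ == "__main__":
    print(upsampling_plan(64, 256))                      # two layers, each doubling
    print(upsampling_plan(64, 256, kernel=4, stride=4))  # one layer, fourfold
```

Under these assumptions, a 64*64 fused feature map reaches a 256*256 sample image resolution either with two layers that each double the size or with a single layer using a larger kernel and stride.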
[0086] For example, taking M equal to 3, Figure 3 is a schematic diagram of the architecture of the neural network model provided in this application. As shown in Figure 3, for the i-th training round, the electronic device can input the sample image into lightweight feature extractor 1, and obtain the first set of semantic segmentation feature maps through lightweight feature extractor 1. Then, the electronic device can input the first set of semantic segmentation feature maps into lightweight feature extractor 2, and extract features from the first set of semantic segmentation feature maps through lightweight feature extractor 2, to obtain the second set of semantic segmentation feature maps. The electronic device can input the second set of semantic segmentation feature maps into lightweight feature extractor 3, and extract features from the second set of semantic segmentation feature maps through lightweight feature extractor 3, to obtain the third set of semantic segmentation feature maps.
[0087] Then, the electronic device can input all three sets of semantic segmentation feature maps into the feature fusion sub-model to obtain a fused feature map, which is then output to the feature restoration sub-model. The electronic device can then obtain the predicted segmentation result through this feature restoration sub-model.
[0088] After obtaining the predicted segmentation result of the i-th round, the electronic device can train the neural network model for the (i+1)-th round based on the predicted segmentation result and the label corresponding to the sample image. This process can be repeated to achieve N rounds of training for the neural network model.
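The data flow of one training round, and its repetition over N rounds, can be summarized in a structural sketch. The extractors, the fusion and restoration steps, the loss function, and the parameter update are passed in as plain callables because this application does not fix them; all function and parameter names below are hypothetical.

```python
def forward(sample, extractors, fuse, restore):
    """Chain the M extractors, fuse all M outputs, then restore the resolution."""
    feature_groups = []
    x = sample
    for extractor in extractors:  # the output of extractor j feeds extractor j+1
        x = extractor(x)
        feature_groups.append(x)
    fused = fuse(feature_groups)
    return restore(fused)         # predicted segmentation result


def train(samples, extractors, fuse, restore, loss_fn, update, n_rounds):
    """Run N training rounds over (image, label) pairs and return per-round loss."""
    history = []
    for _ in range(n_rounds):
        round_loss = 0.0
        for image, label in samples:
            prediction = forward(image, extractors, fuse, restore)
            loss = loss_fn(prediction, label)
            update(loss)          # placeholder for the gradient-based update
            round_loss += loss
        history.append(round_loss)
    return history


if __name__ == "__main__":
    # Toy stand-ins: a "feature map" is represented only by its resolution.
    halve = lambda r: r // 2
    result = forward(256, [halve, halve, halve], fuse=lambda g: g[0], restore=lambda r: r * 2)
    print(result)
```

In the toy usage, each stand-in extractor halves the resolution, the fused map takes the resolution of the first group, and the restoration step brings it back to the input resolution.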
[0089] It should be understood that this application does not limit the training parameters of the neural network model, such as the learning rate and batch size. Furthermore, it should be understood that this application does not limit whether the aforementioned neural network model includes other components. For example, the neural network model may also include components such as a loss function used to train the neural network model.
[0090] In this embodiment, lightweight feature extractors are used to extract features from the sample image, which reduces the complexity of the feature extraction network and thus improves both the efficiency of training on the sample images and the efficiency of image segmentation based on the trained image segmentation model. The feature fusion sub-model fuses the M sets of semantic segmentation feature maps, so the fused feature map incorporates all M sets. Compared with existing methods that use only the last semantic segmentation feature map for feature restoration, this fused feature map includes semantic segmentation features from more levels, which improves the accuracy of training the image segmentation model on this fused feature map and, consequently, the accuracy of image segmentation based on the trained model. At least one transposed convolutional layer increases the resolution of the fused feature map to the resolution of the sample image. Compared with traditional interpolation methods, the parameters of the transposed convolution are learned iteratively from gradients and can capture the texture features of the pixels neighboring the pixel to be restored, thus generating an upsampling result with better semantic continuity. This reduces the information loss incurred during the image feature extraction stage, further improving the accuracy of image segmentation model training and of image segmentation based on the trained model.
[0091] The following section details how electronic devices extract features from sample images using M lightweight feature extractors in a lightweight feature extraction sub-model, resulting in M sets of semantic segmentation feature maps:
[0092] As one possible implementation, the lightweight feature extractor may include: at least one multi-channel convolutional layer, a first preset number of convolutional layers, and a second preset number of dilated convolutional layers. The first preset number may be greater than the second preset number. That is, the number of dilated convolutional layers is less than the number of convolutional layers. In some embodiments, the convolutional layers may also be called standard convolutional layers. The second preset number of dilated convolutional layers may be interspersed among the first preset number of convolutional layers.
[0093] In this implementation, for the j-th lightweight feature extractor, the electronic device can first extract features from multiple color channels of the (j-1)-th semantic segmentation feature map using at least one multi-channel convolutional layer to obtain a sub-semantic segmentation feature map. Here, j is greater than 1 and less than or equal to M. The input to at least one multi-channel convolutional layer of the first lightweight feature extractor is the sample image. That is, for the first lightweight feature extractor, the electronic device can first extract features from multiple color channels of the sample image using at least one multi-channel convolutional layer to obtain a sub-semantic segmentation feature map.
[0094] Optionally, the aforementioned multiple color channels can be the RGB channels of an image. Optionally, the specific implementation of this multi-channel convolutional layer can refer to any existing multi-channel convolutional layer, which will not be elaborated here.
[0095] Then, the electronic device can extract features from the sub-semantic segmentation feature map through the above convolutional layer and dilated convolutional layer to obtain the j-th semantic segmentation feature map.
[0096] For example, taking a lightweight feature extractor that includes 3 convolutional layers and 2 dilated convolutional layers, the electronic device can input the aforementioned sub-semantic segmentation feature map into the first convolutional layer, and then input the output of the first convolutional layer into a pooling layer to obtain a pooling result. The electronic device can then input the pooling result into the second convolutional layer, feed the output of the second convolutional layer into the first dilated convolutional layer, feed the output of the first dilated convolutional layer into the third convolutional layer, feed the output of the third convolutional layer into the second dilated convolutional layer, and use the output of the second dilated convolutional layer as the j-th semantic segmentation feature map.
[0097] In this embodiment, a lightweight feature extractor is used for semantic feature extraction. It has low computational requirements and is suitable for application scenarios such as embedded devices with limited computing resources. By using dilated convolution, a larger receptive field can be obtained during feature extraction, which effectively mitigates the severe loss of image resolution during the feature extraction stage, thereby improving the accuracy of model training and the accuracy of image segmentation based on the trained image segmentation model.
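As an illustration of why dilated convolution enlarges the receptive field, the following minimal sketch computes the receptive field of the layer order described above (convolution, pooling, convolution, dilated convolution, convolution, dilated convolution). The kernel sizes, strides, and dilation rates are assumptions chosen for illustration; this application does not specify them.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of a layer stack.

    `layers` is a list of (kernel_size, stride, dilation) tuples,
    ordered from the input side to the output side.
    """
    rf, jump = 1, 1  # jump: spacing of adjacent outputs, in input pixels
    for kernel_size, stride, dilation in layers:
        rf += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


# Assumed settings (not taken from the application):
# conv, pool, conv, dilated conv, conv, dilated conv.
with_dilation = [(3, 1, 1), (2, 2, 1), (3, 1, 1), (3, 1, 2), (3, 1, 1), (3, 1, 2)]
# Same stack with every dilation rate reset to 1, for comparison.
without_dilation = [(k, s, 1) for k, s, _ in with_dilation]

rf_dilated = receptive_field(with_dilation)
rf_plain = receptive_field(without_dilation)
print(rf_dilated, rf_plain)
```

Under these assumed settings, the stack with dilated layers covers a wider input region than the same stack with ordinary convolutions, without adding any further downsampling.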
[0098] As another possible implementation, the number of convolutional layers in each lightweight feature extractor can also be the same as the number of dilated convolutional layers. That is, in this implementation, the lightweight feature extractor mentioned above may not include pooling layers, and all pooling layers can be replaced by dilated convolutional layers.
[0099] Taking an example where M equals 3, and the resolution of the first group of semantic segmentation feature maps is greater than that of the second group, and the resolution of the second group is greater than that of the third group, the following details how electronic devices fuse M groups of semantic segmentation feature maps using a feature fusion sub-model to obtain a fused feature map:
[0100] For example, taking the neural network model shown in Figure 3 as an example, the first set of semantic segmentation feature maps can be the output of lightweight feature extractor 1. The second set of semantic segmentation feature maps can be the output of lightweight feature extractor 2. The third set of semantic segmentation feature maps can be the output of lightweight feature extractor 3.
[0101] As one possible implementation, the electronic device can first perform a transposed convolution on the third group of semantic segmentation feature maps to obtain the corresponding intermediate feature map for the third group. This transposed convolution increases the resolution of the third group of semantic segmentation feature maps. The resolution of this intermediate feature map corresponding to the third group is the same as the resolution of the second group of semantic segmentation feature maps.
[0102] Then, the electronic device can fuse the intermediate feature map corresponding to the third group and the semantic segmentation feature map of the second group to obtain a first fused feature map. The resolution of the first fused feature map is the same as that of the semantic segmentation feature map of the second group. In other words, the fusion itself does not change the resolution: the first fused feature map keeps the resolution shared by its two inputs.
[0103] After obtaining the first fused feature map, the electronic device can perform a transposed convolution on the first fused feature map to obtain a first fused feature map with increased resolution. The resolution of this first fused feature map with increased resolution is the same as the resolution of the first group of semantic segmentation feature maps.
[0104] Then, the electronic device can perform feature fusion on the first fused feature map with increased resolution and the first group of semantic segmentation feature maps to obtain a fused feature map. The above method achieves the fusion of the first group of semantic segmentation feature maps, the second group of semantic segmentation feature maps, and the third group of semantic segmentation feature maps into a single fused feature map.
[0105] It should be understood that this application does not limit the specific implementation method of the feature fusion step performed by the electronic device. For example, the electronic device can add the intermediate feature map corresponding to the third group to the semantic segmentation feature map of the second group to achieve feature fusion and obtain a first fused feature map. The electronic device can add the first fused feature map with increased resolution to the semantic segmentation feature map of the first group to achieve feature fusion and obtain a fused feature map.
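The cascaded fusion described above can be sketched as follows, assuming single-channel feature maps, a stride-2 transposed convolution with a 2×2 kernel, and element-wise addition as the fusion operation. The function names, kernel values, and map sizes are hypothetical; in a real model the kernel weights would be learned.

```python
def transposed_conv2d(x, kernel, stride=2):
    """Single-channel transposed convolution without padding."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * ((w - 1) * stride + kw) for _ in range((h - 1) * stride + kh)]
    for i in range(h):
        for j in range(w):
            for a in range(kh):
                for b in range(kw):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out


def add_maps(a, b):
    """Element-wise addition of two feature maps of equal resolution."""
    assert len(a) == len(b) and len(a[0]) == len(b[0])
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def cascaded_fusion(group1, group2, group3, kernel_a, kernel_b):
    """Fuse three groups, from lowest resolution to highest."""
    intermediate3 = transposed_conv2d(group3, kernel_a)        # group-2 resolution
    first_fused = add_maps(intermediate3, group2)              # first fused feature map
    first_fused_up = transposed_conv2d(first_fused, kernel_b)  # group-1 resolution
    return add_maps(first_fused_up, group1)                    # fused feature map


# Toy inputs: 8x8, 4x4, and 2x2 maps filled with ones.
group1 = [[1.0] * 8 for _ in range(8)]
group2 = [[1.0] * 4 for _ in range(4)]
group3 = [[1.0] * 2 for _ in range(2)]
ones = [[1.0, 1.0], [1.0, 1.0]]  # placeholder weights, not learned values
fused = cascaded_fusion(group1, group2, group3, ones, ones)
print(len(fused), len(fused[0]))
```

With a 2×2 kernel and stride 2, each transposed convolution exactly doubles the height and width, so the fused feature map ends at the resolution of the first group.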
[0106] In this embodiment, the second and third semantic segmentation feature maps can be fused first to obtain a fused result (i.e., the first fused feature map). Then, this fused result is fused with the first semantic segmentation feature map to obtain a fused feature map. Through this method, multi-level cascaded feature fusion is achieved, fully utilizing and strengthening the features extracted during the feature extraction stage. This better represents the multi-level feature information of the original image, thereby improving the accuracy of model training and the accuracy of image segmentation based on the trained image segmentation model.
[0107] As another possible implementation, electronic devices can, for example, first unify the resolution of the first group of semantic segmentation feature maps, the resolution of the second group of semantic segmentation feature maps, and the resolution of the third group of semantic segmentation feature maps to the same resolution through a feature fusion sub-model, and then perform feature fusion on the semantic segmentation feature maps of each group after unifying the resolution.
[0108] For example, an electronic device can reduce the resolution of the first set of semantic segmentation feature maps to the resolution of the second set of semantic segmentation feature maps, and increase the resolution of the third set of semantic segmentation feature maps to the resolution of the second set of semantic segmentation feature maps. Through this method, the resolutions of the first, second, and third sets of semantic segmentation feature maps can all be equal to the resolution of the second set of semantic segmentation feature maps. Then, the electronic device can perform feature fusion on the semantic segmentation feature maps with unified resolution to obtain a fused feature map with the resolution of the second set of semantic segmentation feature maps.
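A minimal sketch of this alternative follows. This application does not say how the resolution is reduced or increased here, so 2×2 average pooling and pixel repetition are used purely as stand-ins, and element-wise addition is assumed as the fusion operation.

```python
def avg_pool_2x2(x):
    """Halve height and width by averaging each 2x2 block (stand-in)."""
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]


def upsample_2x(x):
    """Double height and width by repeating each pixel (stand-in)."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_at_group2_resolution(group1, group2, group3):
    """Bring groups 1 and 3 to the group-2 resolution, then add all three."""
    g1 = avg_pool_2x2(group1)
    g3 = upsample_2x(group3)
    return [[a + b + c for a, b, c in zip(r1, r2, r3)]
            for r1, r2, r3 in zip(g1, group2, g3)]


# Toy inputs with distinct constant values per group.
group1 = [[4.0] * 8 for _ in range(8)]
group2 = [[2.0] * 4 for _ in range(4)]
group3 = [[1.0] * 2 for _ in range(2)]
fused = fuse_at_group2_resolution(group1, group2, group3)
print(len(fused), len(fused[0]))
```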
[0109] The following section details how electronic devices fuse the first fused feature map (after increasing the resolution) and the first group of semantic segmentation feature maps to obtain the fused feature map:
[0110] As one possible implementation, the electronic device can first fuse the first fused feature map with increased resolution and the first group of semantic segmentation feature maps to obtain an initial fused feature map. Then, the electronic device can perform spatial dimension convolution on the initial fused feature map to obtain an initial fused feature map with altered spatial dimensions.
[0111] For example, taking an initial fused feature map that is a 256×256×3 multidimensional matrix, performing a spatial dimension convolution on it can yield an initial fused feature map with altered spatial dimensions that is, for example, a 256×256×1 matrix. Alternatively, taking an initial fused feature map that is a 256×256×1 matrix, performing a spatial dimension convolution on it can yield an initial fused feature map with altered spatial dimensions that is, for example, a 256×256×3 multidimensional matrix.
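The shape change in this example (256×256×3 to 256×256×1, or the reverse) is what a 1×1 convolution across the third dimension would produce. The sketch below assumes that reading of "spatial dimension convolution"; the weights are arbitrary, and a 4×4 map stands in for 256×256.

```python
def pointwise_conv(feature_map, weights):
    """1x1 convolution.

    `feature_map` is an H x W x C_in nested list and `weights` is a
    C_out x C_in nested list; the result is H x W x C_out.
    """
    return [[[sum(w * v for w, v in zip(w_out, pixel)) for w_out in weights]
             for pixel in row]
            for row in feature_map]


h = w = 4  # stands in for 256 x 256
x = [[[1.0, 2.0, 3.0] for _ in range(w)] for _ in range(h)]

to_one = [[0.5, 0.25, 0.25]]       # assumed weights: 3 channels -> 1
y = pointwise_conv(x, to_one)      # H x W x 1

to_three = [[1.0], [2.0], [3.0]]   # assumed weights: 1 channel -> 3
z = pointwise_conv(y, to_three)    # H x W x 3

print(len(y), len(y[0]), len(y[0][0]))
print(len(z), len(z[0]), len(z[0][0]))
```

The height and width are unchanged in both directions; only the size of the third dimension changes.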
[0112] Optionally, the electronic device can, for example, perform spatial dimension convolution on the initial fused feature map using a convolutional layer capable of changing the spatial dimension of the initial fused feature map, to obtain an initial fused feature map with altered spatial dimensions. Optionally, the aforementioned convolutional layer capable of changing the spatial dimension of the initial fused feature map can refer to existing implementations of convolutional layers, and will not be elaborated here.
[0113] After obtaining the initial fusion feature map with the aforementioned spatial dimension change, the electronic device can extract semantic features from the initial fusion feature map with the spatial dimension change to obtain the image semantic feature map corresponding to the initial fusion feature map with the spatial dimension change.
[0114] For example, taking the initial fusion feature map with the spatial dimension change as a 256×256 matrix as an example, by extracting semantic features from the initial fusion feature map with the spatial dimension change, the "image semantic feature map corresponding to the initial fusion feature map with the spatial dimension change" can be, for example, a 128×128 matrix.
[0115] Optionally, the electronic device can, for example, use a convolutional layer capable of semantic feature extraction to extract semantic features from the initial fused feature map whose spatial dimensions have been altered, thereby obtaining an image semantic feature map corresponding to the initial fused feature map whose spatial dimensions have been altered. Optionally, the aforementioned convolutional layer capable of semantic feature extraction can refer to the existing implementation methods of convolutional layers, and will not be elaborated here.
[0116] Then, the electronic device can perform spatial dimension convolution and semantic feature extraction on the image semantic feature map corresponding to the initial fusion feature map with spatial dimension change to obtain the fusion feature map.
[0117] For example, taking the "image semantic feature map corresponding to the initial fusion feature map with spatial dimension change" as a matrix in the format of 256×256×3, by performing spatial dimension convolution and semantic feature extraction on the "image semantic feature map corresponding to the initial fusion feature map with spatial dimension change", the aforementioned fusion feature map can be, for example, a matrix of 128×128×1.
[0118] Optionally, the electronic device can, for example, use a convolutional layer capable of performing spatial convolution and semantic feature extraction to perform spatial convolution and semantic feature extraction on the "image semantic feature map corresponding to the initial fusion feature map with changed spatial dimensions" to obtain the fusion feature map. Optionally, the convolutional layer capable of performing spatial convolution and semantic feature extraction can refer to the existing implementation of convolutional layers, and will not be elaborated here.
[0119] As another possible implementation, the electronic device can directly fuse the first fused feature map with the increased resolution and the first group of semantic segmentation feature maps to obtain an initial fused feature map, which can then be used as the fused feature map. Alternatively, the electronic device can perform spatial dimension convolution on the initial fused feature map to obtain an initial fused feature map with altered spatial dimensions, which can then be used as the fused feature map. Furthermore, the electronic device can perform semantic feature extraction on the initial fused feature map with altered spatial dimensions to obtain an "image semantic feature map corresponding to the initial fused feature map with altered spatial dimensions," which can then be used as the fused feature map.
[0120] The following provides illustrative examples of the lightweight feature extraction sub-model, the feature fusion sub-model, and the feature restoration sub-model:
[0121] 1. Lightweight feature extraction sub-model
[0122] The lightweight feature extraction sub-model can include M lightweight feature extractors (also called feature extractors). For a single lightweight feature extractor, Figure 4 is a schematic diagram of the architecture of a feature extractor provided in this application. As shown in Figure 4, the input to this feature extractor can be the original image (the original image referred to here is the aforementioned sample image). The feature extractor can include a multi-channel convolutional pipeline (i.e., the aforementioned at least one multi-channel convolutional layer) and multiple convolution operators. The multiple convolution operators include a standard convolution operator (i.e., the aforementioned first preset number of convolutional layers) and a dilated convolution operator (i.e., the aforementioned second preset number of dilated convolutional layers). Through this feature extractor, features can be extracted from the original image to obtain intermediate feature results (i.e., the aforementioned set of semantic segmentation feature maps).
[0123] Optionally, this feature extractor can also be based on a lightweight image classification network. Each feature extractor can, for example, introduce a downsampling factor of 2 to continuously aggregate feature information.
[0124] In this feature extractor, a dilated convolution operator is introduced to increase the effective convolution kernel, thereby obtaining a larger receptive field and reducing feature loss during feature extraction. Optionally, a dilated convolution operator can also be used to replace the downsampling pooling structure after the last two convolutional layers, which can effectively reduce the downsampling factor during feature extraction. Furthermore, the parameters of the dilated convolution operator can be learned to achieve the function of pooling layers aggregating feature information. The aforementioned multiple convolution operators can be responsible for performing convolution operations in both spatial and planar dimensions. Spatial convolution can be used to identify information in different color channels of the original image, while planar convolution can be used to identify semantic and positional information of different receptive field sizes in the original image.
[0125] 2. Feature fusion sub-model (also known as feature fusion unit)
[0126] Figure 5 is a schematic diagram of the architecture of a feature fusion unit provided in this application. As shown in Figure 5, the input to this feature fusion unit can be intermediate feature segmentation result A (i.e., the aforementioned third group of semantic segmentation feature maps), intermediate feature segmentation result B (i.e., the aforementioned second group of semantic segmentation feature maps), and intermediate feature segmentation result C (i.e., the aforementioned first group of semantic segmentation feature maps). This feature fusion unit can include feature alignment operators and multi-dimensional feature enhancement pipelines. The output of this feature fusion unit can be intermediate feature results (i.e., the aforementioned fused feature maps).
[0127] The feature fusion unit is used to process the multi-level features generated during the feature extraction stage. Considering that the granularity and size of the feature information represented at different feature extraction stages are different (i.e., the sizes of different groups of semantic segmentation feature maps are different), a feature alignment operator is introduced into the feature fusion unit to handle the problem of inconsistent multi-level feature sizes. Simultaneously, a multi-dimensional feature enhancement pipeline is introduced to enhance the dimensions of features of different sizes to strengthen semantic and feature information.
[0128] When the feature fusion unit fuses intermediate feature segmentation results A, B, and C, a multi-layer cascade approach can be used, as described below:
[0129] 1. Perform a transposed convolution operation on intermediate feature segmentation result A so that it has the same size as intermediate feature segmentation result B, and then fuse the upsampled intermediate feature segmentation result A with intermediate feature segmentation result B to generate intermediate features;
[0130] 2. Perform a transposed convolution operation on the intermediate features obtained in the previous step so that they have the same size as intermediate feature segmentation result C, and then fuse the upsampled intermediate features with intermediate feature segmentation result C to obtain the initial fused feature map.
[0131] The multidimensional feature enhancement pipeline of the feature fusion unit can employ a three-layer enhancement convolution operator. The first convolution operator enhances spatial dimensional information in the spatial direction (i.e., it performs spatial dimension convolution on the initial fused feature map to obtain an initial fused feature map with a changed spatial dimension). The second convolution operator enhances image semantic information in the planar direction (i.e., it extracts semantic features from the initial fused feature map with a changed spatial dimension to obtain the corresponding image semantic feature map). The third convolution operator comprehensively enhances the overall fused feature (i.e., it performs spatial dimension convolution and semantic feature extraction on that image semantic feature map to obtain the fused feature map).
[0132] 3. Feature restoration sub-model (also known as feature restorer)
[0133] To restore the generated fused feature map to its original image size, feature restoration is needed to repair the feature information lost during feature extraction. Traditional methods typically use linear interpolation to restore image resolution, finding nearby integer points based on pixels in the low-resolution image according to certain rules and using the values of these integer points to fill in missing pixels in the high-resolution image. However, directly using the values of adjacent pixels to fill in the image during interpolation is insufficient to effectively compensate for the information loss caused during the image feature extraction stage, resulting in poor image segmentation accuracy.
[0134] Figure 6 is a schematic diagram of the architecture of a feature restorer provided in this application. As shown in Figure 6, the input to this feature restorer is the aforementioned feature fusion result (fused feature map). This feature restorer may include a transposed convolution operator and a convolution parameter learning element. The output of the feature restorer can be a predicted feature map (i.e., the aforementioned predicted segmentation result).
[0135] The transposed convolution operator described above can use the chain rule of derivatives to calculate the gradients of the loss function with respect to its parameters. The convolution parameter learning element then updates the parameters based on these gradients, thereby implementing the transposed convolution operation on the fused feature map. Compared with traditional interpolation methods, the parameters of the transposed convolution are learned iteratively from the gradients, which enables them to capture the texture features of the pixels neighboring the pixel to be restored and thus to generate upsampling results with better semantic continuity.
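To make the contrast with fixed interpolation concrete, the following sketch learns the kernel of a one-dimensional, stride-2 transposed convolution by gradient descent. The squared-error loss, the learning rate, the toy data, and the function names are all assumptions for illustration; this application does not specify them.

```python
def transposed_conv1d(x, kernel, stride=2):
    """1-D transposed convolution without padding."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for a, k in enumerate(kernel):
            out[i * stride + a] += v * k
    return out


def kernel_gradient(x, kernel, target, stride=2):
    """Gradient of 0.5 * sum((out - target)^2) w.r.t. each kernel weight.

    By the chain rule, d(loss)/d(kernel[a]) sums (out - target) at every
    output position that kernel[a] contributes to, times the input value.
    """
    out = transposed_conv1d(x, kernel, stride)
    err = [o - t for o, t in zip(out, target)]
    grad = [0.0] * len(kernel)
    for i, v in enumerate(x):
        for a in range(len(kernel)):
            grad[a] += err[i * stride + a] * v
    return grad


x = [1.0, 2.0, 3.0]                       # low-resolution input
target = [1.0, 0.5, 2.0, 1.0, 3.0, 1.5]   # desired high-resolution output
kernel = [0.0, 0.0]                       # parameters to be learned
learning_rate = 0.05                      # assumed value

for _ in range(200):
    grad = kernel_gradient(x, kernel, target)
    kernel = [k - learning_rate * g for k, g in zip(kernel, grad)]

print([round(k, 4) for k in kernel])
```

The toy target was generated with the kernel [1.0, 0.5], so the learned weights approach those values. An interpolation rule, by contrast, has no parameters that could adapt to the data.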
[0136] Optionally, Figure 7 is a schematic diagram of the architecture of a deep learning trainer provided in this application. As shown in Figure 7, this deep learning trainer can train the aforementioned neural network model based on a labeled training image set (which may include a sample image set and predicted segmentation results, i.e., the original images and the labeled prediction result images shown in Figure 7). Optionally, as shown in Figure 7, the electronic device can also receive scene requirement parameters, so that the trained image segmentation model is applicable to that scene.
[0137] This deep learning trainer can include deep learning training generation components, parameter tuning operators, and result analysis components. Based on the results generated by the result analysis components, the training generation components iteratively invoke the parameter tuning operators to tune the neural network with respect to the batch size and learning rate, ultimately outputting the optimal training model. This optimal training model can serve as the trained image segmentation model.
[0138] Since the aforementioned neural network model consists of a feature extractor, a feature fusion unit, and a feature restorer, parameters such as the number of convolutional layers, the number of convolutional kernels, and the channel dimension used in the feature extraction, feature fusion, and feature restoration stages can all be configured. Optionally, after learning and parameter tuning, the trained model can be tested and validated using test set data to determine the optimal training model.
[0139] After obtaining the trained image segmentation model, it can be used to perform image segmentation. The execution entity of this image segmentation method can be any electronic device with processing capabilities, such as a terminal or server. Optionally, the electronic device executing the image segmentation method can be the same electronic device as the aforementioned electronic device executing the image segmentation model training method, or they can be different electronic devices.
[0140] It should be understood that this application does not limit the application scenarios of the above image segmentation method. For example, if the image segmentation method is applied to autonomous driving, the electronic device executing the image segmentation method can be a vehicle. If the image segmentation method is applied to intelligent security inspection, the electronic device executing the image segmentation method can be a security inspection device.
[0141] The image segmentation technology solution provided in this application will be described in detail below with reference to specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.
[0142] Figure 8 is a flowchart illustrating an image segmentation method provided in this application. As shown in Figure 8, the method may include the following steps:
[0143] S201. Receive a target image.
[0144] The target image includes the target object to be segmented.
[0145] For example, taking an electronic device that includes an image acquisition device as an example, the electronic device can acquire an image including the target object as a target image through the image acquisition device. Alternatively, the electronic device can also use the acquired image as an initial image and use the image after preprocessing such as image enhancement on the initial image as the target image.
[0146] Alternatively, the electronic device may also receive target images sent by other devices. Or, the electronic device may also receive target images input by the user via an API or GUI.
[0147] S202. Input the target image into the trained image segmentation model to obtain the image segmentation result corresponding to the target image.
[0148] The trained image segmentation model can be obtained by training using the method described in any of the foregoing embodiments. In the above image segmentation results, the pixel values corresponding to the target object are different from the pixel values of the regions other than the target object.
[0149] In some embodiments, after acquiring the image segmentation result corresponding to the target object, the electronic device may, for example, perform a target operation corresponding to the image segmentation result based on the image segmentation result. For instance, taking a vehicle as an example, the electronic device may optionally perform path planning or control vehicle speed based on the image segmentation result.
[0150] In some embodiments, the electronic device may also generate and output a segmentation result image based on the pixel values corresponding to the target object and the pixel values of the regions other than the target object in the above image segmentation result.
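One way to generate such a segmentation result image is sketched below, assuming the segmentation result is a grid of per-pixel labels. The label value, the output pixel values (255 and 0), and the plain-text PGM output format are assumptions chosen for illustration.

```python
def render_segmentation(labels, target_label=1, target_value=255, other_value=0):
    """Map per-pixel labels to a grayscale image.

    Pixels of the target object get `target_value`; all other pixels
    get `other_value`, so the two regions are visually distinct.
    """
    return [[target_value if v == target_label else other_value for v in row]
            for row in labels]


def to_pgm(image):
    """Serialize a grayscale image as a plain-text PGM (P2) string."""
    height, width = len(image), len(image[0])
    rows = "\n".join(" ".join(str(v) for v in row) for row in image)
    return f"P2\n{width} {height}\n255\n{rows}\n"


# Toy segmentation result: 1 marks the target object, 0 everything else.
labels = [
    [0, 1, 1],
    [0, 0, 1],
]
image = render_segmentation(labels)
pgm_text = to_pgm(image)
print(pgm_text)
```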
[0151] In this embodiment, the trained image segmentation model can extract features from the target image using a trained lightweight feature extraction sub-model, and then perform multi-scale feature fusion on the extracted features using a feature fusion sub-model. The image segmentation result is then obtained based on the fusion result. Because the trained image segmentation model has high efficiency and accuracy, the image segmentation method provided in this application has high image segmentation efficiency, avoids significant consumption of computational resources and time, and can meet real-time requirements well. It can be used in image semantic segmentation scenarios with strict real-time requirements, such as embedded devices with limited computational resources, while ensuring the accuracy of semantic segmentation.
[0152] For example, Figure 9 is a schematic diagram of the architecture of a trained image segmentation model provided in this application. The trained image segmentation model may include: a trained feature extractor, a trained feature fusion unit, and a trained feature restorer. Therefore, based on the input target image (e.g., the original image to be segmented shown in Figure 9), the trained image segmentation model can output a semantic segmentation result image (that is, the aforementioned image segmentation result).
[0153] For example, Figure 10 is a schematic diagram of the architecture of a semantic segmenter provided in this application. As shown in Figure 10, the input of the semantic segmenter can include: the image segmentation result "after feature extraction, feature fusion, and feature restoration" output by the best trained model for the current scene (i.e., the trained neural network model), which is the feature input shown in Figure 10, and scene requirement parameters. The semantic segmenter can include pixel-level classification elements and semantic recognition elements. Specifically, the semantic segmenter uses the pixel-level classification elements to classify each pixel in the input image segmentation result by image category, based on the scene requirement parameters. The semantic recognition element then identifies, based on this category, the target entity class to which the pixel belongs in the original image, and finally outputs a semantic segmentation result image.
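The two stages of the semantic segmenter can be sketched as follows, assuming its input holds one score per class for every pixel and that the scene requirement parameters supply the list of class names. Both assumptions, as well as the function names and sample values, are hypothetical.

```python
def classify_pixels(scores):
    """Pixel-level classification: pick the highest-scoring class index.

    `scores` is an H x W x K nested list of per-class scores.
    """
    return [[max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
            for row in scores]


def name_entities(class_map, class_names):
    """Semantic recognition: map class indices to entity class names."""
    return [[class_names[c] for c in row] for row in class_map]


# Hypothetical scene requirement parameter for a driving scene.
scene_classes = ["background", "road", "vehicle"]

# Toy 2 x 2 input with three class scores per pixel.
scores = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]],
]
class_map = classify_pixels(scores)
entities = name_entities(class_map, scene_classes)
print(class_map)
print(entities)
```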
[0154] For example, Figure 11 is a schematic diagram of semantic segmentation result images provided in this application. As shown in Figure 11, the first column is the original input image, the second column is the original ground truth label image, and the third column is the semantic segmentation result image obtained using the image segmentation method provided in this application. The fourth and fifth columns are both semantic segmentation result images obtained using existing image segmentation methods. As Figure 11 shows, the semantic segmentation result image obtained by the image segmentation method provided in this application has high accuracy.
[0155] Figure 12 is a schematic diagram of the structure of an image segmentation model training device provided in this application. As shown in Figure 12, the image segmentation model training device includes: an acquisition module 31 and a training module 32. Specifically:
[0156] The acquisition module 31 is used to acquire a sample image set. The sample image set includes: at least one sample image, and a label corresponding to each sample image; the label is used to characterize the object to be segmented from the sample image.
[0157] Training module 32 is used to train the neural network model N times based on the sample images in the sample image set and the label corresponding to each sample image, to obtain a trained image segmentation model. The neural network model includes: a lightweight feature extraction sub-model, a feature fusion sub-model, and a feature restoration sub-model. Specifically, during the i-th round of training, training module 32 is used for:
[0158] Extracting features from the sample image through M lightweight feature extractors in the lightweight feature extraction sub-model to obtain M sets of semantic segmentation feature maps, where N≥1 and M≥2.
[0159] The feature fusion sub-model is used to fuse the M groups of semantic segmentation feature maps to obtain a fused feature map; the resolution of the fused feature map is smaller than the resolution of the sample image.
[0160] By using at least one transposed convolutional layer in the feature restoration sub-model, the resolution of the fused feature map is increased to the resolution of the sample image to obtain the predicted segmentation result;
[0161] Based on the predicted segmentation results and the labels corresponding to the sample images, the neural network model is trained for the (i+1)th round.
[0162] Optionally, the lightweight feature extractor includes: at least one multi-channel convolutional layer, a first preset number of convolutional layers, and a second preset number of dilated convolutional layers; the first preset number is greater than the second preset number; the second preset number of dilated convolutional layers are interspersed among the first preset number of convolutional layers. In this implementation, the training module 32 is specifically used to extract features from multiple color channels of the (j-1)th group of semantic segmentation feature maps using the at least one multi-channel convolutional layer for the j-th lightweight feature extractor, obtaining a sub-semantic segmentation feature map; and to extract features from the sub-semantic segmentation feature map using the convolutional layer and the dilated convolutional layer, obtaining the j-th group of semantic segmentation feature maps. Wherein, j is greater than 1 and less than or equal to M; the input of the at least one multi-channel convolutional layer of the first lightweight feature extractor is the sample image.
[0163] Optionally, M equals 3, the resolution of the first group of semantic segmentation feature maps is greater than the resolution of the second group of semantic segmentation feature maps, and the resolution of the second group of semantic segmentation feature maps is greater than the resolution of the third group of semantic segmentation feature maps. The training module 32 is specifically used to perform transposed convolution on the third group of semantic segmentation feature maps to obtain intermediate feature maps corresponding to the third group; to perform feature fusion on the intermediate feature maps corresponding to the third group and the second group of semantic segmentation feature maps to obtain a first fused feature map; to perform transposed convolution on the first fused feature map to obtain the first fused feature map with increased resolution; and to perform feature fusion on the first fused feature map with increased resolution and the first group of semantic segmentation feature maps to obtain a fused feature map. Wherein, the resolution of the intermediate feature maps corresponding to the third group is the same as the resolution of the second group of semantic segmentation feature maps; the resolution of the first fused feature map with increased resolution is the same as the resolution of the first group of semantic segmentation feature maps.
[0164] Optionally, the training module 32 is specifically used to perform feature fusion on the first fused feature map after the resolution is increased and the first group of semantic segmentation feature maps to obtain an initial fused feature map; to perform spatial dimension convolution on the initial fused feature map to obtain an initial fused feature map with changed spatial dimensions; to perform semantic feature extraction on the initial fused feature map with changed spatial dimensions to obtain an image semantic feature map corresponding to the initial fused feature map with changed spatial dimensions; and to perform spatial dimension convolution and semantic feature extraction on the image semantic feature map corresponding to the initial fused feature map with changed spatial dimensions to obtain a fused feature map.
[0165] The image segmentation model training device provided in this application is used to execute the aforementioned image segmentation model training method embodiment. Its implementation principle and technical effect are similar, and will not be described in detail here.
[0166] Figure 13 is a schematic diagram of the structure of an image segmentation device provided in this application. As shown in Figure 13, the image segmentation device includes a receiving module 41 and a processing module 42, wherein:
[0167] The receiving module 41 is used to receive the target image. The target image includes the target object to be segmented.
[0168] Processing module 42 is used to input the target image into a trained image segmentation model to obtain an image segmentation result corresponding to the target image. The trained image segmentation model is trained using the method described in any of the foregoing embodiments; in the image segmentation result, the pixel value corresponding to the target object is different from the pixel value of the region other than the target object.
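As a hedged illustration of the final property, the sketch below turns per-pixel scores into a segmentation result in which target pixels and all other pixels carry different values. The score format, the threshold of 0.5, and the output values 255 and 0 are assumptions chosen for the example; the application only requires that the two pixel values differ.

```python
def to_segmentation_result(scores, threshold=0.5, target_value=255, background_value=0):
    """Map per-pixel target scores to a two-valued segmentation result.

    scores is a list of rows, each row a list of floats in [0, 1]
    (a hypothetical model output format).
    """
    if target_value == background_value:
        raise ValueError("target and background pixel values must differ")
    return [
        [target_value if s >= threshold else background_value for s in row]
        for row in scores
    ]


if __name__ == "__main__":
    predicted = [[0.9, 0.2], [0.4, 0.7]]
    print(to_segmentation_result(predicted))
```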
[0169] The image segmentation apparatus provided in this application is used to perform the aforementioned image segmentation method embodiments. Its implementation principle and technical effect are similar, and will not be described in detail here.
[0170] Figure 14 is a schematic diagram of the structure of an electronic device provided in this application. As shown in Figure 14, the electronic device 500 may include at least one processor 501 and a memory 502.
[0171] The memory 502 is used to store programs. Specifically, the program may include program code, which includes computer operation instructions.
[0172] The memory 502 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk storage device.
[0173] The processor 501 is used to execute the computer-executable instructions stored in the memory 502 to implement the image segmentation model training method and the image segmentation method described in the foregoing method embodiments. The processor 501 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application.
[0174] Optionally, the electronic device 500 may also include a communication interface 503. In specific implementations, if the communication interface 503, memory 502, and processor 501 are implemented independently, they can be interconnected via a bus to complete communication. The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc., but this does not imply that there is only one bus or one type of bus.
[0175] Optionally, in a specific implementation, if the communication interface 503, memory 502, and processor 501 are integrated on a single chip, then the communication interface 503, memory 502, and processor 501 can communicate through an internal interface.
[0176] This application also provides a computer-readable storage medium, which may include various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk. Specifically, the computer-readable storage medium stores program instructions, which are used to perform the methods described in the above embodiments.
[0177] This application also provides a program product including executable instructions stored in a readable storage medium. At least one processor of an electronic device can read the executable instructions from the readable storage medium, and the execution of the executable instructions by the at least one processor causes the electronic device to implement the image segmentation model training and image segmentation methods provided in the various embodiments described above.
[0178] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
Claims
1. A method for training an image segmentation model, characterized in that the method includes:
obtaining a sample image set; the sample image set includes at least one sample image and a label corresponding to each sample image; the label is used to characterize the object to be segmented in the sample image;
training a neural network model for N rounds, based on the sample images in the sample image set and the label corresponding to each sample image, to obtain a trained image segmentation model; the neural network model includes a lightweight feature extraction sub-model, a feature fusion sub-model, and a feature restoration sub-model, wherein the i-th round of training includes:
extracting features from the sample image through M lightweight feature extractors in the lightweight feature extraction sub-model to obtain M groups of semantic segmentation feature maps; the N and the M ;
fusing the M groups of semantic segmentation feature maps using the feature fusion sub-model to obtain a fused feature map; the resolution of the fused feature map is smaller than the resolution of the sample image;
increasing the resolution of the fused feature map to the resolution of the sample image using at least one transposed convolutional layer in the feature restoration sub-model to obtain a predicted segmentation result;
performing the (i+1)-th round of training on the neural network model based on the predicted segmentation result and the label corresponding to the sample image;
M equals 3, the resolution of the first group of semantic segmentation feature maps is greater than the resolution of the second group of semantic segmentation feature maps, and the resolution of the second group of semantic segmentation feature maps is greater than the resolution of the third group of semantic segmentation feature maps;
the step of fusing the M groups of semantic segmentation feature maps using the feature fusion sub-model to obtain a fused feature map includes:
performing transposed convolution on the third group of semantic segmentation feature maps to obtain the intermediate feature map corresponding to the third group; the resolution of the intermediate feature map corresponding to the third group is the same as the resolution of the second group of semantic segmentation feature maps;
performing feature fusion on the intermediate feature map corresponding to the third group and the second group of semantic segmentation feature maps to obtain a first fused feature map; the resolution of the first fused feature map is the same as the resolution of the second group of semantic segmentation feature maps;
performing transposed convolution on the first fused feature map to obtain the first fused feature map with increased resolution; the resolution of the first fused feature map with increased resolution is the same as the resolution of the first group of semantic segmentation feature maps;
performing feature fusion on the first fused feature map with increased resolution and the first group of semantic segmentation feature maps to obtain an initial fused feature map;
performing spatial dimension convolution on the initial fused feature map to obtain an initial fused feature map with changed spatial dimensions;
performing semantic feature extraction on the initial fused feature map with changed spatial dimensions to obtain the image semantic feature map corresponding to the initial fused feature map with changed spatial dimensions;
performing spatial dimension convolution and semantic feature extraction on the image semantic feature map corresponding to the initial fused feature map with changed spatial dimensions to obtain the fused feature map;
the lightweight feature extractor includes: at least one multi-channel convolutional layer, a first preset number of convolutional layers, and a second preset number of dilated convolutional layers; the first preset number is greater than the second preset number; the second preset number of dilated convolutional layers are interspersed among the first preset number of convolutional layers;
the step of extracting features from the sample image through M lightweight feature extractors in the lightweight feature extraction sub-model to obtain M groups of semantic segmentation feature maps includes:
for the j-th lightweight feature extractor, extracting features from multiple color channels of the (j-1)-th group of semantic segmentation feature maps through the at least one multi-channel convolutional layer to obtain a sub-semantic segmentation feature map; where j is greater than 1 and less than or equal to M; the input of the at least one multi-channel convolutional layer of the first lightweight feature extractor is the sample image;
extracting features from the sub-semantic segmentation feature map through the convolutional layers and the dilated convolutional layers to obtain the j-th group of semantic segmentation feature maps.
2. An image segmentation method, characterized in that the method includes:
receiving a target image; the target image includes the target object to be segmented;
inputting the target image into a trained image segmentation model to obtain an image segmentation result corresponding to the target image; the trained image segmentation model is trained using the method described in claim 1; in the image segmentation result, the pixel value corresponding to the target object is different from the pixel value of the region other than the target object.
3. An image segmentation model training device, characterized in that the device includes:
an acquisition module, used to acquire a sample image set; the sample image set includes at least one sample image and a label corresponding to each sample image; the label is used to characterize the object to be segmented in the sample image;
a training module, used to train a neural network model for N rounds, based on the sample images in the sample image set and the label corresponding to each sample image, to obtain a trained image segmentation model; the neural network model includes a lightweight feature extraction sub-model, a feature fusion sub-model, and a feature restoration sub-model; the lightweight feature extractor includes: at least one multi-channel convolutional layer, a first preset number of convolutional layers, and a second preset number of dilated convolutional layers; the first preset number is greater than the second preset number; the second preset number of dilated convolutional layers are interspersed among the first preset number of convolutional layers;
for the i-th round of training, the training module is specifically used to:
extract features from the sample image through M lightweight feature extractors in the lightweight feature extraction sub-model to obtain M groups of semantic segmentation feature maps; the N and the M ;
fuse the M groups of semantic segmentation feature maps using the feature fusion sub-model to obtain a fused feature map; the resolution of the fused feature map is smaller than the resolution of the sample image;
increase the resolution of the fused feature map to the resolution of the sample image using at least one transposed convolutional layer in the feature restoration sub-model to obtain a predicted segmentation result;
perform the (i+1)-th round of training on the neural network model based on the predicted segmentation result and the label corresponding to the sample image;
M equals 3, the resolution of the first group of semantic segmentation feature maps is greater than the resolution of the second group of semantic segmentation feature maps, and the resolution of the second group of semantic segmentation feature maps is greater than the resolution of the third group of semantic segmentation feature maps;
the training module is specifically used to:
perform transposed convolution on the third group of semantic segmentation feature maps to obtain the intermediate feature map corresponding to the third group; the resolution of the intermediate feature map corresponding to the third group is the same as the resolution of the second group of semantic segmentation feature maps;
perform feature fusion on the intermediate feature map corresponding to the third group and the second group of semantic segmentation feature maps to obtain a first fused feature map; the resolution of the first fused feature map is the same as the resolution of the second group of semantic segmentation feature maps;
perform transposed convolution on the first fused feature map to obtain the first fused feature map with increased resolution; the resolution of the first fused feature map with increased resolution is the same as the resolution of the first group of semantic segmentation feature maps;
perform feature fusion on the first fused feature map with increased resolution and the first group of semantic segmentation feature maps to obtain an initial fused feature map;
perform spatial dimension convolution on the initial fused feature map to obtain an initial fused feature map with changed spatial dimensions;
perform semantic feature extraction on the initial fused feature map with changed spatial dimensions to obtain the image semantic feature map corresponding to the initial fused feature map with changed spatial dimensions;
perform spatial dimension convolution and semantic feature extraction on the image semantic feature map corresponding to the initial fused feature map with changed spatial dimensions to obtain the fused feature map;
the training module is specifically used to:
for the j-th lightweight feature extractor, extract features from multiple color channels of the (j-1)-th group of semantic segmentation feature maps through the at least one multi-channel convolutional layer to obtain a sub-semantic segmentation feature map; where j is greater than 1 and less than or equal to M; the input of the at least one multi-channel convolutional layer of the first lightweight feature extractor is the sample image;
extract features from the sub-semantic segmentation feature map through the convolutional layers and the dilated convolutional layers to obtain the j-th group of semantic segmentation feature maps.
4. An image segmentation apparatus, characterized in that the apparatus includes:
a receiving module, used to receive a target image; the target image includes the target object to be segmented;
a processing module, used to input the target image into a trained image segmentation model to obtain an image segmentation result corresponding to the target image; the trained image segmentation model is trained using the method described in claim 1; in the image segmentation result, the pixel value corresponding to the target object is different from the pixel value of the region other than the target object.
5. An electronic device, characterized in that the electronic device includes a memory and a processor; the memory stores a computer program; the processor is configured to execute the method of claim 1 or 2 via the computer program.
6. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions that, when executed by a processor, implement the method of claim 1 or 2.
7. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method of claim 1 or 2.