A high-precision semantic segmentation method for an automatic driving road scene
By constructing a semantic segmentation network based on the ResNet model, and combining dual-branch feature extraction and efficient aggregation pyramid pooling modules, the problem of low segmentation accuracy in autonomous driving road scenarios is solved, achieving high-precision and efficient semantic segmentation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGSU AEROSPACE DAWEI TECH CO LTD
- Filing Date
- 2023-11-28
- Publication Date
- 2026-06-12
Figure CN117649526B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of transportation technology, and in particular to a high-precision semantic segmentation method for autonomous driving road scenarios.
Background Technology
[0002] Image segmentation is a key technology in image processing and machine vision, and an important component of computer vision, enabling deeper analysis and understanding of images. The technique subdivides an image into different sub-regions and is a pixel-level image analysis process. Image segmentation is currently divided into semantic segmentation, instance segmentation, and panoptic segmentation, which distinguish target entities by category, by individual instance, and by a combination of the two, respectively. Semantic segmentation is the foundation and most important form of image segmentation: it delineates regions precisely by classifying each pixel in the image. In essence, it is a fine-grained pixel-by-pixel classification task that assigns each pixel to one of N categories, for example mapping the background to 0 and the foreground targets to the other N-1 categories.
[0003] Early traditional methods primarily aimed to achieve accurate image segmentation and fall into three main categories: region-based, threshold-based, and edge-based image segmentation methods. These were combined with specific image processing algorithms, such as morphology-based segmentation, wavelet analysis and transform-based segmentation, region level set-based segmentation, and corner-based segmentation. Subsequently, machine learning methods, such as Bayesian classifiers and support vector machines (SVMs), began to be widely used. While these image segmentation methods can achieve a certain level of accuracy, they still rely on prior knowledge, exhibit poor robustness when segmenting complex targets, have weak fine-grained information extraction capabilities, and have limited learning ability. They struggle to learn the parameters of general models from limited samples, making them unsuitable for practical application in real-world scenarios.
[0004] Thanks to the rapid development of deep learning, convolutional neural networks (CNNs) have been applied to semantic segmentation, significantly outperforming traditional methods based on manual features. They have also made substantial progress in fields such as autonomous driving, medical image processing, satellite remote sensing, and drone navigation. Building image segmentation models using CNNs allows for end-to-end training of the algorithm. CNNs excel in parameter sharing and efficient aggregation of local information. However, semantic segmentation typically requires long-range dependencies. To integrate global information, basic CNN models need to stack many convolutional layers, resulting in relatively low accuracy for segmentation methods in autonomous driving road scenarios.
Summary of the Invention
[0005] To address the aforementioned problems and technical requirements, this application proposes a high-precision semantic segmentation method for autonomous driving road scenarios. The technical solution of this application is as follows:
[0006] A high-precision semantic segmentation method for autonomous driving road scenarios, the high-precision semantic segmentation method includes:
[0007] The semantic segmentation network architecture is based on the ResNet model. The semantic segmentation network includes a feature preprocessing module, a dual-branch fusion module, an efficient aggregation pyramid pooling module, an attention module, and a segmentation head module. The dual-branch fusion module includes a shallow detail feature extraction branch and a deep semantic feature extraction branch that are fused together. After the input image is processed by the feature preprocessing module, it enters the shallow detail feature extraction branch and the deep semantic feature extraction branch respectively. The deep semantic feature map output by the deep semantic feature extraction branch is input to the efficient aggregation pyramid pooling module. The feature map output by the efficient aggregation pyramid pooling module is added to the shallow detail feature map output by the shallow detail feature extraction branch, and then passes through the attention module and is input to the segmentation head module.
[0008] Construct a segmentation sample dataset for autonomous driving road scenarios, and use the segmentation sample dataset to train a model based on the network architecture of a semantic segmentation network;
[0009] High-precision semantic segmentation is achieved in autonomous driving road scenarios using a semantic segmentation network that has completed model training.
[0010] The further technical solution is that the shallow detail feature extraction branch includes N shallow feature extraction layers, and the deep semantic feature extraction branch includes N deep feature extraction layers, where N≥2;
[0011] The shallow detail feature map output by the i-th shallow feature extraction layer is downsampled and then concatenated with the deep semantic feature map output by the i-th deep feature extraction layer before being input into the (i+1)-th deep feature extraction layer. The deep semantic feature map output by the i-th deep feature extraction layer is first compressed through a 1×1 convolution, then upsampled using bilinear interpolation, and then concatenated with the shallow detail feature map output by the i-th shallow feature extraction layer before being input into the (i+1)-th shallow feature extraction layer. The parameter 1≤i≤N-1.
[0012] The further technical solution is that the shallow detail feature extraction branch includes three shallow feature extraction layers, and the output image size of each shallow feature extraction layer remains unchanged from the input image size; the deep semantic feature extraction branch includes three deep feature extraction layers, and the output image size of each deep feature extraction layer is 1 / 2 of the input image size.
[0013] The further technical solution is as follows: In the efficient aggregation pyramid pooling module, the deep semantic feature map output by the deep semantic feature extraction branch is input into a 1×1 convolution, an average pooling unit, and a global average pooling unit, respectively. The feature map output by the deep semantic feature map after the 1×1 convolution is added to the feature map output by the average pooling unit, and then subjected to a 3×3 convolution to obtain the average pooling feature map. The feature map output by the deep semantic feature map after the 1×1 convolution is added to the feature map output by the global average pooling unit, and then subjected to a 3×3 convolution to obtain the global average pooling feature map. The feature map output by the deep semantic feature map after the 1×1 convolution, the average pooling feature map, and the global average pooling feature map are concatenated and then subjected to a 1×1 convolution to output a fused feature map. The feature map output by the deep semantic feature map after the 1×1 convolution is added to the fused feature map and then output.
[0014] The further technical solution is as follows: the average pooling unit in the efficient aggregation pyramid pooling module includes a first pooling layer, a second pooling layer, and a third pooling layer connected in series. The feature map obtained by applying the 1×1 convolution to the deep semantic feature map is added in parallel to the feature maps output by the three pooling layers: the feature map output by the first pooling layer is passed through a 1×1 convolution and upsampling and then added to the 1×1-convolved deep semantic feature map, and the feature maps output by the second pooling layer and the third pooling layer are each processed and added in the same way. The feature map output by the global average pooling unit is likewise passed through a 1×1 convolution and upsampling and then added to the 1×1-convolved deep semantic feature map.
[0015] The further technical solution is that the first pooling layer has a pooling kernel size of 5 and a stride of 2, the second pooling layer has a pooling kernel size of 3 and a stride of 2, and the third pooling layer has a pooling kernel size of 3 and a stride of 2.
[0016] The further technical solution is that the attention module adopts the three-dimensional attention model TDAM.
[0017] A further technical solution is that the feature preprocessing module includes two consecutive 3×3 convolutional layers, which are used to downsample the size of the image input to the semantic segmentation network to 1 / 8.
[0018] Its further technical solution involves constructing a segmentation sample dataset for autonomous driving road scenarios, including:
[0019] Video data from autonomous driving road scenarios is acquired, and keyframe images are extracted as sample images. Different segmentation targets in the sample images are labeled and converted to generate mask images. The resulting segmentation sample dataset includes several sample images and a mask image of the same size for each sample image. The mask image corresponding to each sample image contains the label information of each segmentation target in the sample image. The label information includes location information and attribute information. The location information of the segmentation target includes several coordinate points given in the order of labeling. The contour information in the mask image is formed by conversion. The attribute information of the segmentation target is the category information of the segmentation target. The corresponding contour region pixel values are obtained by conversion, and the contour region pixel values are different for different segmentation targets.
[0020] Its further technical solution involves training a model based on the network architecture of a semantic segmentation network using a segmentation sample dataset, including:
[0021] The segmented sample dataset is randomly divided into a training set, a validation set, and a test set;
[0022] Model pre-training is performed on the ImageNet dataset using the network architecture of the semantic segmentation network.
[0023] The model parameters of the semantic segmentation network are initialized based on the results of model pre-training. Sample images from the training set are input into the semantic segmentation network to obtain predicted segmentation results. The error between the predicted segmentation results and the mask image corresponding to the input sample images is calculated using the cross-entropy loss function. The gradient of the model parameters of the semantic segmentation network is backpropagated according to the cross-entropy loss function, and the model parameters are updated using the gradient descent method. After each iteration, the performance of the semantic segmentation network is evaluated using the validation set until the semantic segmentation network converges.
[0024] The sample images from the test set are input into the converged semantic segmentation network to obtain the predicted segmentation results. The predicted segmentation result for each sample image is compared with the mask image corresponding to the input sample image, and the mean intersection-over-union (mIoU) is calculated as $\mathrm{mIoU} = \frac{1}{P}\sum_{i=1}^{P}\mathrm{IoU}_i$, where $P$ is the total number of categories of the segmentation targets contained in all sample images, $\mathrm{IoU}_i = \frac{1}{Q}\sum_{j=1}^{Q}\mathrm{IoU}_{i,j}$ is the average intersection-over-union of the i-th category, $Q$ is the total number of sample images in the test set, and $\mathrm{IoU}_{i,j}$ is the intersection-over-union (IoU) between the predicted segmentation result of the j-th sample image in the test set and the corresponding mask image. For any $j$, $\mathrm{IoU}_{i,j} = \frac{TP}{TP + FP + FN}$, where TP is the number of pixels in the j-th sample image whose predicted segmentation result is positive and whose corresponding mask image is also positive; FP is the number of pixels in the j-th sample image whose predicted segmentation result is positive and whose corresponding mask image is negative; and FN is the number of pixels in the j-th sample image whose predicted segmentation result is negative and whose corresponding mask image is positive.
[0025] The beneficial technical effects of this application are:
[0026] This application discloses a high-precision semantic segmentation method for autonomous driving road scenarios. The method constructs a semantic segmentation network based on the ResNet model, which includes a dual-branch feature network. The shallow detail feature extraction branch and the deep semantic feature extraction branch fuse information to improve the segmentation ability of the semantic segmentation network. The efficient aggregation pyramid pooling module in the semantic segmentation network can obtain contextual information, and the attention module added at the end of the large-scale branch network can further enhance the semantic information carried by the deep features extracted by the convolutional neural network, highlighting important semantic information in the feature map, and further improving the segmentation ability of the semantic segmentation network, thereby improving the segmentation accuracy in autonomous driving road scenarios.
[0027] The efficient aggregation pyramid pooling module in the semantic segmentation network of this application improves the pooling calculation method, which can reduce the model's computational load and improve the model's inference speed without increasing the number of parameters.
Attached Figure Description
[0028] Figure 1 is a network architecture diagram of the semantic segmentation network constructed in this application.
[0029] Figure 2 is a structural diagram of the efficient aggregation pyramid pooling module in this application.
[0030] Figure 3 is a flowchart of a high-precision semantic segmentation method according to an embodiment of this application.
Detailed Implementation
[0031] The specific embodiments of this application will be further described below with reference to the accompanying drawings.
[0032] This application discloses a high-precision semantic segmentation method for autonomous driving road scenarios, which includes:
[0033] Step 1: Construct the network architecture of the semantic segmentation network based on the ResNet model.
[0034] Referring to Figure 1, the semantic segmentation network architecture includes a feature preprocessing module, a dual-branch fusion module, an efficient aggregation pyramid pooling module, an attention module, and a segmentation head module. Among them:
[0035] (1) The feature preprocessing module is used to preprocess the input image. In one embodiment, the feature preprocessing module includes two consecutive 3×3 convolutional layers. The two consecutive 3×3 convolutional branches are used as basic modules to construct the subsequent network. The size of the input image to the semantic segmentation network is downsampled to 1 / 8 using a basic block and a 3×3 convolution. Replacing the original 7×7 convolution with two consecutive 3×3 convolutions can effectively reduce the parameters and computational cost of the semantic segmentation network.
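As a non-limiting illustration, the parameter saving mentioned above can be checked with a short Python sketch. The helper name `conv_params` and the channel width of 64 are assumptions made only for this example; the comparison assumes equal input and output channel counts and no bias terms, and the actual saving depends on the channel configuration used.

```python
def conv_params(kernel: int, c_in: int, c_out: int, bias: bool = False) -> int:
    """Number of learnable weights in a kernel x kernel 2-D convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


C = 64  # hypothetical channel width, not specified in this application

# One 7x7 convolution mapping C channels to C channels.
single_7x7 = conv_params(7, C, C)

# Two consecutive 3x3 convolutions, each mapping C channels to C channels.
stacked_3x3 = conv_params(3, C, C) + conv_params(3, C, C)

print(single_7x7, stacked_3x3)          # 200704 73728
print(stacked_3x3 / single_7x7)         # 18/49, roughly 0.37
```

Under these assumptions the two stacked 3×3 layers hold 18·C² weights against 49·C² for the single 7×7 layer.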
[0036] (2) The dual-branch fusion module includes a shallow detail feature extraction branch and a deep semantic feature extraction branch that are fused together.
[0037] With reference to Figure 1, the shallow detail feature extraction branch includes N shallow feature extraction layers, and the deep semantic feature extraction branch includes N deep feature extraction layers, where N ≥ 2. In one embodiment, the shallow detail feature extraction branch includes three shallow feature extraction layers, and the output image size of each shallow feature extraction layer remains unchanged from the input image size, so that all three outputs are 1 / 8 of the size of the image input to the semantic segmentation network. The deep semantic feature extraction branch includes three deep feature extraction layers, and the output image size of each deep feature extraction layer is 1 / 2 of the input image size, thus being 1 / 16, 1 / 32, and 1 / 64 of the size of the image input to the semantic segmentation network, respectively.
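For illustration only, the resolutions of the two branches in this embodiment can be traced with the following Python sketch. The function name `branch_resolutions` and the 1024×2048 input size are assumptions of the example, and the input size is assumed to be divisible by 64 so that integer division is exact.

```python
def branch_resolutions(height, width, num_layers=3, stem_factor=8):
    """Trace feature-map sizes through the preprocessing module and both branches."""
    # The feature preprocessing module downsamples the input to 1/8.
    stem = (height // stem_factor, width // stem_factor)

    # Each shallow layer keeps the size of its input.
    shallow = [stem for _ in range(num_layers)]

    # Each deep layer halves the size of its input.
    deep = []
    h, w = stem
    for _ in range(num_layers):
        h, w = h // 2, w // 2
        deep.append((h, w))
    return stem, shallow, deep


stem, shallow, deep = branch_resolutions(1024, 2048)  # hypothetical input size
print(stem)     # (128, 256)                      -> 1/8
print(shallow)  # [(128, 256), (128, 256), (128, 256)]
print(deep)     # [(64, 128), (32, 64), (16, 32)] -> 1/16, 1/32, 1/64
```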
[0038] The two-branch fusion between the shallow detail feature extraction branch and the deep semantic feature extraction branch includes fusing the shallow detail feature map extracted by the shallow detail feature extraction branch into the deep semantic feature extraction branch, and fusing the deep semantic feature map extracted by the deep semantic feature extraction branch into the shallow detail feature extraction branch. (a) For the fusion of the shallow detail feature map to the deep semantic feature extraction branch: the shallow detail feature map output by any i-th shallow feature extraction layer is downsampled and then concatenated with the deep semantic feature map output by the i-th deep feature extraction layer and input into the (i+1)-th deep feature extraction layer. The downsampling process here typically uses a 3×3 convolution operation with a stride of 2 to achieve consistency of the feature maps. (b) For the fusion of deep semantic feature map to shallow detail feature extraction branch: the deep semantic feature map output by the i-th deep feature extraction layer is first compressed through a 1×1 convolution to reduce the computational cost, and then upsampled using bilinear interpolation. It is then added to the shallow detail feature map output by the i-th shallow feature extraction layer and input into the (i+1)-th shallow feature extraction layer, with parameter 1≤i≤N-1.
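The two fusion directions can be sketched at the level of tensor shapes as follows. This is a minimal sketch under stated assumptions, not the implementation of this application: the function names, the channel widths, the choice to compress the deep map to the shallow channel width, and the repetition of the stride-2 convolution until the sizes match are all hypothetical. Because the deep-to-shallow step is worded both as a concatenation and as an addition in this application, the sketch exposes both through a hypothetical `mode` argument.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def shallow_to_deep(shallow, deep):
    """Shapes are (channels, height, width). Returns the shape fed to the next deep layer."""
    c_s, h, w = shallow
    c_d, h_d, w_d = deep
    # Assumed: 3x3 convolution, stride 2, padding 1, repeated until sizes match.
    while h > h_d:
        h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    if (h, w) != (h_d, w_d):
        raise ValueError("downsampled shallow map does not match the deep map")
    return (c_s + c_d, h_d, w_d)  # channel-wise concatenation


def deep_to_shallow(deep, shallow, mode="add"):
    """Returns the shape fed to the next shallow layer."""
    c_s, h, w = shallow
    # Assumed: the 1x1 convolution compresses the deep map to c_s channels,
    # and bilinear upsampling restores the shallow spatial size (h, w).
    if mode == "add":
        return (c_s, h, w)        # element-wise addition keeps the channel count
    if mode == "concat":
        return (2 * c_s, h, w)    # concatenation stacks the channels
    raise ValueError("mode must be 'add' or 'concat'")


shallow = (64, 128, 256)  # hypothetical channel width at 1/8 resolution
deep = (128, 64, 128)     # hypothetical channel width at 1/16 resolution
print(shallow_to_deep(shallow, deep))            # (192, 64, 128)
print(deep_to_shallow(deep, shallow))            # (64, 128, 256)
print(deep_to_shallow(deep, shallow, "concat"))  # (128, 128, 256)
```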
[0039] After the input image is processed by the feature preprocessing module, it enters the shallow detail feature extraction branch and the deep semantic feature extraction branch respectively. The deep semantic feature extraction branch finally outputs the deep semantic feature map of the last deep feature extraction layer, and the shallow detail feature extraction branch finally outputs the shallow detail feature map of the last shallow feature extraction layer.
[0040] (3) Efficient aggregation pyramid pooling module
[0041] To improve accuracy, a pyramid pooling module (PPM) with different pooling scales can be added to the output of the deep semantic feature extraction branch to help extract multi-scale feature information. PPM can embed contextual information well, but the concatenated feature map obtained by using only a single 3×3 convolution or 1×1 convolution cannot effectively capture contextual information. Furthermore, if hierarchical residuals are used to fuse with the output of the large pooling layer hierarchically, parallel computation cannot be achieved, which is very time-consuming for lightweight embedded networks.
[0042] Therefore, this application improves upon this by adding an efficient aggregation pyramid pooling module to the output of the deep semantic feature extraction branch. The deep semantic feature map output by the deep semantic feature extraction branch is then input into the efficient aggregation pyramid pooling module to extract rich contextual information.
[0043] Referring to Figure 2, in the efficient aggregation pyramid pooling module, the deep semantic feature map output from the deep semantic feature extraction branch is input into a 1×1 convolution, an average pooling unit, and a global average pooling unit, respectively. The feature map output from the 1×1 convolution is added to the feature map output from the average pooling unit, and then subjected to a 3×3 convolution to obtain the average pooling feature map. The feature map output from the 1×1 convolution is added to the feature map output from the global average pooling unit, and then subjected to a 3×3 convolution to obtain the global average pooling feature map. The feature map output from the 1×1 convolution, the average pooling feature map, and the global average pooling feature map are concatenated and then subjected to a 1×1 convolution to output a fused feature map. Finally, the feature map output from the 1×1 convolution is added to the fused feature map and then output.
[0044] The deep semantic feature map, after the 1×1 convolution, is fused residually with the output of each pooling unit in turn, which allows the fusion to be parallelized. Additionally, the average pooling unit in the efficient aggregation pyramid pooling module comprises a first pooling layer, a second pooling layer, and a third pooling layer connected in series. The feature map obtained by applying the 1×1 convolution to the deep semantic feature map is added in parallel to the feature maps output by the three pooling layers: the feature map output by the first pooling layer is passed through a 1×1 convolution and upsampling and then added to the 1×1-convolved deep semantic feature map, and the feature maps output by the second pooling layer and the third pooling layer are each processed and added in the same way. Furthermore, the feature map output by the global average pooling unit is passed through a 1×1 convolution and upsampling and then added to the 1×1-convolved deep semantic feature map. In one embodiment, the first pooling layer has a pooling kernel size of 5 and a stride of 2, the second pooling layer has a pooling kernel size of 3 and a stride of 2, and the third pooling layer has a pooling kernel size of 3 and a stride of 2. That is, the efficient aggregation pyramid pooling module removes the pooling layers with large pooling kernels, uses multiple pooling layers with small pooling kernels in series, and then outputs their results in parallel. Generally speaking, the larger the kernel size, the greater the computational load and the more time it takes. Therefore, this approach can reduce the computational load and improve model performance without increasing the number of parameters.
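The effect of chaining small pooling kernels can be illustrated with standard receptive-field arithmetic. The following Python sketch is an illustration under stated assumptions: the function names and the 64×64 feature-map area are hypothetical, border and padding effects are ignored, and the "direct" cost refers to hypothetical single pooling layers covering the same window, which this application does not enumerate.

```python
def receptive_fields(layers):
    """Return (receptive_field, total_stride) after each serial pooling layer.

    layers is a list of (kernel, stride) pairs applied one after another.
    """
    field, jump = 1, 1
    result = []
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # neighbouring taps are `jump` input pixels apart
        jump *= stride                # strides multiply along the chain
        result.append((field, jump))
    return result


def window_reads(num_pixels, kernel, total_stride):
    """Approximate number of values read by one pooling layer (borders ignored)."""
    outputs = num_pixels // (total_stride * total_stride)
    return kernel * kernel * outputs


serial = [(5, 2), (3, 2), (3, 2)]    # kernels and strides from the embodiment
coverage = receptive_fields(serial)  # [(5, 2), (9, 4), (17, 8)]

pixels = 64 * 64                     # hypothetical feature-map area
serial_cost = sum(window_reads(pixels, k, s)
                  for (k, _), (_, s) in zip(serial, coverage))
direct_cost = sum(window_reads(pixels, f, s) for f, s in coverage)
print(coverage, serial_cost, direct_cost)  # ... 28480 64832
```

In this sketch the three serial layers cover 5×5, 9×9, and 17×17 input windows while reading fewer than half as many values as single layers with those kernel sizes would. The serial result is not numerically identical to a single large average pooling, since averages of overlapping averages weight the input non-uniformly.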
[0045] (4) Attention Module
[0046] The feature map output by the efficient aggregation pyramid pooling module is added to the shallow detail feature map output by the shallow detail feature extraction branch and then input into the attention module.
[0047] Typically, attention is categorized into channel-based one-dimensional attention and spatial two-dimensional attention. To better utilize channel and spatial attention, this application employs a three-dimensional attention model, TDAM, in its attention module. This model assigns a different weight to each neuron in the semantic segmentation network, thereby assigning different degrees of importance to the output semantic information and enhancing the focus on important targets. Theoretically, TDAM, as a general attention module, can be inserted after each convolutional layer to improve that layer's output. However, this application adds TDAM only before the segmentation head module because the deep feature maps of a deep convolutional neural network carry more semantic information. The aim is to strengthen the representation of this semantic information in the feature maps, thereby improving the overall performance of the model.
[0048] (5) The segmentation head module ultimately realizes image segmentation and outputs the image segmentation results.
[0049] Step 2: Construct a segmentation sample dataset for autonomous driving road scenarios.
[0050] The segmentation sample dataset constructed in this application uses real-world data. Cameras are mounted on the acquisition platform to ensure correct installation and calibration, accurately acquiring real-time data of autonomous driving road scenes. Sensor parameters are adjusted, including exposure time, focal length, and field of view, to ensure the acquired image quality meets requirements. Road scenes are determined according to needs, and path planning is performed. Continuously acquired video data is stored in an appropriate medium. Furthermore, to ensure the reliability of the acquired video data quality, preliminary processing is performed to evaluate aspects such as video clarity and exposure, thereby eliminating low-quality or abnormal video data.
[0051] Then, keyframe images are extracted from the acquired video data as sample images. Different segmentation targets in the sample images are labeled and converted to generate mask images, constructing a segmentation sample dataset consisting of several sample images and a mask image of the same size for each sample image. The mask image corresponding to each sample image contains the label information of each segmentation target in the sample image. The label information includes location and attribute information, obtained from a JSON annotation file. The JSON label file of the sample image can be obtained using the LabelMe semantic annotation tool. The location information of the segmentation target includes several coordinate points given in the order of labeling, which are converted to form the contour information in the mask image. The attribute information of the segmentation target is the category information of the segmentation target, which is converted to obtain the corresponding contour region pixel values. Different segmentation targets have different contour region pixel values; for example, the contour region pixel value of the background is 0, while the contour region pixel values of other categories of segmentation targets are 1, 2, 3, etc.
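The conversion from labeled coordinate points to a mask image can be sketched as follows. This is a dependency-free illustration, not the converter used in this application: the function names, the `class_ids` mapping, and the 6×6 image are hypothetical, and the `label` and `points` keys are assumed to follow the layout of a LabelMe JSON `shapes` entry. Pixels are filled when their centers lie inside the polygon (even-odd rule).

```python
def point_in_polygon(x, y, polygon):
    """Even-odd test: is the point (x, y) inside the closed polygon?"""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygons_to_mask(height, width, shapes, class_ids):
    """Build a mask whose pixel values encode the category of each target."""
    mask = [[0] * width for _ in range(height)]  # background pixel value is 0
    for shape in shapes:
        value = class_ids[shape["label"]]
        for row in range(height):
            for col in range(width):
                # Test the pixel center against the labeled contour.
                if point_in_polygon(col + 0.5, row + 0.5, shape["points"]):
                    mask[row][col] = value
    return mask


shapes = [{"label": "road", "points": [[1, 1], [4, 1], [4, 4], [1, 4]]}]
mask = polygons_to_mask(6, 6, shapes, {"road": 1})  # hypothetical category id
for row in mask:
    print(row)
```

Here the 3×3 block of pixels in rows 1 to 3 and columns 1 to 3 receives the value 1, and all other pixels keep the background value 0.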
[0052] Step 3: Train the model using the network architecture of the semantic segmentation network based on the segmentation sample dataset.
[0053] After constructing the segmentation sample dataset in step 2, the dataset is randomly divided into a training set, a validation set, and a test set, typically in a 6:2:2 ratio. Considering that excessively large training set images, while improving the detection accuracy of the semantic segmentation network, can also negatively impact its detection speed and increase memory consumption, a balance between detection accuracy and speed is struck. The training set images are then reduced to a predetermined size, while the validation and test set images are standardized to facilitate subsequent batch processing.
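A random 6:2:2 division of the dataset can be sketched as below. The function name `split_dataset`, the fixed seed, and the file-name pattern are assumptions of this example; the seed is used only to make the illustration reproducible.

```python
import random


def split_dataset(samples, ratios=(6, 2, 2), seed=0):
    """Randomly divide samples into training, validation, and test subsets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # shuffle a copy, leave the input untouched
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


names = [f"frame_{k:04d}.png" for k in range(100)]  # hypothetical sample names
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))  # 60 20 20
```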
[0054] First, the model is pre-trained on the network architecture of the semantic segmentation network using the ImageNet dataset. The semantic segmentation network constructed in this application is a dual-branch structure improved from the ResNet structure. The original pre-trained weights cannot be well adapted to the new model, while the public dataset has good universality and can learn general features and patterns, which can then be transferred to the segmentation sample dataset. Therefore, the model is pre-trained on the public dataset first.
[0055] The model parameters of the semantic segmentation network are initialized based on the pre-training results. Sample images from the training set are input into the semantic segmentation network to obtain predicted segmentation results. The error between the predicted segmentation result and the mask image corresponding to the input sample image is calculated using the cross-entropy loss function. The gradient of the model parameters of the semantic segmentation network is backpropagated according to the cross-entropy loss function, and the model parameters are updated using gradient descent. After each iteration, the performance of the semantic segmentation network is evaluated using a validation set. If overfitting occurs, the model parameters of the semantic segmentation network can be adjusted and optimized to improve its generalization ability. If the cross-entropy loss function tends to stabilize during iteration, the semantic segmentation network is considered to have converged, and the training process is complete.
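The loss calculation and the convergence criterion described above can be illustrated with the following sketch. It is written in plain Python for clarity rather than speed; the function names, the window length, and the tolerance are hypothetical, and the plateau test is only one possible reading of the loss "tending to stabilize".

```python
import math


def pixel_cross_entropy(logits, mask):
    """Mean cross-entropy over all pixels.

    logits[row][col] is a list of per-category scores for that pixel;
    mask[row][col] is the category index stored in the mask image.
    """
    total, count = 0.0, 0
    for logit_row, mask_row in zip(logits, mask):
        for scores, target in zip(logit_row, mask_row):
            peak = max(scores)  # subtract the maximum for numerical stability
            log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
            total += log_sum - scores[target]  # equals -log softmax(scores)[target]
            count += 1
    return total / count


def has_converged(losses, window=5, tolerance=1e-3):
    """Assumed criterion: the last `window` losses vary by less than `tolerance`."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) < tolerance


# A single pixel with two equally scored categories gives a loss of ln 2.
print(pixel_cross_entropy([[[0.0, 0.0]]], [[0]]))
print(has_converged([0.9, 0.5, 0.3001, 0.3002, 0.3001, 0.3000, 0.3001]))  # True
```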
[0056] After training is complete, the trained semantic segmentation network can be tested using the test set. This includes inputting sample images from the test set into the converged semantic segmentation network to obtain predicted segmentation results, comparing the predicted segmentation result for each sample image with the mask image corresponding to the input sample image, and calculating the mean intersection-over-union (mIoU) as $\mathrm{mIoU} = \frac{1}{P}\sum_{i=1}^{P}\mathrm{IoU}_i$, where $P$ is the total number of categories of the segmentation targets contained in all sample images, $\mathrm{IoU}_i = \frac{1}{Q}\sum_{j=1}^{Q}\mathrm{IoU}_{i,j}$ is the average intersection-over-union of the i-th category, $Q$ is the total number of sample images in the test set, and $\mathrm{IoU}_{i,j}$ is the intersection-over-union (IoU) between the predicted segmentation result of the j-th sample image in the test set and the corresponding mask image. For any $j$, $\mathrm{IoU}_{i,j} = \frac{TP}{TP + FP + FN}$, where TP is the number of pixels in the j-th sample image whose predicted segmentation is positive and whose corresponding mask is also positive; FP is the number of pixels in the j-th sample image whose predicted segmentation is positive and whose corresponding mask is negative; and FN is the number of pixels in the j-th sample image whose predicted segmentation is negative and whose corresponding mask is positive. The mIoU index ranges from 0 to 1, and the closer its value is to 1, the better the segmentation performance of the trained semantic segmentation network in autonomous driving road scenarios.
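The evaluation index can be sketched in Python as follows, averaging the per-image IoU over the test images for each category and then over the categories. The function names are hypothetical. One detail is an assumption of this sketch and is not specified in this application: an image in which a category appears in neither the prediction nor the mask (TP + FP + FN = 0) is skipped for that category.

```python
def image_iou(pred, mask, class_id):
    """IoU of one category in one image, or None if the category is absent."""
    tp = fp = fn = 0
    for pred_row, mask_row in zip(pred, mask):
        for p, m in zip(pred_row, mask_row):
            if p == class_id and m == class_id:
                tp += 1   # predicted positive, mask positive
            elif p == class_id:
                fp += 1   # predicted positive, mask negative
            elif m == class_id:
                fn += 1   # predicted negative, mask positive
    union = tp + fp + fn
    return tp / union if union else None


def mean_iou(preds, masks, num_classes):
    """Average the per-category IoU over images, then over categories."""
    class_ious = []
    for class_id in range(num_classes):
        values = [image_iou(p, m, class_id) for p, m in zip(preds, masks)]
        values = [v for v in values if v is not None]  # assumed: skip absent categories
        if values:
            class_ious.append(sum(values) / len(values))
    if not class_ious:
        raise ValueError("no category appears in any prediction or mask")
    return sum(class_ious) / len(class_ious)


pred = [[0, 1],
        [1, 1]]
mask = [[0, 1],
        [0, 1]]
# Category 0: TP=1, FN=1 -> 1/2; category 1: TP=2, FP=1 -> 2/3.
print(mean_iou([pred], [mask], num_classes=2))  # (1/2 + 2/3) / 2
```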
[0057] It should be noted that steps 2 and 3 have no fixed execution order relative to step 1, in which the network architecture of the semantic segmentation network is constructed; as shown in the flowchart of Figure 3, they can be executed in parallel.
[0058] Once the semantic segmentation network is trained and its performance meets the requirements, it can be used to perform high-precision semantic segmentation in autonomous driving road scenarios. The segmentation result can be obtained by inputting the image to be segmented in the autonomous driving road scenario into the semantic segmentation network.
[0059] The above descriptions are merely preferred embodiments of this application, and this application is not limited to the above embodiments. It is understood that other improvements and variations that can be directly derived or conceived by those skilled in the art without departing from the spirit and concept of this application should be considered to be included within the protection scope of this application.
Claims
1. A high-precision semantic segmentation method for autonomous driving road scenarios, characterized in that the high-precision semantic segmentation method includes: The semantic segmentation network architecture is based on the ResNet model. The semantic segmentation network includes a feature preprocessing module, a dual-branch fusion module, an efficient aggregation pyramid pooling module, an attention module, and a segmentation head module. The dual-branch fusion module includes a shallow detail feature extraction branch and a deep semantic feature extraction branch that are fused together. After the input image is processed by the feature preprocessing module, it enters the shallow detail feature extraction branch and the deep semantic feature extraction branch respectively. The shallow detail feature extraction branch includes N shallow feature extraction layers, and the deep semantic feature extraction branch includes N deep feature extraction layers, where N≥2. The shallow detail feature map output by the i-th shallow feature extraction layer is downsampled, then concatenated and fused with the deep semantic feature map output by the i-th deep feature extraction layer, and input into the (i+1)-th deep feature extraction layer. The deep semantic feature map output by the i-th deep feature extraction layer is first compressed through a 1×1 convolution, then upsampled using bilinear interpolation, and then concatenated and fused with the shallow detail feature map output by the i-th shallow feature extraction layer, and input into the (i+1)-th shallow feature extraction layer, where parameter 1≤i≤N-1. The deep semantic feature map finally output by the deep semantic feature extraction branch is input into the efficient aggregation pyramid pooling module.
In the efficient aggregation pyramid pooling module, the deep semantic feature map finally output by the deep semantic feature extraction branch is input into a 1×1 convolution, an average pooling unit, and a global average pooling unit, respectively. The average pooling unit in the pyramid pooling module includes a first pooling layer, a second pooling layer, and a third pooling layer connected in series. The feature map output from the deep semantic feature map after a 1×1 convolution is added in parallel to the feature maps output from the three pooling layers. The feature map output from the first pooling layer is added to the feature map output from the deep semantic feature map after a 1×1 convolution and upsampling. Similarly, the feature map output from the second pooling layer is added to the feature map output from the deep semantic feature map after a 1×1 convolution and upsampling. The feature map output from the third pooling layer is added to the feature map output from the deep semantic feature map after a 1×1 convolution and upsampling. The global average pooling unit outputs... The feature map is convolved with a 1×1 convolution and upsampled, then added to the feature map output from the deep semantic feature map after a 1×1 convolution. The feature map output from the deep semantic feature map after a 1×1 convolution is added to the feature map output from the average pooling unit, and then convolved with a 3×3 convolution to obtain the average pooling feature map. The feature map output from the deep semantic feature map after a 1×1 convolution is added to the feature map output from the global average pooling unit, and then convolved with a 3×3 convolution to obtain the global average pooling feature map. 
The feature map output from the deep semantic feature map after a 1×1 convolution, the average pooling feature map, and the global average pooling feature map are concatenated and then convolved with a 1×1 convolution to output the fused feature map. The feature map output from the deep semantic feature map after a 1×1 convolution is added to the fused feature map and then output.The feature map output by the efficient aggregation pyramid pooling module is added to the shallow detail feature map output by the shallow detail feature extraction branch, then passed through the attention module and input into the segmentation head module; Construct a segmentation sample dataset for autonomous driving road scenarios, and use the segmentation sample dataset to train a model based on the network architecture of the semantic segmentation network; The semantic segmentation network, after model training, is used to perform high-precision semantic segmentation in autonomous driving road scenarios.
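The cross-branch wiring recited in claim 1 can be illustrated with a shape-only trace. This is a minimal sketch, not the patented implementation: it performs no convolution and only tracks (channels, height, width) tuples. The function names, the channel widths (`shallow_ch`, `deep_ch`), the channel doubling in the deep branch, and the assumption that each deep layer halves the resolution while each shallow layer keeps it are illustrative choices, not values stated in the claim.

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def concat(a: Shape, b: Shape) -> Shape:
    """Channel-wise concatenation; spatial sizes must already match."""
    assert a[1:] == b[1:], "spatial sizes must match before concatenation"
    return (a[0] + b[0], a[1], a[2])


def resize(x: Shape, like: Shape) -> Shape:
    """Stand-in for downsampling or bilinear upsampling to another map's size."""
    return (x[0], like[1], like[2])


def trace_dual_branch(x: Shape, n_layers: int = 3,
                      shallow_ch: int = 64, deep_ch: int = 128):
    """Return the (shallow, deep) output shapes after n_layers fused layers."""
    assert n_layers >= 2, "the claim requires N >= 2"
    shallow_in, deep_in = x, x
    for i in range(1, n_layers + 1):
        # Assumed layer behaviour: shallow keeps resolution, deep halves it.
        s_out = (shallow_ch, shallow_in[1], shallow_in[2])
        d_out = (deep_ch * 2 ** (i - 1), deep_in[1] // 2, deep_in[2] // 2)
        if i < n_layers:
            # Shallow -> deep: downsample, concatenate, feed deep layer i+1.
            deep_in = concat(resize(s_out, d_out), d_out)
            # Deep -> shallow: 1x1 compression (to shallow_ch channels),
            # bilinear upsampling, concatenate, feed shallow layer i+1.
            compressed = (shallow_ch, d_out[1], d_out[2])
            shallow_in = concat(resize(compressed, s_out), s_out)
    return s_out, d_out


shallow, deep = trace_dual_branch((32, 128, 256))
print(shallow, deep)  # (64, 128, 256) (512, 16, 32)
```

The trace makes the fusion constraint visible: each concatenation only succeeds after the other branch's map has been resized to the receiving branch's resolution, which is why the claim pairs downsampling with the shallow-to-deep path and upsampling with the deep-to-shallow path.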
2. The high-precision semantic segmentation method according to claim 1, characterized in that the shallow detail feature extraction branch includes three shallow feature extraction layers, and the output size of each shallow feature extraction layer is the same as its input size; the deep semantic feature extraction branch includes three deep feature extraction layers, and the output size of each deep feature extraction layer is 1/2 of its input size.
3. The high-precision semantic segmentation method according to claim 1, characterized in that the first pooling layer has a pooling kernel size of 5 and a stride of 2; the second pooling layer has a pooling kernel size of 3 and a stride of 2; and the third pooling layer has a pooling kernel size of 3 and a stride of 2.
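As a rough illustration of what the kernel and stride values in claim 3 imply, the sketch below computes the spatial size after each of the three serially connected pooling layers using the standard floor-mode output-size formula. The padding of `kernel // 2` is an assumption made here so that each layer halves an even input size; the claim does not specify padding, and the function names are hypothetical.

```python
def pooled_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a pooling layer along one axis (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def serial_pool_sizes(size: int):
    """Sizes after the first, second and third pooling layers in series."""
    sizes = []
    for kernel, stride in [(5, 2), (3, 2), (3, 2)]:  # values from claim 3
        # Assumed padding: kernel // 2 (not stated in the claim).
        size = pooled_size(size, kernel, stride, padding=kernel // 2)
        sizes.append(size)
    return sizes


print(serial_pool_sizes(32))  # [16, 8, 4]
```

Under this padding assumption the three layers yield feature maps at 1/2, 1/4, and 1/8 of the input resolution, which is why each pooled map has to be upsampled before it is added back to the full-resolution map in the pyramid pooling module.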
4. The high-precision semantic segmentation method according to claim 1, characterized in that the attention module uses the 3D attention model TDAM.
5. The high-precision semantic segmentation method according to claim 1, characterized in that the feature preprocessing module includes two consecutive 3×3 convolutional layers that downsample the image input to the semantic segmentation network to 1/8 of its original size.
6. The high-precision semantic segmentation method according to claim 1, characterized in that constructing the segmentation sample dataset for autonomous driving road scenarios includes: acquiring video data of autonomous driving road scenarios and extracting keyframe images as sample images; and labeling the different segmentation targets in the sample images and converting the labels to generate mask images. The resulting segmentation sample dataset includes several sample images and, for each sample image, a mask image of the same size. The mask image corresponding to each sample image contains the label information of each segmentation target in that sample image, and the label information includes location information and attribute information. The location information of a segmentation target consists of several coordinate points given in labeling order, which are converted to form the contour information in the mask image. The attribute information of a segmentation target is its category information, which is converted to the pixel value of the corresponding contour region, and the contour region pixel values differ for different segmentation targets.
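The conversion from ordered coordinate points to a mask image described in claim 6 might look like the following sketch. It is an illustration under stated assumptions, not the conversion actually used: the polygon is filled with an even-odd ray-casting test at pixel centers, later labels overwrite earlier ones, background pixels are 0, and the category-to-pixel-value mapping (`class_values`) and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Even-odd ray-casting test against a closed polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def labels_to_mask(height: int, width: int,
                   labels: Sequence[Tuple[str, Sequence[Point]]],
                   class_values: Dict[str, int]) -> List[List[int]]:
    """Build a mask the same size as the sample image; 0 is background."""
    mask = [[0] * width for _ in range(height)]
    for category, polygon in labels:
        value = class_values[category]  # attribute info -> region pixel value
        for row in range(height):
            for col in range(width):
                if point_in_polygon(col + 0.5, row + 0.5, polygon):
                    mask[row][col] = value
    return mask


labels = [("road", [(0, 2), (6, 2), (6, 4), (0, 4)]),
          ("car", [(1, 0), (3, 0), (3, 2), (1, 2)])]
mask = labels_to_mask(4, 6, labels, {"road": 1, "car": 2})
for row in mask:
    print(row)
```

For a real dataset, a raster library routine for polygon filling would be much faster than this per-pixel loop; the sketch only shows how location information becomes a contour region and how attribute information becomes that region's pixel value.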
7. The high-precision semantic segmentation method according to claim 6, characterized in that performing model training based on the network architecture of the semantic segmentation network using the segmentation sample dataset includes: randomly dividing the segmentation sample dataset into a training set, a validation set, and a test set; pre-training the network architecture of the semantic segmentation network on the ImageNet dataset, and initializing the model parameters of the semantic segmentation network from the pre-training results; inputting sample images from the training set into the semantic segmentation network to obtain predicted segmentation results; calculating the error between the predicted segmentation results and the mask images corresponding to the input sample images using the cross-entropy loss function; backpropagating the gradients of the model parameters of the semantic segmentation network according to the cross-entropy loss function, and updating the model parameters by gradient descent; after each iteration, evaluating the performance of the semantic segmentation network on the validation set, until the semantic segmentation network converges; inputting the sample images of the test set into the converged semantic segmentation network to obtain predicted segmentation results, comparing the predicted segmentation result of each sample image with the mask image corresponding to that sample image, and calculating the mean intersection over union $\mathrm{mIoU} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{IoU}_k$, where $K$ is the total number of categories of segmentation targets contained in all sample images, $\mathrm{IoU}_k$ is the average intersection over union of the $k$-th category, with $\mathrm{IoU}_k = \frac{1}{M}\sum_{m=1}^{M}\mathrm{IoU}_{k,m}$, and $M$ is the total number of sample images in the test set;
$\mathrm{IoU}_{k,m}$ is the intersection over union between the predicted segmentation result of the $m$-th sample image in the test set and the corresponding mask image, and for any $m$, $\mathrm{IoU}_{k,m} = \frac{TP_{k,m}}{TP_{k,m} + FP_{k,m} + FN_{k,m}}$, where $TP_{k,m}$ is the number of pixels in the $m$-th sample image whose predicted segmentation result is the positive class and whose corresponding mask image is the positive class, $FP_{k,m}$ is the number of pixels in the $m$-th sample image whose predicted segmentation result is the positive class and whose corresponding mask image is the negative class, and $FN_{k,m}$ is the number of pixels in the $m$-th sample image whose predicted segmentation result is the negative class and whose corresponding mask image is the positive class.
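The test-set evaluation in claim 7 can be sketched in plain Python as below. It assumes the usual intersection-over-union definition TP / (TP + FP + FN), computed per category and per image, averaged over images and then over categories. One choice here is an assumption rather than something the claim states: an image in which a category appears in neither the prediction nor the mask is skipped for that category instead of contributing a value. The function names and the toy masks are hypothetical.

```python
from typing import List, Optional, Sequence

Mask = Sequence[Sequence[int]]  # 2-D grid of category indices


def image_iou(pred: Mask, target: Mask, cls: int) -> Optional[float]:
    """IoU of one category on one image; None if absent from both."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == cls and t == cls:
                tp += 1  # predicted positive, mask positive
            elif p == cls:
                fp += 1  # predicted positive, mask negative
            elif t == cls:
                fn += 1  # predicted negative, mask positive
    total = tp + fp + fn
    return tp / total if total else None


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask],
             num_classes: int) -> float:
    """Average per-image IoU within each category, then across categories."""
    class_means: List[float] = []
    for cls in range(num_classes):
        ious = [image_iou(p, t, cls) for p, t in zip(preds, targets)]
        ious = [v for v in ious if v is not None]  # assumed: skip absent cases
        if ious:
            class_means.append(sum(ious) / len(ious))
    return sum(class_means) / len(class_means)


preds = [[[0, 1], [1, 1]]]    # one 2x2 predicted segmentation result
targets = [[[0, 1], [0, 1]]]  # its mask image
print(mean_iou(preds, targets, num_classes=2))
```

In this toy case category 0 has an IoU of 1/2 and category 1 has an IoU of 2/3, so the mean is 7/12.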