A method for constructing a zero-shot semantic segmentation map by combining a road visual language model with dynamic objects

By employing vector quantization and multimodal data fusion, image features under adverse weather conditions are recovered and high-precision dynamic object segmentation maps are generated. This solves the perception and map building problems of autonomous driving systems under adverse weather conditions, and improves the system's perception and decision-making capabilities.

CN120747503B · Active · Publication Date: 2026-06-19 · BEIHANG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIHANG UNIV
Filing Date
2025-06-18
Publication Date
2026-06-19


Abstract

This invention discloses a zero-shot semantic segmentation map construction method that combines a road visual language model with dynamic objects, belonging to the technical field of zero-shot semantic segmentation map construction methods. This method uses a vector-quantized image restoration model to recover features from degraded panoramic images under severe weather conditions, thereby restoring image features to those under normal weather conditions. Then, the road visual language model is used to analyze the restored image, generate semantic descriptions, and extract information about key elements on the road.

Description

Technical Field

[0001] This invention relates to a zero-shot semantic segmentation map construction method, and more particularly to a zero-shot semantic segmentation map construction method that combines a road visual language model with dynamic objects, belonging to the technical field of zero-shot semantic segmentation map construction methods.

Background Technology

[0002] With the continuous development of autonomous driving technology, building high-precision, dynamically updated maps has become crucial to ensuring the stable operation of autonomous driving systems. Especially under adverse weather conditions (such as rain, snow, and fog), the performance of sensors (such as cameras and LiDAR) in autonomous vehicles is greatly limited, leading to a decrease in perception capabilities and affecting the accuracy of map building. Therefore, how to improve the perception accuracy and map building capabilities of autonomous driving systems under complex weather conditions is one of the challenges currently facing autonomous driving technology.

[0003] Traditional image restoration methods are often optimized for one specific type of severe weather and adapt poorly to other weather conditions. In autonomous driving systems, map updating and generation must not only restore image quality under severe weather conditions but also identify and handle changes in dynamic objects in a timely manner. To address these issues, zero-shot semantic segmentation and map building techniques for dynamic objects, which incorporate visual language models, have emerged. Through multimodal data fusion, image restoration, and semantic segmentation, the perception capabilities of autonomous driving systems in dynamic environments can be effectively improved, and accurate vector maps can be constructed.

[0004] This technology proposes a map construction method that combines a road visual language model and zero-shot semantic segmentation of dynamic objects.

Summary of the Invention

[0005] The main objective of this invention is to provide a zero-shot semantic segmentation map construction method that combines a road visual language model with dynamic objects, aiming to solve the problems of image restoration, dynamic object detection, and semantic information generation under adverse weather conditions, thereby enhancing the perception accuracy and decision-making ability of autonomous driving systems.

[0006] The objective of this invention can be achieved by adopting the following technical solution:

[0007] A zero-shot semantic segmentation map construction method combining a road visual language model and dynamic objects includes the following steps:

[0008] Step 1: Input the degraded surround-view image into the vector quantization-based image restoration model;

[0009] Step 2: Input the restored image into the road visual language model;

[0010] Step 3: Based on the visual feature map with semantic descriptions extracted above, and combined with LiDAR and millimeter-wave radar data, integrate multimodal information through a multimodal fusion strategy;

[0011] Step 4: Use a BEV encoder to solve the feature map alignment problem and extract temporal features;

[0012] Step 5: Fuse the associated features with the current frame's BEV features, and process them through a convolutional layer to obtain the moving target segmentation map.

[0013] Preferably, in step one, vector quantization and optimization techniques are used to effectively restore the image features under severe weather conditions to near-normal weather conditions.

[0014] In step two, the road visual language model semantically describes information about vehicles, sidewalks, and road boundaries on the road.

[0015] Preferably, in step one, the goal of image feature extraction is to extract important information from the input image that can reflect the core content of the image.

[0016] Let the input image be denoted $I$. The CNN gradually maps the original image into a high-dimensional feature space through several layers of transformations, extracting image features with convolution and pooling operations.

[0017] After the corresponding processing steps in the network, each image $I$ yields a feature vector $F \in \mathbb{R}^d$, where $d$ is the dimension of the feature space;

[0018] To construct the normal image codebook, the model is trained on a large number of images taken under normal weather conditions in order to learn their feature distribution. Assume a training set $D_{normal} = \{I_1, I_2, \ldots, I_N\}$ consisting of $N$ images taken under normal weather conditions.

[0019] After each image $I_i$ undergoes CNN feature extraction, a feature vector $F_i \in \mathbb{R}^d$ is obtained.

[0020] The feature vectors of all normal images are extracted and stacked into a feature matrix of size $N \times d$; each row of this matrix represents the features of one normal weather image;

[0021] Clustering these feature vectors yields a codebook consisting of $K$ cluster centers, where each cluster center $C_k$ represents a certain mode of image features under normal weather conditions;

[0022] The codebook is defined as the set $C = \{C_1, C_2, \ldots, C_K\}$, where each $C_k \in \mathbb{R}^d$;

[0023] Vector quantization maps an input vector to one of a finite set of vectors, and this finite set of vectors is called the codebook;

[0024] The input severe weather image is denoted $I_{bad}$; after feature extraction by the CNN, its feature vector $F_{bad} \in \mathbb{R}^d$ is obtained.

[0025] To recover the features of severe weather images, they need to be mapped into the feature space of normal weather images. The specific implementation process is achieved by using vector quantization.

[0026] The goal of vector quantization is to find, in the codebook, the cluster center $C_k$ closest to the feature vector $F_{bad}$ of the severe weather image. This process can be represented by the following formula:

[0027] $$\hat{F}_{bad} = \arg\min_{C_k \in C} \| F_{bad} - C_k \|_2$$

[0028] Here, $\hat{F}_{bad}$ is the feature vector obtained after vector quantization, and $\| F_{bad} - C_k \|_2$ is the Euclidean distance between $F_{bad}$ and a cluster center $C_k$ in the codebook. Selecting the cluster center $C_k$ with the smallest distance maps the features of the severe weather image into the feature space of normal weather images.
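The nearest-center lookup described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the function name `quantize` and the toy codebook values are assumptions made for the example, and the feature vector is taken as already extracted by the CNN.

```python
import numpy as np

def quantize(f_bad, codebook):
    """Return the index and value of the codeword nearest to f_bad.

    f_bad:    shape (d,), feature vector of a severe weather image.
    codebook: shape (K, d), one cluster center C_k per row.
    """
    # Euclidean distance from f_bad to every cluster center.
    distances = np.linalg.norm(codebook - f_bad, axis=1)
    k = int(np.argmin(distances))
    return k, codebook[k]

# Toy data (illustrative values only).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
f_bad = np.array([0.9, 0.7])
k, f_hat = quantize(f_bad, codebook)
```

In this toy case the degraded feature is replaced by the second codeword, the center nearest to it in Euclidean distance.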

[0029] Preferably, the design of the loss function needs to consider how to reduce the differences between the features of severe weather images and those of normal weather images;

[0030] The Euclidean distance is used as the distance loss function, defined as follows:

[0031] $$L_{dist} = \sum_{i} \| F_{bad}^{(i)} - \hat{F}_{bad}^{(i)} \|_2^2$$

[0032] where $F_{bad}^{(i)}$ is the feature vector of the i-th severe weather image, and $\hat{F}_{bad}^{(i)}$ is the recovered feature obtained by vector quantization;

[0033] The purpose of this loss function is to minimize the difference between the features of severe weather images and the recovered features.

[0034] A regularization term is added to keep the recovered features stable and consistent. With L2 regularization, the optimization objective can be expressed as:

[0035] $$L_{total} = L_{dist} + \lambda \|\theta\|_2^2$$

[0036] where $L_{dist}$ is the Euclidean distance loss, $\lambda$ is the regularization coefficient, $\theta$ denotes the model parameters, and $\|\theta\|_2$ is the L2 norm of the model parameters;
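As a rough numerical illustration of this objective, the sketch below evaluates a distance loss plus an L2 penalty. It assumes a squared Euclidean distance summed over images and a squared L2 norm of the parameters, which is a common choice but not a form the source spells out. The names `restoration_objective` and `lam` and all values are hypothetical.

```python
import numpy as np

def restoration_objective(f_bad, f_hat, theta, lam):
    """Distance loss plus L2 penalty (assumed squared forms).

    f_bad: (M, d) features of severe weather images.
    f_hat: (M, d) recovered features after vector quantization.
    theta: flat array of model parameters.
    lam:   regularization coefficient.
    """
    dist_loss = np.sum((f_bad - f_hat) ** 2)  # sum of squared Euclidean distances
    reg = lam * np.sum(theta ** 2)            # lam * ||theta||_2^2
    return dist_loss + reg

# Toy values (illustrative only).
f_bad = np.array([[0.9, 0.7], [3.0, 0.5]])
f_hat = np.array([[1.0, 1.0], [4.0, 0.0]])
theta = np.array([0.5, -0.5])
loss = restoration_objective(f_bad, f_hat, theta, lam=0.1)
```

Here the distance term contributes 1.35 and the penalty 0.05, so a larger `lam` shifts the optimum toward smaller parameter values at the cost of a looser feature match.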

[0037] After vector quantization and optimization by the image restoration model, the image features exhibited under extreme weather conditions gradually approach the image feature distribution $F_{normal}$ under normal weather conditions. The features obtained after the recovery operation can be represented in the following form:

[0038]

[0039] By using vector quantization and optimization techniques, image features under adverse weather conditions can be effectively restored to those under normal weather conditions, thereby improving the perception performance of autonomous driving under adverse weather conditions.

[0040] Preferably, step two includes inputting the restored image into the road visual language model, whereby the model begins to perform in-depth analysis of the road scene in the image and generates a semantic description;

[0041] This process first involves the identification and localization of various elements in the image, including vehicles, pedestrians, sidewalks, lane lines, traffic signs, and road boundaries;

[0042] The model extracts image features using convolutional neural network visual processing algorithms to identify the specific location and category of each object;

[0043] Next, the visual language model will generate corresponding semantic descriptions based on these visual features;

[0044] The key to visual language models lies in their deep learning architecture, which is able to understand the relationship between each element in an image and its surrounding environment;

[0045] By training on large-scale datasets, the model continuously learns how to extract useful information from images and transform this information into concise and accurate language descriptions.

[0046] Preferably, step three includes: by introducing semantic tags, the representational power of the visual feature map is enhanced;

[0047] Data fusion is achieved through a variety of strategies:

[0048] Feature-level fusion refers to combining data features from different sensors. Specifically, visual feature maps and radar point cloud data are typically processed jointly in a deep neural network.

[0049] Convolutional neural network models automatically select the influence of different sensor data through attention mechanisms or weighted averaging strategies, thereby determining how to process various types of information based on the specific characteristics of the scene;

[0050] Decision-level fusion involves merging the prediction results from different sensors after they have been processed independently. In this process, the perception modules for vision, LiDAR, and millimeter-wave radar each perform target detection, localization, and tracking, and their results are finally integrated into the final decision output.

[0051] The BEV encoder maps the 3D point cloud data of LiDAR, visual images, and distance information of millimeter-wave radar to the same coordinate system.

[0052] The BEV encoder plays a crucial role in the alignment of multimodal data.

[0053] Beneficial technical effects of the present invention:

[0054] This invention provides a zero-shot semantic segmentation map construction method that combines a road visual language model with dynamic objects. This method uses a vector-quantized image restoration model to recover features from degraded surround-view images under adverse weather conditions, thereby restoring image features to normal weather conditions. Then, the road visual language model is used to analyze the restored image, generating semantic descriptions and extracting information on key road elements (such as vehicles, pedestrians, lane lines, and traffic signs). Next, by combining LiDAR and millimeter-wave radar data, a multimodal fusion strategy is used to integrate visual and radar information. A BEV encoder is used to align the feature maps and extract temporal features. Finally, the fused features are combined with the current frame's BEV features and processed through a convolutional layer to generate a dynamic object segmentation map. This method effectively addresses image degradation under various adverse weather conditions and improves the detection accuracy of dynamic objects through semantic segmentation technology. By combining a visual language model with multimodal data fusion, it updates and accurately constructs high-precision vector maps required by autonomous driving systems in real time, significantly improving the system's perception capabilities and decision-making efficiency.

Attached Figure Description

[0055] Figure 1 is a flowchart of a preferred embodiment of the zero-shot semantic segmentation map construction method that combines a road visual language model with dynamic objects according to the present invention.

Detailed Implementation

[0056] To enable those skilled in the art to understand the technical solution of the present invention more clearly, the present invention will be further described in detail below with reference to the embodiments and accompanying drawings, but the embodiments of the present invention are not limited thereto.

[0057] The method includes:

[0058] Step S1: Input the degraded panoramic image into the image restoration model based on vector quantization. With the help of vector quantization and optimization techniques, the image features under severe weather conditions are effectively restored to the image features under normal weather conditions.

[0059] Step S2: Subsequently, the restored image is input into the road visual language model, which semantically describes information about vehicles, sidewalks, and road boundaries on the road.

[0060] Step S3: Based on the visual feature map with semantic description, combine the LiDAR and millimeter-wave radar data, integrate multimodal information through a multimodal fusion strategy, then use a BEV encoder to solve the feature map alignment problem and extract temporal features, and finally fuse the associated features with the current frame BEV features, and process them through a convolutional layer to obtain the moving target segmentation map.

[0061] Step S1 includes,

[0062] Severe weather conditions, such as rain, snow, and fog, not only reduce the perception capabilities of autonomous vehicle sensors (e.g., cameras, LiDAR) but also severely impact the accuracy of map construction. In such situations, conventional image restoration techniques are often only optimized for specific weather conditions and struggle to cope with different types of severe weather. To improve map construction accuracy, we propose a general image restoration model that aims to effectively recover image features under various severe weather conditions through vector quantization, enabling map construction to maintain high accuracy in complex environments.

[0063] In image feature extraction, the goal is to extract key information from the input image that reflects its core content. Traditional image processing methods often rely on manually designed algorithms, such as SIFT and HOG. However, with the continuous development of deep learning, Convolutional Neural Networks (CNNs) have gradually become the mainstream tool for image feature extraction. CNNs, through multi-layered convolutional operations, pooling processes, and fully connected layers, can autonomously learn and extract high-level features from images.

[0064] Let the input image be denoted $I$. The CNN uses several layers of transformations to gradually map the original image into a high-dimensional feature space, extracting image features with convolution and pooling operations. After passing through the network, each image $I$ yields a feature vector $F \in \mathbb{R}^d$, where $d$ is the dimension of the feature space.

[0065] To construct the normal image codebook, the model is trained on a large number of images taken under normal weather conditions in order to learn their feature distribution. Assume a training set $D_{normal} = \{I_1, I_2, \ldots, I_N\}$ consisting of $N$ images taken under normal weather conditions. Each image $I_i$, after CNN feature extraction, yields a feature vector $F_i \in \mathbb{R}^d$.

[0066] To construct a codebook for normal images, the feature vectors of all normal images are first extracted and stacked into a feature matrix of size $N \times d$. Each row of this matrix represents the features of one normal weather image.

[0067] Clustering these feature vectors yields a codebook consisting of $K$ cluster centers, where each cluster center $C_k$ represents a specific mode of image features under normal weather conditions. This codebook can be defined as the set $C = \{C_1, C_2, \ldots, C_K\}$, where each $C_k \in \mathbb{R}^d$.

[0068] Vector quantization (VQ) is a process that maps an input vector to one of a finite set of vectors, which is called the codebook. Here, VQ is applied to the task of feature recovery for severe weather images.

[0069] We denote the input severe weather image as $I_{bad}$; after feature extraction by the CNN, its feature vector $F_{bad} \in \mathbb{R}^d$ is obtained. To recover the features of severe weather images, they need to be mapped into the feature space of normal weather images. This is achieved through vector quantization.

[0070] The goal of vector quantization is to find, in the codebook, the cluster center $C_k$ closest to the feature vector $F_{bad}$ of the severe weather image. This process can be represented by the following formula:

$$\hat{F}_{bad} = \arg\min_{C_k \in C} \| F_{bad} - C_k \|_2$$

[0071] Here, $\hat{F}_{bad}$ is the feature vector obtained after vector quantization, and $\| F_{bad} - C_k \|_2$ is the Euclidean distance between $F_{bad}$ and a cluster center $C_k$ in the codebook. Selecting the cluster center $C_k$ with the smallest distance maps the features of the severe weather image into the feature space of normal weather images.

[0072] The optimization process of vector quantization: In order to make the features of severe weather images as close as possible to the feature distribution of normal weather images, we need to adjust the model parameters by optimizing the loss function in the vector quantization process. In detail, the design of the loss function needs to consider how to reduce the differences between the features of severe weather images and normal weather images.

[0073] We use the Euclidean distance as the distance loss function, and define it as:

$$L_{dist} = \sum_{i} \| F_{bad}^{(i)} - \hat{F}_{bad}^{(i)} \|_2^2$$

[0074] where $F_{bad}^{(i)}$ is the feature vector of the i-th severe weather image, and $\hat{F}_{bad}^{(i)}$ is the reconstructed feature obtained through vector quantization. The purpose of this loss function is to minimize the difference between the features of the severe weather image and the reconstructed features.

[0075] To make the recovery process more efficient, in addition to the Euclidean distance loss, a regularization term can be added to ensure that the recovered features maintain stability and consistency. We use L2 regularization, and the optimized objective function can be expressed as:

[0076] $$L_{total} = L_{dist} + \lambda \|\theta\|_2^2$$

[0077] Here, $L_{dist}$ is the Euclidean distance loss, $\lambda$ is the regularization coefficient, $\theta$ denotes the model parameters, and $\|\theta\|_2$ is the L2 norm of the model parameters.

[0078] After vector quantization and optimization by the image restoration model, the image features exhibited under extreme weather conditions gradually approach the image feature distribution $F_{normal}$ under normal weather conditions. The features obtained after the recovery operation can be represented in the following form:

[0079]

[0080] By using vector quantization and optimization techniques, we can effectively restore image features under adverse weather conditions to near-normal weather conditions, thereby improving the perception performance of autonomous driving under adverse weather conditions.

[0081] Step S2 includes,

[0082] After the restored image is input into the road visual language model, the model begins to perform in-depth analysis of the road scene in the image and generate semantic descriptions. This process first involves the identification and localization of various elements in the image, including vehicles, pedestrians, sidewalks, lane lines, traffic signs, and road boundaries. The model extracts image features using convolutional neural network (CNN) visual processing algorithms to identify the specific location and category of each object. For example, the model can detect a red car parked in the lane ahead, or a clearly defined sidewalk on one side of the road. Simultaneously, the model can also identify the presence of traffic signs and traffic lights, and mark their locations in the image.

[0083] Next, the visual language model generates corresponding semantic descriptions based on these visual features. For example, after identifying a vehicle, the model might generate a statement like "50 meters ahead, a red sedan is parked on the side of the road, the lane is 3 meters wide"; while for identifying road boundaries, the model might generate a description like "There is a clear lane line on the right side of the road." The model can not only describe static objects but also understand the spatial relationships between them, thus generating more detailed and accurate semantic information. For example, when identifying a car, the model might automatically infer its positional relationship, such as "This car is parked on the right side of the road, about 50 meters from the intersection," and reflect the relative positions of the objects in the description.

[0084] The key to visual language models lies in their deep learning architecture, which enables them to understand the relationship between each element in an image and its surrounding environment. Trained on large-scale datasets, the model continuously learns how to extract relevant information from images and transform it into concise and accurate linguistic descriptions. In this way, the model can convert complex visual information into natural language, providing autonomous driving systems with clear environmental awareness. These semantic descriptions not only help the system better understand road scenes but also provide fundamental support for path planning, obstacle avoidance, and driving decisions, thereby improving the safety and reliability of autonomous driving.

[0085] Ultimately, after the restored image is input into the visual language model, the system can generate a detailed semantic description based on various types of information in the image, providing accurate environmental perception for the autonomous driving system.

[0086] Step S3 includes:

[0087] The representational power of visual feature maps is enhanced by the introduction of semantic labels. For example, a red car in an image will be labeled as a "vehicle" after semantic segmentation. The model can determine that the object in the region is a vehicle through semantic information, and then infer its motion state and size features. For static objects, the model can further identify their spatial structure, such as "lane lines" or "traffic signs".

[0088] Multimodal data fusion is one of the core technologies for improving the perception accuracy of autonomous driving. In practical applications, vision, LiDAR, and millimeter-wave radar are each responsible for perceiving different physical features. How to effectively combine this information to generate a unified and accurate perception result is a technical challenge. Specifically, data fusion can be achieved through various strategies:

[0089] Feature-level fusion refers to combining data features from different sensors. Specifically, visual feature maps and radar point cloud data are typically processed jointly in a deep neural network. Visual data provides the model with information about the texture, color, and shape of objects, while LiDAR and millimeter-wave radar provide spatial features such as distance and velocity. Through feature-level fusion, multi-source data can be effectively integrated, improving the accuracy of the perception system.

[0090] To achieve feature-level fusion, convolutional neural networks (CNNs) are typically used. These networks learn the relationships between different modalities of data and integrate them within the same feature space. In this process, the model automatically selects the influence of different sensor data through attention mechanisms or weighted averaging strategies, thereby determining how to process various types of information based on the specific characteristics of the scene.
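The weighted-averaging variant of feature-level fusion can be sketched as follows. This is only a schematic under stated assumptions: the per-sensor feature maps are taken to be already aligned to the same shape, and the relevance `scores` are hypothetical inputs that a real network would predict with a learned attention head. The function names are illustrative, not from the source.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def fuse_features(features, scores):
    """Weighted average of per-sensor feature maps.

    features: dict name -> array, all of identical shape (C, H, W).
    scores:   dict name -> scalar relevance score (hypothetical; a real
              model would predict these from the scene).
    """
    names = sorted(features)
    weights = softmax(np.array([scores[n] for n in names]))
    fused = sum(w * features[n] for w, n in zip(weights, names))
    return fused, dict(zip(names, weights))

# Toy feature maps (illustrative values only).
camera = np.ones((2, 4, 4))
lidar = np.full((2, 4, 4), 3.0)
fused, weights = fuse_features({"camera": camera, "lidar": lidar},
                               {"camera": 0.0, "lidar": 0.0})
```

With equal scores each sensor receives weight 0.5. Raising the LiDAR score, for instance in fog where camera features are less reliable, would shift the fused map toward the LiDAR features.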

[0091] Decision-level fusion involves merging the predictions from different sensors after they have been processed independently. In this process, the perception modules for vision, LiDAR, and millimeter-wave radar each perform target detection, localization, and tracking, and their results are finally integrated into a final decision output. For example, the vision module detects a vehicle in an image, while the radar module provides the vehicle's speed and relative position; the model combines this information to generate a comprehensive judgment. This strategy is typically used to address situations where sensor data is incomplete or of poor quality.
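The vehicle example above can be illustrated with a simple association step. The sketch below assumes that both modules already report object positions in a shared ego frame and matches each vision detection to the nearest radar return within a gating distance. The data layout, the function name `fuse_decisions`, and the 2 m gate are hypothetical choices for illustration.

```python
import math

def fuse_decisions(vision_dets, radar_dets, gate_m=2.0):
    """Attach radar speed to each vision detection by nearest-neighbor matching.

    vision_dets: list of dicts with "label", "x", "y" (meters, ego frame).
    radar_dets:  list of dicts with "x", "y", "speed" (m/s).
    gate_m:      maximum association distance (hypothetical threshold).
    """
    fused = []
    for v in vision_dets:
        best, best_d = None, gate_m
        for r in radar_dets:
            d = math.hypot(v["x"] - r["x"], v["y"] - r["y"])
            if d <= best_d:
                best, best_d = r, d
        # Keep the vision result even when no radar return is close enough.
        fused.append({**v, "speed": best["speed"] if best else None})
    return fused

# Toy detections (illustrative values only).
vision = [{"label": "vehicle", "x": 10.0, "y": 1.0},
          {"label": "pedestrian", "x": 5.0, "y": -3.0}]
radar = [{"x": 10.5, "y": 1.2, "speed": 8.0}]
result = fuse_decisions(vision, radar)
```

The vehicle picks up the radar speed, while the pedestrian, which has no radar return within the gate, keeps its label with an unknown speed. This mirrors the case of incomplete sensor data mentioned above.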

[0092] Based on multimodal data fusion, BEV encoders are widely used in autonomous driving systems to address the spatial alignment issues of data from different modalities. The BEV view, captured from a top-down perspective, clearly displays the spatial layout of roads, lane lines, traffic signs, and the surrounding environment. It converts spatial information from different sensors into a unified planar coordinate system, helping to solve data alignment problems at different viewpoints and resolutions.

[0093] The primary task of a BEV encoder is to map 3D point cloud data from LiDAR, visual images, and distance information from millimeter-wave radar onto the same coordinate system. This process first projects the sensor data into the BEV space through geometric transformation, and then extracts features from the scene using convolutional layers or other deep learning methods. These features represent objects and environmental elements in the scene, such as road boundaries, lane lines, and the positions of obstacles.
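The geometric projection step can be sketched for the LiDAR case as a rasterization of 3D points into a top-down grid. This is a simplified illustration: the grid extents, the 0.5 m cell size, and the function name `points_to_bev` are assumptions, and the learned feature extraction that follows the projection is omitted.

```python
import numpy as np

def points_to_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Rasterize 3D points (N, 3) in the ego frame into a BEV occupancy grid.

    Grid extents and cell size are illustrative assumptions.
    """
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    grid = np.zeros((nx, ny), dtype=np.int32)
    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / cell).astype(int)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    # Count points per cell; height (z) is dropped in this simple sketch.
    np.add.at(grid, (ix[inside], iy[inside]), 1)
    return grid

# Two nearby points and one outside the grid (illustrative values only).
points = np.array([[10.2, 0.1, 0.5], [10.3, 0.2, 1.0], [100.0, 0.0, 0.0]])
bev = points_to_bev(points)
```

Camera and millimeter-wave radar data would need their own projections into the same grid before the fused BEV features are computed. `np.add.at` is used so that several points falling into one cell are all counted.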

[0094] The BEV encoder plays a crucial role in the alignment of multimodal data. It unifies the spatial relationships of different modal data into a shared coordinate system, enabling the model to process these data from a single viewpoint, thereby reducing errors caused by differences in sensor perspective and resolution. This process not only improves the accuracy of perception but also enhances the robustness of the model.

[0095] In autonomous driving, the extraction of temporal features is crucial for the detection and tracking of dynamic objects. As time progresses, the positions and states of objects in a scene change, necessitating the extraction of temporal features from multiple time frames to better understand their motion trajectories and behavioral patterns. BEV encoders can process historical frame data using time-series information to extract time-related features, thereby helping the model predict the future positions and motion states of dynamic objects.

[0096] Temporal feature extraction is typically achieved using Temporal Convolutional Networks (TCNs). These networks capture temporal dependencies in time-series data, thereby inferring the motion trends of objects. In autonomous driving applications, temporal feature extraction not only helps detect static objects but also identifies the trajectories of dynamic objects, thus predicting their future motion. By effectively processing temporal features, autonomous driving systems can react in advance, improving system responsiveness and safety.
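The core operation of a temporal convolutional network is a causal convolution along the time axis. The sketch below shows a single such layer applied to per-frame feature vectors. It is a building block only, without the dilation and residual connections of a full TCN, and the function name and the difference kernel are assumptions chosen for illustration.

```python
import numpy as np

def causal_temporal_conv(frames, kernel):
    """Causal 1D convolution along the time axis.

    frames: (T, D) array, one D-dimensional BEV feature vector per frame.
    kernel: (k,) temporal weights; kernel[-1] multiplies the current frame.
    The output at time t depends only on frames t-k+1 .. t
    (zero-padded at the start), never on future frames.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros((k - 1, frames.shape[1])), frames], axis=0)
    out = np.zeros_like(frames, dtype=float)
    for t in range(frames.shape[0]):
        window = padded[t:t + k]   # frames t-k+1 .. t
        out[t] = kernel @ window
    return out

# A feature that grows by 1 per frame (illustrative values only).
frames = np.array([[0.0], [1.0], [2.0], [3.0]])
diff_kernel = np.array([-1.0, 1.0])  # current frame minus previous frame
motion = causal_temporal_conv(frames, diff_kernel)
```

With the difference kernel, the output is the per-frame change of the feature, which is a crude proxy for motion. In a trained network the kernel weights are learned rather than fixed.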

[0097] After fusing multimodal data and extracting temporal features, the next task is to generate a moving target segmentation map. A moving target segmentation map distinguishes moving targets (such as moving vehicles and pedestrians) from the static background and accurately marks the contours and positions of these moving targets. To achieve this, the system uses a U-Net network architecture for image segmentation.

[0098] When generating a moving object segmentation map, the system first extracts features from the feature map obtained by the BEV encoder. Then, it processes the feature map through a convolutional layer to generate a binary segmentation map, where the moving object region is labeled as "foreground" and other regions are labeled as "background". In this way, the model can accurately segment dynamic objects in the scene and generate a final semantic segmentation map containing the dynamic objects.
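The final step, turning a feature map into a foreground/background mask, can be sketched as a 1x1 convolution followed by a sigmoid and a threshold. This stands in for only the last layer of a segmentation network, not the U-Net itself. The weights, the 0.5 threshold, and the function name `segment_moving` are hypothetical.

```python
import numpy as np

def segment_moving(bev_features, weights, bias=0.0, threshold=0.5):
    """1x1 convolution over channels, sigmoid, then threshold.

    bev_features: (C, H, W) fused BEV feature map.
    weights:      (C,) per-channel weights of the 1x1 convolution.
    Returns an (H, W) uint8 mask: 1 = foreground (moving), 0 = background.
    """
    logits = np.tensordot(weights, bev_features, axes=([0], [0])) + bias
    prob = 1.0 / (1.0 + np.exp(-logits))
    return (prob > threshold).astype(np.uint8)

# Toy feature map with one strong response (illustrative values only).
features = np.zeros((2, 3, 3))
features[0, 1, 1] = 2.0   # strong "motion" response at the center cell
features[1] = -1.0        # uniform background channel
mask = segment_moving(features, weights=np.array([1.0, 0.5]))
```

Only the center cell is labeled foreground here. In practice the weights would be learned jointly with the rest of the network, and the threshold trades missed moving objects against false alarms.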

[0099] I. Overview of Application Scenarios

[0100] In urban autonomous driving scenarios, autonomous vehicles need to perceive surrounding objects and build high-precision maps in real time in complex and ever-changing road environments. Especially under adverse weather conditions, such as rain, snow, and fog, sensor performance is limited, image quality degrades, leading to reduced perception accuracy and impacting the accuracy of map construction. This embodiment combines a road vision language model with a zero-shot semantic segmentation map construction method for dynamic objects in urban autonomous driving scenarios to address image degradation issues under adverse weather conditions, improve the accuracy of dynamic object detection, and achieve real-time updates and accurate construction of high-precision vector maps.

[0101] Implementation steps

[0102] Construction and training of image restoration models;

[0103] Collect a large number of urban road images under normal weather conditions to construct a training set. The images should cover urban road scenes at different times of day under normal weather (such as sunny days), including various vehicles, pedestrians, lane lines, and traffic sign elements.

[0104] Convolutional Neural Networks (CNNs) are used to extract features from the training set images. Each image is input into the CNN, and after convolution and pooling operations, a high-dimensional feature vector is generated. Assuming the input image is I, the feature vector V is obtained after processing by the CNN.

[0105] Cluster the feature vectors of all normal weather images to construct a codebook for normal images. Determine the number of cluster centers K, and obtain K cluster centers using a clustering algorithm (such as K-Means). Each cluster center represents a mode of normal weather image features, forming a codebook C.
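Codebook construction with K-Means can be sketched with a plain NumPy version of Lloyd's algorithm. This is an illustrative sketch: the feature vectors are assumed to be already extracted by the CNN, and the function name `build_codebook`, the iteration count, the seed, and the toy data are all assumptions.

```python
import numpy as np

def build_codebook(features, k, iters=20, seed=0):
    """Lloyd's K-Means: returns a (k, d) codebook of cluster centers.

    features: (N, d) float array of normal weather feature vectors.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct training features.
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each feature to its nearest center.
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its members (keep it if empty).
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

# Two well-separated groups of 2D features (illustrative values only).
features = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers = build_codebook(features, k=2)
```

On this toy data the two centers settle on the means of the two groups. For real image features, K would be chosen much larger and a library implementation with better initialization would normally be preferred.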

[0106] A vector quantization (VQ) model is constructed. For severe weather images, feature vectors V' are obtained through CNN feature extraction. VQ is then used to map V' onto the codebook C, and the cluster center C_i closest to V' is found to achieve the mapping of severe weather image features to the feature space of normal weather images.

[0107] The loss function consists of a Euclidean distance term and an L2 regularization term. The Euclidean distance term measures the difference between the features of the severe weather image and the restored features, while the L2 regularization term, applied to the model parameters, ensures the stability and consistency of the restored features. By optimizing the loss function and adjusting the model parameters, the features of the severe weather image are made to approximate the feature distribution of the normal weather image as closely as possible. The image restoration model is trained until it achieves satisfactory image restoration results on the validation set.
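As a rough sketch of such an objective, the function below adds a distance term to an L2 penalty on the parameters. Summing squared distances over images, the coefficient name `lam`, and its value are assumptions made for illustration; the description does not fix these details.

```python
import numpy as np

def restoration_loss(bad_feats, restored_feats, params, lam=1e-3):
    """Distance term plus L2 regularisation (assumed form, see lead-in)."""
    # Squared Euclidean distance per image, summed over the batch.
    dist_term = float(np.sum((bad_feats - restored_feats) ** 2))
    # lam * ||theta||_2^2 over all parameter arrays.
    reg_term = lam * sum(float(np.sum(p ** 2)) for p in params)
    return dist_term + reg_term

bad = np.array([[1.0, 2.0], [3.0, 4.0]])        # toy severe-weather features
restored = np.array([[1.0, 1.0], [3.0, 2.0]])   # toy restored features
theta = [np.array([1.0, 2.0])]                  # toy model parameters
loss = restoration_loss(bad, restored, theta, lam=0.1)  # 5.0 + 0.1 * 5.0
```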

[0108] (II) Training and application of the road visual language model

[0109] Collect an image dataset containing urban road scenes, including vehicles, pedestrians, sidewalks, lane lines, traffic signs, and road boundary elements, and annotate these elements.

[0110] The CNN vision processing algorithm is used to extract features from the image and identify the specific location and category of each object. For example, it can detect a red car parked in the lane ahead, or a clear pedestrian crossing on the side of the road, while also identifying the location of traffic signs and traffic lights.

[0111] Construct a road visual language model based on a deep learning architecture, such as Transformer or its variants. Input extracted visual features into the model, and the model learns how to convert visual information into natural language descriptions. Train the model on a large-scale dataset, adjusting its parameters to generate accurate and concise semantic descriptions.

[0112] The restored image is input into the pre-trained road visual language model. The model performs in-depth analysis of the road scene in the image and generates semantic descriptions. For example, after identifying a vehicle, it generates the statement "50 meters ahead, a red sedan is parked on the side of the road, the lane is 3 meters wide"; after identifying the road boundary, it generates the description "There is a clear lane line on the right side of the road". The model not only describes static objects but also captures the spatial relationships between them, such as "This car is parked on the right side of the road, about 50 meters from the intersection", providing clear environmental awareness for autonomous driving systems.

[0113] (III) Application of Multimodal Data Fusion and BEV Encoder

[0114] Collect LiDAR and millimeter-wave radar data. LiDAR provides 3D point cloud information of the environment, including the distance, height, and density of objects; millimeter-wave radar provides the distance, velocity, and angle information of objects. These data are time-synchronized with the restored image data to ensure that the multimodal data for a given scene correspond to the same timestamp.
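One simple way to realize this time synchronization is nearest-timestamp matching with a tolerance, sketched below. The function name `nearest_match`, the `max_gap` tolerance, and the sample timestamps are illustrative assumptions; the description does not specify the synchronization algorithm.

```python
from bisect import bisect_left

def nearest_match(image_ts, sensor_ts, max_gap):
    """For each image timestamp, return the index of the nearest sensor
    timestamp, or None if the nearest one is more than max_gap seconds away.
    sensor_ts must be sorted in ascending order."""
    matches = []
    for t in image_ts:
        pos = bisect_left(sensor_ts, t)
        # Candidates: the neighbours on each side of the insertion point.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(sensor_ts)]
        best = min(candidates, key=lambda i: abs(sensor_ts[i] - t), default=None)
        if best is not None and abs(sensor_ts[best] - t) <= max_gap:
            matches.append(best)
        else:
            matches.append(None)
    return matches

image_ts = [0.00, 0.10, 0.20, 0.50]   # seconds
lidar_ts = [0.01, 0.11, 0.19, 0.31]
pairs = nearest_match(image_ts, lidar_ts, max_gap=0.05)
```

Frames with no partner inside the tolerance (the last image here) are left unmatched rather than paired with stale data.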

[0115] Feature-level fusion is performed. Visual feature maps and radar point cloud data are input into a convolutional neural network (CNN). The network learns the relationships between the different data modalities and maps them into the same feature space. Through attention mechanisms or weighted averaging strategies, the network automatically weights the contribution of each sensor's data according to the characteristics of the scene. For example, in congested scenes with a large number of vehicles, the network may assign higher weights to radar data in order to use the vehicle speed and distance information it provides, while in intersection scenes with dense traffic signs, more emphasis may be placed on the sign recognition results in the visual data.
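The weighted-averaging variant of this fusion can be illustrated with softmax attention weights over modalities. In this sketch the relevance scores are supplied by hand, whereas in the described method they would be produced by the network; the names `fuse`, `camera_feat`, and `radar_feat` are hypothetical.

```python
import numpy as np

def fuse(features, scores):
    """Fuse per-modality feature vectors with softmax attention weights.

    features: (M, d) array, one row per modality, already in a shared space.
    scores:   (M,) relevance scores (hand-set here, learned in practice).
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())   # subtract max for stability
    weights /= weights.sum()
    fused = weights @ features                # weighted average over modalities
    return fused, weights

camera_feat = np.array([1.0, 0.0, 2.0])
radar_feat = np.array([0.0, 4.0, 2.0])
# A higher score for radar mimics the congested-traffic case described above.
fused, weights = fuse(np.stack([camera_feat, radar_feat]),
                      scores=[0.0, np.log(3.0)])
```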

[0116] A BEV encoder is constructed. 3D point cloud data from LiDAR, visual images, and distance information from millimeter-wave radar are projected into the BEV space through geometric transformation. Within the BEV space, convolutional layers or other deep learning methods are used to extract scene features representing objects and environmental elements such as road boundaries, lane lines, and obstacle positions. The BEV encoder addresses the spatial alignment problem of different modal data, unifying the spatial relationships of multimodal data into a shared coordinate system and reducing errors caused by differences in sensor viewpoint and resolution.
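The geometric projection into BEV space can be sketched for the LiDAR branch as a rasterization of points onto a ground-plane grid. The grid extent, the 0.5 m cell size, the ego-frame axis convention, and the choice of maximum height as the cell value are all assumptions for illustration; camera and radar branches would need their own projections.

```python
import numpy as np

def points_to_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Rasterise (N, 3) ego-frame points (x forward, y left, z up) into a BEV
    grid holding the maximum height per cell; empty cells stay at -inf."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.full((nx, ny), -np.inf)
    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / cell).astype(int)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    # np.maximum.at handles several points landing in the same cell.
    np.maximum.at(bev, (ix[inside], iy[inside]), points[inside, 2])
    return bev

pts = np.array([
    [10.2, 0.1, 0.5],
    [10.3, 0.2, 1.5],    # same cell as the first point, taller
    [100.0, 0.0, 1.0],   # outside the grid, dropped
])
bev = points_to_bev(pts)
```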

[0117] Temporal features are extracted. A Temporal Convolutional Network (TCN) is used to process historical frame data, capturing temporal dependencies in the time-series data and inferring the motion trends of objects. The TCN extracts temporally relevant features from multiple time frames, helping the model predict the future position and motion state of dynamic objects. For example, for a moving vehicle, the TCN predicts its next position based on its position changes over the past few frames, enabling the autonomous driving system to react in advance and improving response speed and safety.
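The building block of a TCN is the causal (optionally dilated) 1-D convolution, in which the output at time t depends only on the current and past frames. The sketch below shows a single such layer with a hand-picked kernel that performs constant-velocity extrapolation. A real TCN stacks many layers with learned weights, so the kernel values here are purely illustrative.

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation=1):
    """1-D causal convolution: out[t] uses only x[t], x[t-d], x[t-2d], ...

    x: (T,) sequence; kernel: (k,) weights, kernel[0] applies to the current step.
    """
    T, k = len(x), len(kernel)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])  # left-pad: no future frames used
    out = np.zeros(T)
    for t in range(T):
        for j in range(k):
            out[t] += kernel[j] * padded[pad + t - j * dilation]
    return out

# Positions of a vehicle moving 1 m per frame; 2*x[t] - x[t-1] extrapolates
# the next position under a constant-velocity assumption.
positions = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
predicted_next = causal_dilated_conv(positions, kernel=np.array([2.0, -1.0]))

# With dilation 2 the layer looks two frames back instead of one.
dilated = causal_dilated_conv(np.array([1.0, 2.0, 3.0, 4.0]),
                              kernel=np.array([1.0, 1.0]), dilation=2)
```

The last entry of `predicted_next` is the extrapolated position for the frame after the final observation.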

[0118] (IV) Generation of the moving target segmentation map

[0119] The feature map obtained from the BEV encoder is input into the U-Net network architecture. The U-Net network processes the feature map through convolutional layers, pooling layers, and upsampling layers.

[0120] In the U-Net network, feature maps undergo convolution operations to extract deeper features, while pooling operations reduce the size and computational cost of the feature maps, preserving important features. Upsampling operations restore the feature maps to a size close to the original image for pixel-level classification.
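The size changes caused by pooling and upsampling can be seen on a toy feature map. This sketch shows only the 2x2 max pooling and nearest-neighbor upsampling steps, without any learned convolutions; the helper names are hypothetical, and the input height and width are assumed to be even.

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling on an (H, W) map with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2 in both directions."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

feature_map = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2x2(feature_map)   # (2, 2): strongest response per 2x2 block
restored = upsample2x(pooled)       # back to (4, 4) for pixel-level output
```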

[0121] The final result is a binary segmentation map, where moving object regions are labeled as "foreground" and other regions are labeled as "background". For example, moving vehicles and pedestrians are accurately segmented, while static lane lines and traffic signs are labeled as background.
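A binary map of this kind is commonly obtained by passing the network's per-pixel output through a sigmoid and thresholding it. The sketch below assumes that setup: the 0.5 threshold and the name `to_binary_mask` are illustrative and not stated in the description.

```python
import numpy as np

def to_binary_mask(logits, threshold=0.5):
    """Map per-pixel logits to 1 (foreground, moving) or 0 (background)."""
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid
    return (probs >= threshold).astype(np.uint8)

logits = np.array([[-2.0, 0.2],
                   [3.0, -0.5]])
mask = to_binary_mask(logits)
```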

[0122] By combining the generated moving object segmentation map with semantic descriptions, a semantic segmentation map containing dynamic objects is constructed. This map provides fundamental support for path planning, obstacle avoidance, and driving decisions in autonomous driving systems, improving the system's safety and reliability.

[0123] Implementation effect evaluation

[0124] Field tests were conducted on urban autonomous driving scenarios under various weather conditions, including sunny, rainy, snowy, and foggy weather. Several representative road sections were selected, including straight roads, curves, intersections, school zones, and commercial areas, to evaluate the performance of map building methods in various scenarios.

[0125] The evaluation metrics are the segmentation accuracy of the semantic segmentation map (for example, for lane lines and traffic signs) and the intersection over union (IOU), that is, the overlap ratio between the predicted region and the ground-truth region.

[0126] Comparative experiments were conducted against traditional map building methods to verify the proposed method's ability to recover from image degradation under adverse weather conditions and its improved accuracy in detecting dynamic objects. By comparing the test results, the advantages and disadvantages of the proposed method were analyzed, providing a basis for further optimization.

[0127] The formula for calculating the segmentation accuracy index is:

[0128] $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

[0129] where:

[0130] TP: True Positive, which refers to pixels that are correctly predicted as the target region.

[0131] TN: True Negative, which refers to pixels that are correctly predicted as background.

[0132] FP: False Positive, which refers to background pixels that are incorrectly predicted as the target region.

[0133] FN: False Negative, which refers to the target region pixels that are incorrectly predicted as background.

[0134] The formula for calculating IOU is:

[0135] $$IOU = \frac{|A \cap B|}{|A \cup B|}$$

[0136] where A is the predicted region;

[0137] B is the ground-truth labeled region;

[0138] |A∩B| is the number of pixels in the intersection of the predicted region and the ground-truth labeled region, and |A∪B| is the number of pixels in the union of the predicted region and the ground-truth labeled region.
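Both indices can be computed from a pair of binary masks as sketched below. The accuracy is taken here to be the standard pixel accuracy (TP + TN) / (TP + TN + FP + FN), which the four counts defined above suggest but which is an assumption on our part. Returning an IOU of 1.0 when both masks are empty is likewise an illustrative convention.

```python
import numpy as np

def pixel_accuracy_and_iou(pred, target):
    """Return (accuracy, IOU) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))     # target pixels predicted as target
    tn = int(np.sum(~pred & ~target))   # background predicted as background
    fp = int(np.sum(pred & ~target))    # background predicted as target
    fn = int(np.sum(~pred & target))    # target predicted as background
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    union = tp + fp + fn                # |A ∪ B|
    iou = tp / union if union > 0 else 1.0
    return accuracy, iou

pred = [[1, 1, 0, 0],
        [0, 1, 0, 0]]
target = [[1, 0, 0, 0],
          [0, 1, 1, 0]]
acc, iou = pixel_accuracy_and_iou(pred, target)   # acc = 6/8, iou = 2/4
```

Because background pixels usually dominate, accuracy tends to be higher than IOU for the same prediction.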

[0139] Based on the above indicators, experiments were conducted to compare the baseline method and the method of this patent, and the final experimental data are as follows:

[0140]

| Method | Segmentation accuracy | IOU |
| --- | --- | --- |
| MoSeg | 49.43 | 26.0 |
| SimpleBEV_Motion | 73.79 | 60.24 |
| This patent model | 75.35 | 62.59 |

[0141] The above description is merely a further embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any substitutions or modifications made by those skilled in the art within the scope disclosed in the present invention, based on the technical solution and concept of the present invention, shall fall within the scope of protection of the present invention.

Claims

1. A method for constructing a zero-shot semantic segmentation map by combining a road visual language model with dynamic objects, characterized in that it comprises the following steps: Step 1: input the degraded panoramic image into the vector quantization-based image restoration model; Step 2: input the restored image into the road visual language model; Step 3: based on the visual feature map with semantic descriptions extracted above, and combined with LiDAR and millimeter-wave radar data, integrate multimodal information through a multimodal fusion strategy; Step 4: use a BEV encoder to solve the feature map alignment problem and extract temporal features; Step 5: fuse the associated features with the current frame's BEV features, and process them through a convolutional layer to obtain the moving target segmentation map.

2. The method for constructing a zero-shot semantic segmentation map combining a road visual language model and dynamic objects according to claim 1, characterized in that: in step one, vector quantization and optimization techniques are used to restore the image features under severe weather conditions to those under normal weather conditions; in step two, the road visual language model semantically describes information about vehicles, sidewalks, and road boundaries on the road.

3. The method for constructing a zero-shot semantic segmentation map combining a road visual language model and dynamic objects according to claim 1, characterized in that: in step one, the goal of image feature extraction is to extract from the input image the important information that reflects its core content. Denote the input image as $I$. The CNN gradually maps the original image into a high-dimensional feature space through several layers of transformation, extracting image features by convolution and pooling operations. Each image $I$ yields a feature vector $F \in \mathbb{R}^{d}$ after processing by the network, where $d$ is the dimension of the feature space; in constructing the normal image codebook, a large number of images under normal weather conditions are used for training in order to learn the feature distribution of images under normal weather conditions. Assume a training set $D_{normal} = \{I_1, I_2, \ldots, I_N\}$ composed of $N$ images under normal weather conditions; each image $I_i$ yields a feature vector $F_i \in \mathbb{R}^{d}$ after CNN feature extraction. The feature vectors of all normal images are extracted to construct a feature matrix $F_{normal} = [F_1, F_2, \ldots, F_N]^{T} \in \mathbb{R}^{N \times d}$, each row of which represents the feature of one normal weather image; clustering these feature vectors yields a codebook of $K$ cluster centers, each cluster center $C_k$ representing a mode of image features under normal weather conditions. The codebook is defined as the set $C = \{C_1, C_2, \ldots, C_K\}$ with each $C_k \in \mathbb{R}^{d}$; vector quantization maps an input vector onto this finite set of vectors, which is called the codebook; denote the input severe weather image as $I_{bad}$; after CNN feature extraction, its feature vector $F_{bad} \in \mathbb{R}^{d}$ is obtained. To recover the features of the severe weather image, they need to be mapped into the feature space of normal weather images. This is implemented by vector quantization, whose goal is to find the cluster center $C_k$ in the codebook that is closest to the feature vector $F_{bad}$: $\hat{F}_{bad} = \arg\min_{C_k \in C} \lVert F_{bad} - C_k \rVert$. Here $\hat{F}_{bad}$ is the feature vector obtained after vector quantization, and $\lVert F_{bad} - C_k \rVert$ is the Euclidean distance between $F_{bad}$ and a cluster center $C_k$ in the codebook; selecting the cluster center $C_k$ with the smallest distance achieves the mapping of the severe weather image features.

4. The method for constructing a zero-shot semantic segmentation map combining a road visual language model and dynamic objects according to claim 3, characterized in that: the loss function is designed to reduce the difference between the features of severe weather images and those of normal weather images; the Euclidean distance is used as the distance loss function, defined as: $L_{dist} = \sum_{i} \lVert F_{bad}^{i} - \hat{F}_{bad}^{i} \rVert^{2}$, where $F_{bad}^{i}$ is the feature vector of the i-th severe weather image and $\hat{F}_{bad}^{i}$ is the corresponding recovered feature obtained by vector quantization; this loss function minimizes the difference between the features of severe weather images and the recovered features. An L2 regularization term is added to ensure the stability and consistency of the recovered features, so that the optimization objective can be expressed as: $L = L_{dist} + \lambda \lVert \theta \rVert_{2}^{2}$, where $\lambda$ is the regularization coefficient, $\theta$ denotes the model parameters, and $\lVert \theta \rVert_{2}$ is the L2 norm of the model parameters; through vector quantization and optimization in the image restoration model, the image features under severe weather conditions gradually approach the image feature distribution $F_{normal}$ under normal weather conditions. The features obtained after the restoration operation can be represented as follows: By using vector quantization and optimization techniques, image features under adverse weather conditions can be effectively restored to those under normal weather conditions, thereby improving the perception performance of autonomous driving under adverse weather conditions.

5. The method for constructing a zero-shot semantic segmentation map combining a road visual language model and dynamic objects according to claim 1, characterized in that: step two comprises inputting the restored image into the road visual language model, which performs in-depth analysis of the road scene in the image and generates a semantic description. This process first identifies and localizes the elements in the image, including vehicles, pedestrians, sidewalks, lane lines, traffic signs, and road boundaries; the model extracts image features using a convolutional neural network visual processing algorithm to identify the specific location and category of each object; the visual language model then generates corresponding semantic descriptions based on these visual features; the key to the visual language model lies in its deep learning architecture, which is able to understand the relationship between each element in the image and its surrounding environment; by training on large-scale datasets, the model learns to extract useful information from images and to transform this information into concise and accurate language descriptions.

6. The method for constructing a zero-shot semantic segmentation map combining a road visual language model and dynamic objects according to claim 1, characterized in that: step three comprises: introducing semantic labels to enhance the representational power of the visual feature map; data fusion is achieved through the following strategies: feature-level fusion combines data features from different sensors. Specifically, visual feature maps and radar point cloud data are processed together in a deep neural network, and the convolutional neural network model automatically weights the contribution of each sensor's data through attention mechanisms or weighted averaging strategies, thereby determining how to process each type of information based on the specific characteristics of the scene; decision-level fusion merges the prediction results output after the data of each sensor have been processed independently. In this process, the vision, LiDAR, and millimeter-wave radar perception modules each perform target detection, localization, and tracking, and their results are then integrated into the final decision output. The BEV encoder maps the 3D point cloud data of the LiDAR, the visual images, and the distance information of the millimeter-wave radar into the same coordinate system, and thus plays a crucial role in the alignment of multimodal data.