Crowdsourced visual image based online construction method of autonomous driving vector map
This method constructs vector maps for autonomous driving online from crowdsourced visual images. It uses map feature detection, feature encoding, and relative pose estimation networks to generate high-precision local vector maps. This addresses the high cost and slow updating of high-precision map construction and achieves low-cost, comprehensive map construction that supports real-time updates for autonomous driving.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TSINGHUA UNIVERSITY
- Filing Date
- 2023-05-15
- Publication Date
- 2026-06-12
AI Technical Summary
Existing high-precision autonomous driving maps are costly to build and slow to update, failing to meet the need for real-time updates. The limited field of vision of a single vehicle also prevents it from providing comprehensive environmental map information, thus hindering the development of autonomous driving.
Visual images of the current target vehicle and adjacent source vehicles are acquired, and map features are extracted with a map feature detection module. A feature encoder and a relative pose estimation network are then used to generate the relative pose transformation result, and a vectorized map is constructed using a viewpoint transformation module and a feature fusion module. The scheme is end-to-end and uses connected-vehicle technology and a pure vision solution to reduce costs.
It enables low-cost, high-precision online construction of vector maps for autonomous driving, breaking through the limits of single-vehicle perception. The map construction is more comprehensive and accurate, supporting the real-time update needs of autonomous vehicles.
Figure CN116753936B_ABST
Description
Technical Field
[0001] This invention relates to the field of map building technology, and in particular to an online method for constructing vector maps for autonomous driving based on crowdsourced visual images.
Background Technology
[0002] Because autonomous vehicles struggle to consistently provide comprehensive and accurate environmental information in complex real-world scenarios, high-precision maps are essential for advanced autonomous driving tasks. Autonomous driving maps, containing lane lines, traffic signs, and other map elements, help compensate for the limitations of autonomous vehicles' perception capabilities and improve the performance of downstream decision-making and planning tasks. To store map element information efficiently, map elements in high-precision autonomous driving maps are generally in vector format. However, high-precision maps in the industry are currently generally built by mapping the environment with specialized data collection vehicles, which makes data collection difficult and updates slow, thus hindering the development of advanced autonomous driving.
[0003] The difficulty in building high-precision maps lies in several aspects: First, specialized vehicles equipped with high-precision inertial navigation systems, LiDAR, and visual cameras collect environmental data to create point cloud maps. These maps are then processed through map element segmentation and detection, clustering, manual verification, and annotation to generate vector maps. This entire process requires significant manpower and resources, and only a small number of qualified cartographic manufacturers can produce high-precision maps. Second, the slow update speed of high-precision maps is problematic. Because map creation is resource- and time-consuming, maps struggle to keep up with changes in the real-world environment. Currently, the mainstream map update strategy involves periodically re-mapping the environment, identifying and verifying changes between the old and new maps, and updating the changed portions. This strategy is clearly insufficient for real-time map updates. If updates are delayed, conflicts between the perception results of autonomous vehicles and the map can interfere with localization and decision-making. Furthermore, the limited field of view of a single vehicle during operation prevents it from providing comprehensive environmental map information, hindering the development of autonomous driving.
Summary of the Invention
[0004] This invention provides an online method for constructing vector maps for autonomous driving based on crowdsourced visual images, which solves the problems of high cost and poor accuracy in existing autonomous driving map construction.
[0005] This invention provides an online method for constructing vector maps for autonomous driving based on crowdsourced visual images, comprising:
[0006] The system acquires visual images of the current target vehicle and adjacent source vehicles, and extracts map element detection results from the visual images using a preset map element detection module.
[0007] The map element detection results are input into a preset feature encoder to extract the visual features of each vehicle from its forward view.
[0008] The map feature detection results are input into a pre-trained relative pose estimation network to generate the relative pose transformation results between the target vehicle and each source vehicle.
[0009] The forward visual features of the target vehicle are projected to the bird's-eye view through the preset view transformation module to obtain the first bird's-eye view visual features of the target vehicle. Based on the relative pose transformation result, the forward visual features of the source vehicle are projected to the bird's-eye view of the target vehicle to obtain the second bird's-eye view visual features of the source vehicle in the coordinates of the target vehicle.
[0010] All bird's-eye view visual features are input into the feature fusion module for feature fusion to obtain fused bird's-eye view features. The fused bird's-eye view features are then input into the map building task-related modules and vectorized to obtain a vectorized map.
[0011] According to the present invention, an online method for constructing vector maps for autonomous driving based on crowdsourced visual images includes acquiring visual images collected by the current target vehicle and adjacent source vehicles, and extracting map feature detection results from the visual images through a preset map feature detection module. Specifically, this includes:
[0012] The system acquires the visual image and current location information of the target vehicle captured by a monocular camera, and extracts the map element detection results of the target vehicle from the visual image through the map element detection module.
[0013] The system acquires visual images and current location information of neighboring vehicles captured by a monocular camera, and extracts the map element detection results of the source vehicles from the visual images using a map element detection module.
[0014] The map feature detection results of the target vehicle and the source vehicle are both uploaded to the cloud platform, and each vehicle can obtain the map feature detection results of other vehicles from the cloud platform.
[0015] According to the present invention, an online method for constructing vector maps for autonomous driving based on crowdsourced visual images is provided, wherein the map feature detection results are input into a preset feature encoder to extract visual features from the forward-looking perspective of each vehicle, specifically including:
[0016] The map feature detection results of both the target vehicle and the source vehicle are input into the feature encoder;
[0017] The feature encoder extracts the visual features from the target vehicle's forward view and the source vehicle's forward view.
[0018] According to the present invention, an online vector map construction method for autonomous driving based on crowdsourced visual images is provided, wherein the map feature detection results are input into a pre-trained relative pose estimation network to generate relative pose transformation results between the target vehicle and each source vehicle, specifically including:
[0019] The relative pose estimation network takes the map feature detection results as input and generates a transformation matrix between the two coordinate systems of the target vehicle and the source vehicle.
[0020] The visual image is input into a preset monocular depth estimation network to generate a corresponding depth map, thus completing the viewpoint conversion between visual images;
[0021] The relative pose estimation network and the monocular depth estimation network are trained under joint supervision, including depth scale supervision from laser point cloud data, direct pose supervision from absolute pose data, appearance consistency supervision, and smoothness supervision.
[0022] According to the present invention, an online method for constructing vector maps for autonomous driving based on crowdsourced visual images involves inputting all bird's-eye view visual features into a feature fusion module for feature fusion to obtain fused bird's-eye features. These fused bird's-eye features are then input into a map construction task-related module, and a vectorized map is obtained through vectorization processing. Specifically, the method includes:
[0023] All bird's-eye view visual features are input into the feature fusion module for feature fusion. The visual features of each vehicle are stitched together in a new dimension to obtain the crowdsourced visual features.
[0024] Max pooling is performed on the last dimension of the crowdsourced visual features to obtain the fused bird's-eye view features.
[0025] The fused bird's-eye view features are input into the map feature segmentation module and the feature height prediction module respectively to obtain the map feature segmentation results and the corresponding heights of the map features from the bird's-eye view. The map feature segmentation results are then vectorized to obtain a vectorized map.
[0026] According to the present invention, an online method for constructing vector maps for autonomous driving based on crowdsourced visual images is provided. The map feature segmentation module includes a series of convolutional neural network layers, batch normalization layers, and nonlinear activation layers, and produces a (K+1)-dimensional segmentation heatmap, where K is the number of map feature types. Each of the first K dimensions of the heatmap represents the segmentation result of one type of map feature in the local map under the BEV coordinate system, and the last dimension represents the background. For each dimension of the heatmap, the binary cross-entropy loss function is used for semantic segmentation supervision.
[0027] The feature height prediction module includes a convolutional neural network layer, a batch normalization layer, and a nonlinear activation layer, and produces a one-dimensional height map. The L1 distance is used to supervise the depth information in areas containing map features.
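The two supervision signals described in [0026] and [0027] can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the function names `bce_per_channel` and `masked_l1`, the use of raw logits as input, and the plain averaging are hypothetical choices, not details given in the patent.

```python
import numpy as np

def bce_per_channel(logits, targets, eps=1e-7):
    """Binary cross-entropy over a (K+1, H, W) heatmap.

    Each channel is treated as an independent binary segmentation
    problem; the loss is averaged over all channels and cells.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))          # sigmoid
    probs = np.clip(probs, eps, 1.0 - eps)         # avoid log(0)
    loss = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return float(loss.mean())

def masked_l1(pred, target, feature_mask):
    """L1 distance evaluated only where a map feature is present."""
    mask = np.asarray(feature_mask).astype(bool)
    if not mask.any():
        return 0.0                                  # nothing to supervise
    return float(np.abs(pred[mask] - target[mask]).mean())
```

Restricting the L1 term to cells that contain map features keeps empty BEV cells, which have no meaningful height, from dominating the loss.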
[0028] This invention also provides an online vector map construction system for autonomous driving based on crowdsourced visual images, the system comprising:
[0029] The map feature extraction module is used to acquire visual images of the current target vehicle and adjacent source vehicles respectively, and extract map feature detection results from the visual images through a preset map feature detection module.
[0030] The forward visual feature acquisition module is used to input the map element detection results into a preset feature encoder to extract the visual features of each vehicle from the forward view.
[0031] The relative pose transformation module is used to input the map feature detection results into a pre-trained relative pose estimation network to generate the relative pose transformation results between the target vehicle and each source vehicle.
[0032] The viewpoint transformation module is used to project the forward visual features of the target vehicle to the bird's-eye view through a preset viewpoint transformation module to obtain the first bird's-eye view visual features of the target vehicle. Based on the relative pose transformation result, the forward visual features of the source vehicle are projected to the bird's-eye view of the target vehicle to obtain the second bird's-eye view visual features of the source vehicle in the coordinates of the target vehicle.
[0033] The map generation module is used to input all bird's-eye view visual features into the feature fusion module for feature fusion to obtain fused bird's-eye view features. The fused bird's-eye view features are then input into the map building task-related modules, and after vectorization processing, a vectorized map is obtained.
[0034] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the online construction method for autonomous driving vector maps based on crowdsourced visual images as described above.
[0035] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the online construction method for autonomous driving vector maps based on crowdsourced visual images as described above.
[0036] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the online construction method for autonomous driving vector maps based on crowdsourced visual images as described above.
[0037] This invention provides an online vector map construction method for autonomous driving based on crowdsourced visual images. It constructs a high-precision local vector map of the vicinity of an autonomous vehicle. The system uses a pure vision-based approach, significantly reducing mapping costs. It employs an end-to-end mapping scheme, eliminating the need for excessive manual intervention and complex processing. It can automatically estimate the relative poses between crowdsourced visual data. During the training of the relative pose estimation module, additional laser point cloud data and ground truth pose data are added to apply depth-scale supervision and direct pose supervision, resulting in high-precision relative poses. Utilizing connected vehicle technology and crowdsourced visual data, it overcomes the limitations of single-vehicle perception, leading to more comprehensive and accurate map construction. Using binary mask images as system input and for transmission between connected vehicles reduces bandwidth requirements and lowers costs associated with visual images.
Attached Figure Description
[0038] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0039] Figure 1 is the first flowchart of the online method for constructing vector maps for autonomous driving based on crowdsourced visual images provided by the present invention;
[0040] Figure 2 is the second flowchart of the online vector map construction method for autonomous driving based on crowdsourced visual images provided by the present invention;
[0041] Figure 3 is the third flowchart of the online vector map construction method for autonomous driving based on crowdsourced visual images provided by the present invention;
[0042] Figure 4 is the fourth flowchart of the online vector map construction method for autonomous driving based on crowdsourced visual images provided by the present invention;
[0043] Figure 5 is the fifth flowchart of the online vector map construction method for autonomous driving based on crowdsourced visual images provided by the present invention;
[0044] Figure 6 is a schematic diagram of the module connections of the online vector map construction system for autonomous driving based on crowdsourced visual images provided by the present invention;
[0045] Figure 7 is a schematic diagram of the online vector map construction method based on crowdsourced visual data provided by the present invention;
[0046] Figure 8 is a flowchart of training the relative pose estimation network using absolute depth estimation, provided by the present invention;
[0047] Figure 9 is a schematic diagram of the network structure of the relative pose estimation module provided by the present invention;
[0048] Figure 10 shows the online crowdsourced mapping results of the crowdsourced mapping network provided by the present invention;
[0049] Figure 11 is a schematic diagram of the structure of the electronic device provided by the present invention.
[0050] Reference numerals:
[0051] 110: Map feature extraction module; 120: Forward visual feature acquisition module; 130: Relative pose transformation module; 140: Viewpoint transformation module; 150: Map generation module;
[0052] 1110: Processor; 1120: Communication interface; 1130: Memory; 1140: Communication bus.
Detailed Implementation
[0053] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0054] The online method for constructing vector maps for autonomous driving based on crowdsourced visual images of this invention is described below with reference to Figures 1-5. The method comprises:
[0055] S100: Acquire visual images of the current target vehicle and adjacent source vehicles respectively, and extract map element detection results from the visual images through a preset map element detection module;
[0056] S200. Input the map element detection results into a preset feature encoder to extract the visual features of each vehicle from the forward view.
[0057] S300. Input the map feature detection results into a pre-trained relative pose estimation network to generate the relative pose transformation results between the target vehicle and each source vehicle.
[0058] S400: Project the forward visual features of the target vehicle to the bird's-eye view through the preset view transformation module to obtain the first bird's-eye view visual features of the target vehicle. Based on the relative pose transformation result, project the forward visual features of the source vehicle to the bird's-eye view of the target vehicle to obtain the second bird's-eye view visual features of the source vehicle in the target vehicle coordinates.
[0059] S500: Input all bird's-eye view visual features into the feature fusion module for feature fusion to obtain fused bird's-eye features. Input the fused bird's-eye features into the map building task-related module and obtain a vectorized map after vectorization processing.
[0060] This invention presents a scheme for constructing high-precision vector maps online based on the visual perception results of autonomous vehicles. This scheme uses a neural network model to project the visual perception results from the forward-looking perspective onto a bird's-eye view (BEV) perspective, and then performs map element detection and vectorization to obtain a local vector map centered on the vehicle. Simultaneously, since the perception field of a single vehicle is limited during operation and cannot provide comprehensive environmental map information, this invention proposes to utilize crowdsourced visual data and deep learning technology to obtain the relative poses of each vehicle, efficiently fusing the perception results of multiple vehicles into a unified coordinate system. This overcomes the limitations of single-vehicle perception and constructs a more comprehensive high-precision map. The proposed mapping algorithm requires only visual images as input, without any other human intervention, and can perceive and construct local vector maps online. It is low-cost, rapidly updated, and can better serve autonomous vehicles, promoting the development of the autonomous driving industry.
[0061] The system acquires visual images of the current target vehicle and adjacent source vehicles, and extracts map feature detection results from these images using a pre-defined map feature detection module. Specifically, this includes:
[0062] S101. Obtain the visual image and current location information of the target vehicle captured by the monocular camera, and extract the map element detection results of the target vehicle from the visual image through the map element detection module.
[0063] S102. Obtain visual images and current location information of adjacent source vehicles captured by a monocular camera, and extract the map element detection results of the source vehicles from the visual images through the map element detection module.
[0064] S103. Upload the map element detection results of the target vehicle and the source vehicle to the cloud platform. Each vehicle can obtain the map element detection results of other vehicles from the cloud platform.
[0065] Referring to Figure 7, this invention requires each vehicle to transmit its map feature detection results (generally binary masks) and its approximate location (either a rough location obtained through satellite positioning or a region/scene ID obtained through a scene recognition algorithm) to the cloud platform. Using binary mask images as input significantly reduces the bandwidth required for data transmission and the memory required for data storage, making the method better suited to connected-vehicle applications. When using the online mapping function, the target vehicle sends a request to the cloud platform to obtain the map feature detection results of source vehicles located in the same region/scene (that is, at a similar location) for mapping, eliminating the need for other high-cost information such as prior maps and high-precision poses.
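The upload-and-request exchange with the cloud platform can be sketched as follows. This is a hypothetical in-memory stand-in, not the patent's platform: the class name `CloudPlatform`, its methods, and the bit-packing of masks with `np.packbits` are assumptions made to illustrate why binary masks are cheap to store and transmit.

```python
import numpy as np
from collections import defaultdict

class CloudPlatform:
    """Hypothetical in-memory stand-in for the cloud platform."""

    def __init__(self):
        # region_id -> {vehicle_id: (packed_bits, mask_shape)}
        self._by_region = defaultdict(dict)

    def upload(self, vehicle_id, region_id, mask):
        """Store a binary mask, packed to one bit per pixel."""
        mask = np.asarray(mask, dtype=np.uint8)
        packed = np.packbits(mask, axis=None)
        self._by_region[region_id][vehicle_id] = (packed, mask.shape)

    def request_sources(self, target_id, region_id):
        """Return the masks of all other vehicles in the same region."""
        results = {}
        for vid, (packed, shape) in self._by_region[region_id].items():
            if vid == target_id:
                continue                      # skip the requester itself
            n = shape[0] * shape[1]
            flat = np.unpackbits(packed)[:n]  # drop the padding bits
            results[vid] = flat.reshape(shape)
        return results
```

With this packing, an H x W mask occupies roughly H*W/8 bytes, compared with 3*H*W bytes for an uncompressed 8-bit RGB image of the same size.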
[0066] In this invention, the current autonomous vehicle $t$ (referred to as the target vehicle for ease of distinction) acquires a visual image $I_t$ through a monocular camera and obtains the map feature detection result through its built-in map feature detection module. Meanwhile, the same process runs on $N$ nearby autonomous vehicles $s_i$, $i \in [1, N]$ (referred to as source vehicles in this invention), each of which processes its visual image to obtain its map feature detection result. Because the map feature detection results are generally binary masks, they are smaller and less noisy than the original visual image files and place lower demands on data storage and transmission. This invention therefore transmits and shares only the map feature detection results between the target vehicle and the source vehicles.
[0067] The map element detection results are input into a preset feature encoder to extract visual features from the forward-facing view of each vehicle, specifically including:
[0068] S201. Input the map feature detection results of the target vehicle and the map feature detection results of the source vehicle into the feature encoder;
[0069] S202. Visual features from the forward view of the target vehicle and visual features from the forward view of the source vehicle are extracted using the feature encoder.
[0070] In this invention, once the target vehicle has obtained the map feature detection results of both the target vehicle and the source vehicles, it inputs them into the feature encoder to extract the visual features from the forward-facing view of each vehicle.
[0071] The map feature detection results are input into a pre-trained relative pose estimation network to generate relative pose transformation results between the target vehicle and each source vehicle, specifically including:
[0072] S301. The relative pose estimation network takes the map feature detection results as input and generates a transformation matrix between the two coordinate systems of the target vehicle and the source vehicle.
[0073] S302. Input the visual image into the preset monocular depth estimation network to generate the corresponding depth map and complete the viewpoint conversion between visual images;
[0074] S303. The relative pose estimation network and the monocular depth estimation network are trained under joint supervision, including depth scale supervision from laser point cloud data, direct pose supervision from absolute pose data, appearance consistency supervision, and smoothness supervision.
[0075] Because the accuracy of relative pose estimation is crucial for crowdsourced data fusion, a relative pose estimation module must be trained first. This module takes the map feature detection results from two viewpoints as input and outputs the relative pose of the two viewpoint cameras. As shown in Figure 8, this invention follows the monodepth2 algorithm, training relative pose estimation using the appearance consistency of consecutive images, and adds supervision from the laser point cloud and the ground truth pose to obtain a relative pose with absolute scale.
[0076] The inputs are two consecutive visual images $I_0$ and $I_1$, the laser point cloud $L_0$ corresponding to $I_0$, and the ground truth relative pose of the two visual images. The map feature detection module is used to obtain binary map feature masks $M_0, M_1 \in \{0, 1\}$ from $I_0$ and $I_1$, where 1 represents a map feature and 0 represents background. $I_0$ and $I_1$ are input into a monocular depth estimation network to obtain the corresponding depth maps $D_0$ and $D_1$. This invention does not restrict the monocular depth estimation network; any monocular depth estimation network can be applied. At the same time, $M_0$ and $M_1$ are input into the relative pose estimation module to obtain the relative pose $T_{0,1}$ of the cameras between $I_0$ and $I_1$, which is the transformation matrix between the two camera coordinate systems.
[0077] As shown in Figure 9, the relative pose estimation network is a two-input, one-output convolutional neural network model. When two binary mask images are input into the relative pose estimation module, they are processed by convolutional neural networks to obtain visual feature maps of the two masks. The two visual feature maps are then concatenated along the channel dimension and input into another convolutional neural network to obtain a fused feature map. After a pooling layer flattens the fused feature map, it is input into a fully connected layer to obtain two outputs: an attitude vector $q$ represented as a unit quaternion and a three-dimensional position vector $p$. The attitude vector $q$ is converted into a rotation matrix $R(q)$, which together with the position vector $p$ forms the transformation matrix:
[0078] $$T_{0,1} = \begin{bmatrix} R(q) & p \\ \mathbf{0}^{\top} & 1 \end{bmatrix}$$
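The conversion from the predicted unit quaternion and position vector to a 4x4 transformation matrix can be sketched as below. The (w, x, y, z) component order and the function names are assumptions for illustration; the patent does not state which quaternion convention the network uses.

```python
import numpy as np

def quaternion_to_rotation(q):
    """Rotation matrix from a quaternion q = (w, x, y, z).

    The quaternion is normalised first, so a slightly non-unit
    network output still yields a valid rotation.
    """
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def pose_to_matrix(q, p):
    """Homogeneous 4x4 transform built from attitude q and position p."""
    T = np.eye(4)
    T[:3, :3] = quaternion_to_rotation(q)
    T[:3, 3] = np.asarray(p, dtype=float)
    return T
```

For example, `pose_to_matrix((1, 0, 0, 0), (0, 0, 0))` gives the identity transform, which corresponds to two cameras sharing the same pose.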
[0079] The relative pose estimation module and the monocular depth estimation module are trained simultaneously using the monodepth2 algorithm. Given the camera intrinsic parameter K, based on the predicted depth D0 of image I0, a point in image I0 can be projected into image I1, resulting in two pixels p0 and p1 in images I0 and I1 corresponding to the same point in the real world. Based on the pixel-level correspondence between the two images obtained through these steps, image I1 is transformed into the viewpoint of image I0, resulting in the transformed image I′1. The relative pose estimation module and the monocular depth estimation module are jointly trained using appearance consistency supervision and smoothness supervision from the monodepth2 algorithm.
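The pixel correspondence used for appearance consistency can be written as a short sketch: a pixel of I0 is back-projected with its predicted depth, moved into the second camera frame, and projected again with the intrinsics K. The function name `reproject_pixel` and the direction of the transform (camera 0 to camera 1) are assumptions; the patent only states that the relative pose is the transformation between the two camera coordinate systems.

```python
import numpy as np

def reproject_pixel(p0, depth, K, T_0_to_1):
    """Map pixel p0 = (u, v) of image I0 to its location in image I1.

    depth     : predicted depth D0 at p0
    K         : 3x3 camera intrinsic matrix
    T_0_to_1  : 4x4 transform from camera-0 to camera-1 coordinates
                (assumed direction)
    """
    u, v = p0
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # back-project to a ray
    point_cam0 = depth * ray                          # 3D point in camera 0
    point_cam1 = T_0_to_1[:3, :3] @ point_cam0 + T_0_to_1[:3, 3]
    proj = K @ point_cam1                             # project into image 1
    return proj[:2] / proj[2]
```

Applying this to every pixel gives the sampling grid with which I1 is warped into the viewpoint of I0; the photometric difference between the warped image and I0 then serves as the appearance consistency signal.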
[0080] However, the monodepth2 algorithm using a monocular camera has difficulty obtaining depth and pose information with absolute scale. Given that sparse, high-precision LiDAR data and absolute pose data can be obtained through a professional data acquisition vehicle during the training phase, depth scale supervision from LiDAR point cloud data and direct pose supervision from absolute pose data are added to the monodepth2 algorithm.
[0081] For depth scale supervision, the laser point cloud data acquired simultaneously with image $I_0$ is projected onto the image plane of $I_0$ to obtain the ground truth depth map. The L1 distance to the predicted depth $D_0$ is calculated only at points where the ground truth depth map has a non-zero depth, and is used as the depth scale supervision. In addition, to further improve the estimation accuracy of the relative pose estimation module, direct supervision of the relative pose is added. The ground truth relative pose of the two images can be obtained using algorithms such as visual odometry or relocalization, and the difference between it and the predicted relative pose $T_{0,1}$ is computed to apply direct pose supervision.
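A minimal sketch of the depth scale supervision follows. It assumes the LiDAR points have already been transformed into the camera frame, rounds projections to the nearest pixel, and keeps the nearest return when several points land on the same pixel; these choices and the function names are illustrative assumptions rather than details from the patent.

```python
import numpy as np

def lidar_to_depth_map(points_cam, K, height, width):
    """Project LiDAR points (N, 3), given in the camera frame, onto the image plane.

    Pixels without any LiDAR return keep a depth of 0.
    """
    pts = points_cam[points_cam[:, 2] > 0]            # keep points in front of the camera
    proj = (K @ pts.T).T
    u = np.rint(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.rint(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.full((height, width), np.inf)
    # keep the smallest depth when several points hit one pixel
    np.minimum.at(depth, (v[inside], u[inside]), pts[inside, 2])
    depth[np.isinf(depth)] = 0.0
    return depth

def depth_scale_loss(pred_depth, gt_depth):
    """L1 distance computed only where the sparse ground truth is non-zero."""
    valid = gt_depth > 0
    if not valid.any():
        return 0.0
    return float(np.abs(pred_depth[valid] - gt_depth[valid]).mean())
```

Because the LiDAR depth map is sparse, most pixels contribute nothing to this term; it mainly anchors the metric scale of the depth and pose predictions.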
[0082] During the training phase of the relative pose estimation network, appearance consistency supervision, smoothness supervision, and direct pose supervision are applied throughout training on image sequences. Depth scale supervision is added whenever LiDAR data is available for the moment at which an image was captured. After training converges, the monocular depth estimation network is discarded and only the relative pose estimation module is retained.
[0083] The map element detection results of each target vehicle-source vehicle pair are input into the relative pose estimation module to obtain the relative pose transformation results between the target vehicle and each source vehicle. During view transformation, the forward visual features of the target vehicle are projected onto the BEV (bird's-eye view) viewpoint of the target vehicle to obtain its BEV visual features. Based on the relative pose transformation results, the forward visual features of each source vehicle $s_i$ are projected onto the BEV viewpoint of the target vehicle, giving the BEV visual features of the source vehicle in the target vehicle's coordinate system.
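To illustrate what it means to express a source vehicle's features in the target vehicle's BEV coordinates, the sketch below resamples a BEV grid with a planar rigid transform. This is a simplified stand-in, not the patent's view transformation: it assumes each vehicle already has a BEV feature grid centred on itself, reduces the relative pose to a 3x3 homogeneous 2D transform, and uses nearest-neighbour sampling. The function name and the grid convention are hypothetical.

```python
import numpy as np

def warp_bev_to_target(src_bev, T_src_to_tgt, res=1.0):
    """Resample a source BEV grid (C, H, W) into the target BEV frame.

    T_src_to_tgt : 3x3 homogeneous 2D transform (planar part of the relative pose)
    res          : metres per grid cell (assumed identical for both grids)
    Cells that fall outside the source grid are filled with 0.
    """
    C, H, W = src_bev.shape
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0             # grid centre = vehicle position
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    tgt_xy1 = np.stack(
        [(cols - cx) * res, (rows - cy) * res, np.ones((H, W))], axis=-1
    )
    # for every target cell, find where it lies in the source frame
    src_xy1 = tgt_xy1 @ np.linalg.inv(T_src_to_tgt).T
    src_c = np.rint(src_xy1[..., 0] / res + cx).astype(int)
    src_r = np.rint(src_xy1[..., 1] / res + cy).astype(int)
    valid = (src_r >= 0) & (src_r < H) & (src_c >= 0) & (src_c < W)
    out = np.zeros_like(src_bev)
    out[:, valid] = src_bev[:, src_r[valid], src_c[valid]]
    return out
```

Iterating over target cells and looking up the source grid (inverse warping) avoids holes in the output, which a forward scatter of source cells would leave.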
[0084] All bird's-eye view visual features are input into the feature fusion module for feature fusion, resulting in fused bird's-eye view features. These fused features are then input into the map building task-related modules, where they undergo vectorization processing to obtain a vectorized map. Specifically, this includes:
[0085] S401. Input all bird's-eye view visual features into the feature fusion module for feature fusion. The visual features of each vehicle are stitched together in a new dimension to obtain the crowdsourced visual features.
[0086] S402. Perform max pooling on the last dimension of the crowdsourced visual features to obtain the fused bird's-eye view features.
[0087] S403. Input the fused bird's-eye view features into the map feature segmentation module and the feature height prediction module respectively to obtain the map feature segmentation results and the corresponding heights of the map features from the bird's-eye view. Vectorize the map feature segmentation results to obtain a vectorized map.
[0088] In this invention, all BEV visual features are input into the feature fusion module to obtain the fused BEV features, which are then input into the map feature segmentation module and the feature height prediction module related to the map building task to obtain the map feature segmentation result $S_t$ from the BEV perspective and the corresponding height $H_t$ of the map features. Vectorizing $S_t$ yields the vectorized map.
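The fusion in steps S401 and S402 amounts to stacking the per-vehicle BEV features along a new last dimension and taking the maximum over it. A minimal NumPy sketch, with the hypothetical function name `fuse_bev_features` and an assumed (C, H, W) layout per vehicle:

```python
import numpy as np

def fuse_bev_features(bev_features):
    """Fuse a list of BEV feature maps, each of shape (C, H, W).

    The maps are stacked along a new last dimension, giving the
    crowdsourced features of shape (C, H, W, n_vehicles), and max
    pooling over that dimension returns a fused map of shape (C, H, W).
    """
    stacked = np.stack(bev_features, axis=-1)
    return stacked.max(axis=-1)
```

Max pooling over the vehicle dimension does not depend on the order of the vehicles and gives an output of the same shape for any number of source vehicles, which suits a setting where the set of nearby vehicles changes over time.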
[0089] Specifically, the training phase of the Crowdsource Mapping Network (CSM) is similar to that of HDMapNet, but the input of CSM is crowdsourced visual data, rather than panoramic image data; the output of CSM also includes the height of map features, so that the local map estimated by CSM is three-dimensional, rather than the two-dimensional BEV map of HDMapNet.
[0090] The target vehicle first obtains the map feature detection results M_t of both the source vehicles and the target vehicle, where H and W denote the height and width of the original image. The map feature detection results are input into the feature encoder module to extract the forward visual features of each vehicle, where H_fv and W_fv denote the height and width of the forward feature map. At the same time, the map feature detection results of each target vehicle-source vehicle pair are input into the relative pose estimation module to obtain the relative pose transformation T between the target vehicle and each source vehicle. During viewpoint transformation, the forward visual features are converted into BEV visual features by a neural network, where H_bv and W_bv denote the height and width of the BEV feature map. The Inverse Perspective Mapping (IPM) algorithm is used to transform the visual features from the camera image coordinate system to the BEV coordinate system. In the CSM, the forward visual features of each vehicle are projected onto the BEV of the target vehicle according to the relative pose T, giving the BEV visual features of each vehicle's crowdsourced visual data in the target vehicle's BEV coordinate system. Stitching the BEV visual features of all vehicles together in a new dimension yields the crowdsourced BEV visual features. For feature fusion, max pooling is applied over the last dimension of the crowdsourced BEV visual features to obtain the fused BEV features. Finally, the fused BEV features are input into the map feature segmentation module and the feature height prediction module related to the mapping task.
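The geometric part of this viewpoint transformation can be sketched as follows. The sketch covers only the coordinate mapping, not the neural network the source also uses, and it rests on several assumptions not stated in the source: a flat road, a planar (3x3) simplification of the relative pose T, a pinhole camera at the vehicle origin with zero pitch and roll, and the hypothetical names `se2` and `target_bev_to_source_pixels`.

```python
import numpy as np

def se2(dx, dy, yaw):
    """3x3 homogeneous matrix mapping source-frame points into the target frame."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, dx], [s, c, dy], [0.0, 0.0, 1.0]])

def target_bev_to_source_pixels(points_xy, T_target_from_source, K, cam_height):
    """Map ground points given in the target BEV frame to source-image pixels.

    points_xy: (N, 2) ground points, x forward and y left, in metres.
    K: 3x3 pinhole intrinsics. cam_height: camera height above the road in metres.
    """
    pts = np.asarray(points_xy, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    # Express the target-frame points in the source vehicle frame.
    src = (np.linalg.inv(T_target_from_source) @ homog.T).T[:, :2]
    # Vehicle frame (x fwd, y left, z up) -> camera frame (x right, y down, z fwd).
    cam = np.stack([-src[:, 1], np.full(len(src), cam_height), src[:, 0]], axis=1)
    uvw = (K @ cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]   # meaningful only for points in front of the camera

# Hypothetical setup: the source vehicle drives 5 m ahead of the target, same heading.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
T = se2(5.0, 0.0, 0.0)
pixels = target_bev_to_source_pixels([[15.0, 0.0]], T, K, cam_height=1.5)
```

Sampling the source vehicle's forward features at the returned pixel locations, for every cell of the target BEV grid, would place those features in the target vehicle's BEV coordinate system.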
[0091] The map feature segmentation module is composed of a series of convolutional neural network layers, batch normalization layers, and nonlinear activation layers, and outputs a K+1 dimensional segmentation heatmap, where K is the number of map feature types. Each of the first K dimensions of the heatmap represents the segmentation result of one type of map feature in the local map under the BEV coordinate system, and the last dimension represents the background. For each dimension of the heatmap, semantic segmentation is supervised using the binary cross-entropy loss function. The feature height prediction module is likewise composed of convolutional neural network layers, batch normalization layers, and nonlinear activation layers, and outputs a one-dimensional height map. Since local vector maps only concern the height of map features, L1 distance is used to supervise the depth information only in areas containing map features.
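The two supervision terms described above can be written down in a few lines. This is a NumPy sketch of the loss formulas only, not of the training code; the function names, the averaging over spatial cells, and the clipping constant `eps` are assumptions.

```python
import numpy as np

def bce_per_channel(prob, target, eps=1e-7):
    """Binary cross-entropy averaged over space, one value per heatmap channel.

    prob, target: arrays of shape (K + 1, H_bv, W_bv); prob holds probabilities,
    target holds 0/1 labels.
    """
    p = np.clip(prob, eps, 1.0 - eps)
    loss = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return loss.mean(axis=(1, 2))

def masked_l1(pred_height, gt_height, feature_mask):
    """L1 distance evaluated only on cells that contain a map feature."""
    mask = np.asarray(feature_mask).astype(bool)
    if not mask.any():
        return 0.0
    return float(np.abs(pred_height - gt_height)[mask].mean())

# Tiny example: K = 1 feature type plus background on a 1 x 2 BEV grid.
prob = np.full((2, 1, 2), 0.5)
labels = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
seg_loss = bce_per_channel(prob, labels)
height_loss = masked_l1(np.array([[1.0, 5.0]]), np.array([[0.0, 0.0]]), np.array([[1, 0]]))
```

In the example the large error in the second cell does not contribute to `height_loss`, because that cell contains no map feature.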
[0092] In a specific embodiment, data was collected in an autonomous driving demonstration zone, with a total trajectory length of 20 kilometers, including multiple visual data sessions. Each data session was conducted under different visual conditions. In some streets with relatively ideal structure and texture, LiDAR and high-precision satellite and inertial navigation equipment were used to acquire LiDAR data and high-precision pose data for training the relative pose estimation module. A high-precision vector map was constructed using traditional visual mapping schemes. For each image, the vector map of the vicinity of that image was extracted, and the local vector map was transformed to the BEV coordinates of the current image based on the high-precision pose, serving as the data source for training the overall model.
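The step of transforming the local vector map into the BEV coordinates of the current image can be sketched as an inverse rigid transform followed by a crop. The planar pose parameterization (position plus yaw), the BEV window limits, and the function names are assumptions for illustration.

```python
import numpy as np

def world_to_vehicle(polyline_xy, pose_xy, yaw):
    """Express world-frame polyline vertices in the vehicle (BEV) frame.

    pose_xy: vehicle position in the world frame.
    yaw: heading in radians, counter-clockwise from the world x axis.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s], [s, c]])          # vehicle -> world rotation
    d = np.asarray(polyline_xy, dtype=float) - np.asarray(pose_xy, dtype=float)
    # p_vehicle = R^T (p_world - t); for row vectors this is d @ R.
    return d @ R

def crop_to_bev(points_xy, x_range, y_range):
    """Keep only vertices that fall inside the local BEV window."""
    x, y = points_xy[:, 0], points_xy[:, 1]
    keep = (x >= x_range[0]) & (x <= x_range[1]) & (y >= y_range[0]) & (y <= y_range[1])
    return points_xy[keep]

# Hypothetical lane vertices in the world frame; vehicle at (10, 0) heading along +y.
lane = [[10.0, 5.0], [10.0, 100.0]]
local = crop_to_bev(world_to_vehicle(lane, (10.0, 0.0), np.pi / 2), (0.0, 30.0), (-15.0, 15.0))
```

Here the first vertex ends up 5 m straight ahead of the vehicle and is kept, while the second lies 100 m ahead and falls outside the assumed 30 m window.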
[0093] To verify the effectiveness of the invention, validation was conducted on the collected dataset. The mapping capability of CSM was tested on streets outside the network training data, using different amounts of crowdsourced data. The experimental results are shown in Figure 10. In the ground truth, the target vehicle is located at the center of the image, and the position of each vehicle is marked with a white dot.
[0094] As can be seen, when the amount of crowdsourced data is small (N=4), the constructed local map is incomplete due to the limited field of vision of each vehicle. As the amount of crowdsourced data increases, the data complements each other, and the constructed local map gradually becomes more complete. The same trend is observed in the height prediction of map features; as the amount of crowdsourced data increases, the height prediction of map features becomes more complete and accurate.
[0095] This invention provides an online vector map construction method for autonomous driving based on crowdsourced visual images, which constructs a high-precision local vector map near autonomous vehicles. The system uses a pure vision-based approach, significantly reducing mapping costs. It employs an end-to-end mapping scheme, eliminating the need for excessive manual intervention and complex processing. It can automatically estimate the relative poses between crowdsourced visual data. During the training of the relative pose estimation module, additional laser point cloud data and ground truth pose data are added to apply depth-scale supervision and direct pose supervision, resulting in high-precision relative poses. Utilizing connected vehicle technology and crowdsourced visual data, it overcomes the limitations of single-vehicle perception, leading to more comprehensive and accurate map construction. Using binary mask images as system input and for transmission between connected vehicles reduces bandwidth requirements and lowers costs associated with visual images.
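The bandwidth argument for binary mask images can be made concrete with a rough size comparison. The sketch compares raw, uncompressed sizes only; the image resolution, the one-bit-per-pixel packing, and the function names are assumptions, since the source does not describe the actual encoding used for transmission.

```python
import numpy as np

def pack_mask(mask):
    """Pack a boolean (H, W) mask into bytes, 8 pixels per byte."""
    return np.packbits(mask.astype(np.uint8), axis=None).tobytes(), mask.shape

def unpack_mask(payload, shape):
    """Inverse of pack_mask."""
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    return bits[: shape[0] * shape[1]].reshape(shape).astype(bool)

mask = np.zeros((480, 640), dtype=bool)
mask[300:, 310:330] = True              # a hypothetical lane marking
payload, shape = pack_mask(mask)
rgb_bytes = 480 * 640 * 3               # the same frame as uncompressed 8-bit RGB
ratio = rgb_bytes / len(payload)
```

Under these assumptions a single-class mask is 24 times smaller than the raw RGB frame, before any further compression of either representation.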
[0096] Referring to Figure 6, the present invention also discloses an online vector map construction system for autonomous driving based on crowdsourced visual images, the system comprising:
[0097] The map element extraction module 110 is used to acquire visual images collected by the current target vehicle and adjacent source vehicles respectively, and extract map element detection results from the visual images through a preset map element detection module.
[0098] The forward visual feature acquisition module 120 is used to input the map element detection results into a preset feature encoder to extract the visual features of each vehicle from the forward view.
[0099] The relative pose transformation module 130 is used to input the map feature detection results into a pre-trained relative pose estimation network to generate relative pose transformation results between the target vehicle and each source vehicle.
[0100] The viewpoint conversion module 140 is used to project the forward visual features of the target vehicle to the bird's-eye view through a preset viewpoint conversion module to obtain the first bird's-eye view visual features of the target vehicle, and to project the forward visual features of the source vehicle to the bird's-eye view of the target vehicle according to the relative pose transformation result to obtain the second bird's-eye view visual features of the source vehicle in the coordinates of the target vehicle.
[0101] The map generation module 150 is used to input all bird's-eye view visual features into the feature fusion module for feature fusion to obtain fused bird's-eye features. The fused bird's-eye features are then input into the map building task-related modules and vectorized to obtain a vectorized map.
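The data flow between modules 110 to 150 can be summarized as a small orchestration function. This is a structural sketch only: every callable is a placeholder supplied by the caller, and the function name and argument names are hypothetical.

```python
from typing import Callable, Sequence

def build_local_map(
    images: Sequence,              # images[0] is the target vehicle, the rest are source vehicles
    detect: Callable,              # module 110: image -> map element detection result
    encode: Callable,              # module 120: detection result -> forward visual features
    estimate_pose: Callable,       # module 130: (target detection, source detection) -> relative pose
    to_bev: Callable,              # module 140: (forward features, pose or None) -> BEV features
    fuse_and_vectorize: Callable,  # module 150: list of BEV features -> vectorized map
):
    detections = [detect(img) for img in images]
    forward = [encode(d) for d in detections]
    # The target vehicle needs no relative pose; each source vehicle gets one.
    poses = [None] + [estimate_pose(detections[0], d) for d in detections[1:]]
    bev = [to_bev(f, p) for f, p in zip(forward, poses)]
    return fuse_and_vectorize(bev)

# String stand-ins make the wiring visible without any real networks.
result = build_local_map(
    ["t", "s1"],
    detect=str.upper,
    encode=lambda d: d + "_fv",
    estimate_pose=lambda a, b: f"{a}->{b}",
    to_bev=lambda f, p: (f, p),
    fuse_and_vectorize=list,
)
```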
[0102] Among them, the map element extraction module 110 acquires the visual image and current location information of the current target vehicle collected by the monocular camera, and extracts the map element detection results of the target vehicle from the visual image through the map element detection module.
[0103] The system acquires visual images and current location information of neighboring vehicles captured by a monocular camera, and extracts the map element detection results of the source vehicles from the visual images using a map element detection module.
[0104] The map feature detection results of the target vehicle and the source vehicle are both uploaded to the cloud platform, and each vehicle can obtain the map feature detection results of other vehicles from the cloud platform.
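The upload and retrieval pattern described here can be sketched with an in-memory stand-in for the cloud platform. The class `CloudPlatform`, its methods, and the radius-based notion of "neighboring" are all hypothetical; the source only states that detection results and location information are shared through a cloud platform.

```python
from dataclasses import dataclass, field
from math import hypot

@dataclass
class CloudPlatform:
    """In-memory stand-in for the cloud platform (hypothetical interface)."""
    _store: dict = field(default_factory=dict)

    def upload(self, vehicle_id, position_xy, detection):
        # Each vehicle publishes its latest detection result and location.
        self._store[vehicle_id] = (position_xy, detection)

    def neighbours(self, vehicle_id, radius_m):
        """Return detections of other vehicles within radius_m of vehicle_id."""
        (x0, y0), _ = self._store[vehicle_id]
        return {
            vid: det
            for vid, ((x, y), det) in self._store.items()
            if vid != vehicle_id and hypot(x - x0, y - y0) <= radius_m
        }

cloud = CloudPlatform()
cloud.upload("target", (0.0, 0.0), "mask_target")
cloud.upload("src_a", (10.0, 0.0), "mask_a")
cloud.upload("src_b", (500.0, 0.0), "mask_b")
near = cloud.neighbours("target", radius_m=50.0)
```

With the assumed 50 m radius, only `src_a` is returned as a source vehicle for `target`.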
[0105] The forward-looking visual feature acquisition module 120 inputs the map feature detection results of the target vehicle and the map feature detection results of the source vehicle into the feature encoder.
[0106] The feature encoder extracts the visual features from the target vehicle's forward view and the source vehicle's forward view.
[0107] In the relative pose transformation module 130, the relative pose estimation network takes the map feature detection results as input and generates a transformation matrix between the coordinate systems of the target vehicle and the source vehicle;
[0108] The visual image is input into a preset monocular depth estimation network to generate a corresponding depth map, thus completing the viewpoint conversion between visual images;
[0109] The relative pose estimation network and the monocular depth estimation network are trained by introducing joint supervision from laser point cloud data, including depth scale supervision from absolute pose data, appearance consistency supervision, and smoothness supervision.
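For the pose supervision, a reference relative pose between two vehicles can be derived from their absolute poses. The sketch below shows that derivation and a simple translation error; the 4x4 planar-heading parameterization and the function names are assumptions, and the source does not state which error measure is used during training.

```python
import numpy as np

def pose_matrix(x, y, z, yaw):
    """4x4 world-from-vehicle transform for a vehicle with planar heading yaw."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = [x, y, z]
    return T

def relative_pose(T_world_target, T_world_source):
    """Transform taking source-frame coordinates into the target frame."""
    return np.linalg.inv(T_world_target) @ T_world_source

def translation_error(T_pred, T_ref):
    """Euclidean distance between predicted and reference translations."""
    return float(np.linalg.norm(T_pred[:3, 3] - T_ref[:3, 3]))

# Hypothetical poses: both vehicles head along world +y, the source is 8 m ahead.
T_ref = relative_pose(pose_matrix(0.0, 0.0, 0.0, np.pi / 2),
                      pose_matrix(0.0, 8.0, 0.0, np.pi / 2))
err = translation_error(np.eye(4), T_ref)
```

In the target frame the source vehicle sits 8 m along the forward axis with no relative rotation, so an identity prediction is off by 8 m.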
[0110] The map generation module 150 inputs all bird's-eye view visual features into the feature fusion module for feature fusion, stitching together the visual features of each vehicle in a new dimension to obtain crowdsourced visual features.
[0111] Max pooling is performed on the last dimension of the crowdsourced visual features to obtain the fused bird's-eye view features.
[0112] The fused bird's-eye view features are input into the map feature segmentation module and the feature height prediction module respectively to obtain the map feature segmentation results and the corresponding heights of the map features from the bird's-eye view. The map feature segmentation results are then vectorized to obtain a vectorized map.
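The source does not say how the segmentation results are vectorized. As one deliberately simple possibility, the sketch below turns a single-class mask into a 3D polyline by taking one vertex per occupied BEV row and attaching the predicted height. It is only suitable for thin features that run roughly along the row axis, and the function name and cell size are assumptions.

```python
import numpy as np

def vectorize_class(seg_mask, height_map, cell_size_m):
    """Turn one class mask into a 3D polyline, one vertex per occupied BEV row.

    seg_mask: boolean (H_bv, W_bv) mask for a single map feature type.
    height_map: float (H_bv, W_bv) predicted heights.
    Returns an (M, 3) array of (row_m, col_m, height) vertices.
    """
    vertices = []
    for r in range(seg_mask.shape[0]):
        cols = np.flatnonzero(seg_mask[r])
        if cols.size == 0:
            continue                      # feature not present in this row
        c = cols.mean()                   # centre of the feature in this row
        h = height_map[r, cols].mean()    # mean predicted height over those cells
        vertices.append((r * cell_size_m, c * cell_size_m, h))
    return np.array(vertices, dtype=float).reshape(-1, 3)

mask = np.zeros((3, 5), dtype=bool)
mask[0, 1:4] = True
mask[2, 4] = True
heights = np.full((3, 5), 0.2)
polyline = vectorize_class(mask, heights, cell_size_m=0.5)
```

Because the height map is sampled at the same cells as the mask, each vertex carries a height, which is what makes the resulting local map three-dimensional.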
[0113] Traditional high-precision mapping schemes focus on building global high-precision vector maps, while this invention focuses on building local high-precision vector maps near autonomous vehicles. This system uses a pure vision-based approach, significantly reducing mapping costs. It employs an end-to-end mapping scheme, eliminating the need for excessive manual intervention and complex processing. This system can automatically estimate the relative poses between crowdsourced visual data. During the training of the relative pose estimation module, additional laser point cloud data and ground truth pose data are added to apply depth-scale supervision and direct pose supervision, resulting in high-precision relative poses. This system utilizes connected vehicle technology and crowdsourced visual data, breaking through the limitations of single-vehicle perception, resulting in more comprehensive and accurate map construction. This system uses binary mask images as system input and for transmission between connected vehicles, thus requiring lower transmission bandwidth.
[0114] Figure 11 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 11, the electronic device may include a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, wherein the processor 1110, the communication interface 1120, and the memory 1130 communicate with each other through the communication bus 1140. The processor 1110 can call logical instructions in the memory 1130 to execute an online construction method for autonomous driving vector maps based on crowdsourced visual images. The method includes: acquiring visual images collected by the current target vehicle and adjacent source vehicles, respectively, and extracting map feature detection results from the visual images through a preset map feature detection module.
[0115] The map element detection results are input into a preset feature encoder to extract the visual features of each vehicle from its forward view.
[0116] The map feature detection results are input into a pre-trained relative pose estimation network to generate the relative pose transformation results between the target vehicle and each source vehicle.
[0117] The forward visual features of the target vehicle are projected to the bird's-eye view through the preset view transformation module to obtain the first bird's-eye view visual features of the target vehicle. Based on the relative pose transformation result, the forward visual features of the source vehicle are projected to the bird's-eye view of the target vehicle to obtain the second bird's-eye view visual features of the source vehicle in the coordinates of the target vehicle.
[0118] All bird's-eye view visual features are input into the feature fusion module for feature fusion to obtain fused bird's-eye view features. The fused bird's-eye view features are then input into the map building task-related modules and vectorized to obtain a vectorized map.
[0119] Furthermore, the logical instructions in the aforementioned memory 1130 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0120] In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the online construction method for autonomous driving vector maps based on crowdsourced visual images provided by the above methods, the method including: acquiring visual images collected by the current target vehicle and adjacent source vehicles respectively, and extracting map element detection results from the visual images through a preset map element detection module;
[0121] The map element detection results are input into a preset feature encoder to extract the visual features of each vehicle from its forward view.
[0122] The map feature detection results are input into a pre-trained relative pose estimation network to generate the relative pose transformation results between the target vehicle and each source vehicle.
[0123] The forward visual features of the target vehicle are projected to the bird's-eye view through the preset view transformation module to obtain the first bird's-eye view visual features of the target vehicle. Based on the relative pose transformation result, the forward visual features of the source vehicle are projected to the bird's-eye view of the target vehicle to obtain the second bird's-eye view visual features of the source vehicle in the coordinates of the target vehicle.
[0124] All bird's-eye view visual features are input into the feature fusion module for feature fusion to obtain fused bird's-eye view features. The fused bird's-eye view features are then input into the map building task-related modules and vectorized to obtain a vectorized map.
[0125] In another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements an online construction method for autonomous driving vector maps based on crowdsourced visual images provided by the above methods. The method includes: acquiring visual images collected by the current target vehicle and adjacent source vehicles respectively, and extracting map element detection results from the visual images through a preset map element detection module.
[0126] The map element detection results are input into a preset feature encoder to extract the visual features of each vehicle from its forward view.
[0127] The map feature detection results are input into a pre-trained relative pose estimation network to generate the relative pose transformation results between the target vehicle and each source vehicle.
[0128] The forward visual features of the target vehicle are projected to the bird's-eye view through the preset view transformation module to obtain the first bird's-eye view visual features of the target vehicle. Based on the relative pose transformation result, the forward visual features of the source vehicle are projected to the bird's-eye view of the target vehicle to obtain the second bird's-eye view visual features of the source vehicle in the coordinates of the target vehicle.
[0129] All bird's-eye view visual features are input into the feature fusion module for feature fusion to obtain fused bird's-eye view features. The fused bird's-eye view features are then input into the map building task-related modules and vectorized to obtain a vectorized map.
[0130] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0131] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0132] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for online construction of vector maps for autonomous driving based on crowdsourced visual images, characterized in that the method includes: The system acquires visual images of the current target vehicle and adjacent source vehicles, and extracts map element detection results from the visual images using a preset map element detection module. The map element detection results are input into a preset feature encoder to extract the visual features of each vehicle from its forward view. The map feature detection results are input into a pre-trained relative pose estimation network to generate the relative pose transformation results between the target vehicle and each source vehicle. The forward visual features of the target vehicle are projected to the bird's-eye view through the preset view transformation module to obtain the first bird's-eye view visual features of the target vehicle. Based on the relative pose transformation result, the forward visual features of the source vehicle are projected to the bird's-eye view of the target vehicle to obtain the second bird's-eye view visual features of the source vehicle in the coordinates of the target vehicle. All bird's-eye view visual features are input into the feature fusion module for feature fusion to obtain fused bird's-eye features. The fused bird's-eye features are then input into the map building task-related modules and vectorized to obtain a vectorized map. The step of inputting the map feature detection results into a pre-trained relative pose estimation network to generate relative pose transformation results between the target vehicle and each source vehicle specifically includes: The relative pose estimation network takes the map feature detection results as input and generates a transformation matrix between the two coordinate systems of the target vehicle and the source vehicle.
The visual image is input into a preset monocular depth estimation network to generate a corresponding depth map, thus completing the viewpoint conversion between visual images; The relative pose estimation network and the monocular depth estimation network are trained by introducing joint supervision from laser point cloud data, including depth scale supervision from absolute pose data, appearance consistency supervision, and smoothness supervision.
2. The online construction method for autonomous driving vector maps based on crowdsourced visual images according to claim 1, characterized in that, The process of acquiring visual images of the current target vehicle and adjacent source vehicles, and extracting map element detection results from the visual images using a preset map element detection module, specifically includes: The system acquires the visual image and current location information of the target vehicle captured by a monocular camera, and extracts the map element detection results of the target vehicle from the visual image through the map element detection module. The system acquires visual images and current location information of neighboring vehicles captured by a monocular camera, and extracts the map element detection results of the source vehicles from the visual images using a map element detection module. The map feature detection results of the target vehicle and the source vehicle are both uploaded to the cloud platform, and each vehicle can obtain the map feature detection results of other vehicles from the cloud platform.
3. The online construction method for autonomous driving vector maps based on crowdsourced visual images according to claim 1, characterized in that, The map element detection results are input into a preset feature encoder to extract visual features from the forward-facing view of each vehicle, specifically including: The map feature detection results of both the target vehicle and the source vehicle are input into the feature encoder; The feature encoder extracts the visual features from the target vehicle's forward view and the source vehicle's forward view.
4. The online construction method for autonomous driving vector maps based on crowdsourced visual images according to claim 1, characterized in that, All bird's-eye view visual features are input into the feature fusion module for feature fusion, resulting in fused bird's-eye view features. These fused features are then input into the map building task-related modules, where they undergo vectorization processing to obtain a vectorized map. Specifically, this includes: All bird's-eye view visual features are input into the feature fusion module for feature fusion. The visual features of each vehicle are stitched together in a new dimension to obtain the crowdsourced visual features. Max pooling is performed on the last dimension of the crowdsourced visual features to obtain the fused bird's-eye view features. The fused bird's-eye view features are input into the map feature segmentation module and the feature height prediction module respectively to obtain the map feature segmentation results and the corresponding heights of the map features from the bird's-eye view. The map feature segmentation results are then vectorized to obtain a vectorized map.
5. The online construction method for autonomous driving vector maps based on crowdsourced visual images according to claim 4, characterized in that, The map feature segmentation module includes a series of convolutional neural network layers, batch normalization layers, and nonlinear activation layers to obtain a K+1 dimensional segmentation heatmap, where K is the number of map feature types. Each of the first K dimensions of the heatmap represents the segmentation result of one type of map feature in the local map under the BEV coordinate system, and the last dimension represents the background. For each dimension of the heatmap, the binary cross-entropy loss function is used for semantic segmentation supervision. The feature height prediction module includes convolutional neural network layers, batch normalization layers, and nonlinear activation layers to obtain a one-dimensional height map. L1 distance is used to supervise depth information in areas containing map features.
6. An online vector map construction system for autonomous driving based on crowdsourced visual images, characterized in that, The system includes: The map feature extraction module is used to acquire visual images of the current target vehicle and adjacent source vehicles respectively, and extract map feature detection results from the visual images through a preset map feature detection module. The forward visual feature acquisition module is used to input the map element detection results into a preset feature encoder to extract the visual features of each vehicle from the forward view. The relative pose transformation module is used to input the map feature detection results into a pre-trained relative pose estimation network to generate the relative pose transformation results between the target vehicle and each source vehicle. The viewpoint transformation module is used to project the forward visual features of the target vehicle to the bird's-eye view through a preset viewpoint transformation module to obtain the first bird's-eye view visual features of the target vehicle. Based on the relative pose transformation result, the forward visual features of the source vehicle are projected to the bird's-eye view of the target vehicle to obtain the second bird's-eye view visual features of the source vehicle in the coordinates of the target vehicle. The map generation module is used to input all bird's-eye view visual features into the feature fusion module for feature fusion to obtain fused bird's-eye features. The fused bird's-eye features are then input into the map building task-related modules, and after vectorization processing, a vectorized map is obtained. In the relative pose transformation module, the relative pose estimation network takes the map feature detection results as input and generates a transformation matrix between the coordinate systems of the target vehicle and the source vehicle.
The visual image is input into a preset monocular depth estimation network to generate a corresponding depth map, thus completing the viewpoint conversion between visual images; The relative pose estimation network and the monocular depth estimation network are trained by introducing joint supervision from laser point cloud data, including depth scale supervision from absolute pose data, appearance consistency supervision, and smoothness supervision.
7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the online construction method for autonomous driving vector maps based on crowdsourced visual images as described in any one of claims 1 to 5.
8. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the online construction method for autonomous driving vector maps based on crowdsourced visual images as described in any one of claims 1 to 5.
9. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by the processor, it implements the online construction method for autonomous driving vector maps based on crowdsourced visual images as described in any one of claims 1 to 5.